Grand Central Dispatch (simple loop optimization)

5 messages · Jan de Leeuw, Simon Urbanek

#
a) Obviously OpenMP is more portable. Even on a Mac I had to use
    Apple's gcc in this case (I normally use the GNU gcc-trunk).

b) GCD does not require specifying the number of threads -- it  
determines it at runtime.

c) Coding is simpler.

d) Since GCD is at a lower OS level than OpenMP, it will probably
    handle resource allocation better. But my small example, on an
    otherwise idle Mac Pro (16 cores, 32 GB of RAM), does not really
    highlight that.

e) For more info, and some OpenMP comparisons, see

    http://www.macresearch.org/cocoa-scientists-xxxi-all-aboard-grand-central
    http://arstechnica.com/apple/reviews/2009/08/mac-os-x-10-6.ars/12

To quote Siracusa:

"Write your application as usual, but if there's any part of its
operation that can reasonably be expected to take more than a few
seconds to complete, then for the love of Zarzycki, get it off the
main thread!"
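
For readers who have not seen GCD code, here is a minimal sketch of points b) and c), not the code from this thread. It uses dispatch_apply_f, the function-pointer variant of dispatch_apply, so no blocks support is needed. The names struct job, job_iteration and gcd_fill, and the workload itself, are hypothetical. Note that no thread count appears anywhere; GCD chooses it at runtime.

```c
#include <assert.h>
#include <stddef.h>
#ifdef __APPLE__
#include <dispatch/dispatch.h>
#endif

/* Hypothetical context passed to every iteration. */
struct job {
    double *x;  /* output vector */
    int m;      /* inner-loop length per element */
};

/* Hypothetical workload for element i (a stand-in for the real loop body). */
static void job_iteration(void *ctx, size_t i)
{
    struct job *jb = ctx;
    double s = 0.0;
    for (int j = 1; j <= jb->m; j++)
        s += 1.0 / (double)(i + (size_t)j);
    jb->x[i] = s;
}

/* Fill x[0..n-1]; GCD decides how many threads to use. */
void gcd_fill(double *x, size_t n, int m)
{
    struct job jb = { x, m };
    assert(x != NULL);
#ifdef __APPLE__
    dispatch_queue_t q =
        dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    /* Runs job_iteration(&jb, i) for i = 0..n-1 and waits for all of them. */
    dispatch_apply_f(n, q, &jb, job_iteration);
#else
    /* Assumption: without libdispatch, fall back to a serial loop. */
    for (size_t i = 0; i < n; i++)
        job_iteration(&jb, i);
#endif
}
```

A call such as gcd_fill(x, 100000, 1000) would mirror the problem size used in the timings, assuming x has room for 100000 doubles. On a non-Apple system the sketch only compiles to the serial fallback.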
On Sep 17, 2009, at 11:03 , Saptarshi Guha wrote:

            
===
Jan de Leeuw; Distinguished Professor and Chair, UCLA Department of  
Statistics;
Director: UCLA Center for Environmental Statistics (CES);
Editor: Journal of Multivariate Analysis, Journal of Statistical  
Software;
US mail: 8125 Math Sciences Bldg, Box 951554, Los Angeles, CA 90095-1554
phone (310)-825-9550;  fax (310)-206-5658;  email: deleeuw at stat.ucla.edu
.mac: jdeleeuw ++++++  aim: deleeuwjan ++++++ skype: j_deleeuw
homepages: http://gifi.stat.ucla.edu ++++++ http://www.cuddyvalley.org
   
-------------------------------------------------------------------------------------------------
           No matter where you go, there you are. --- Buckaroo Banzai
                    http://gifi.stat.ucla.edu/sounds/nomatter.au
#
Jan,

thanks for sharing this. This is really interesting. We have been
contemplating using GCD for R (mainly pnmath) but at the time OMP was
faster. However, GCD has apparently become really good in the meantime:

 > system.time(threads(100000,1000,"omp_try"))
    user  system elapsed
   9.671   0.009   2.441
 > system.time(threads(100000,1000,"gcd_try"))
    user  system elapsed
   9.592   0.004   2.410
 > system.time(threads(100000,1000,"dcg_try"))
    user  system elapsed
   9.784   0.003   9.788

[This is on Harpertown 2.66GHz quad core]

So GCD is surprisingly just a hair faster than OMP (also surprising to
me is that using more threads than cores makes OMP faster - the above
is with 16 threads).
On Sep 17, 2009, at 14:24 , Jan de Leeuw wrote:

            
I would not say that coding is simpler - OMP takes just one #pragma,
with no need to change your code, whereas GCD requires several special
function calls... However, OMP is more limited in the kind of things
you can do.

Cheers,
Simon
#
On Sep 17, 2009, at 15:20 , Simon Urbanek wrote:

            
Actually, with schedule(dynamic) the gap is almost at the level of the  
measurement error:

 > system.time(threads(100000,1000,"omp_try"))
    user  system elapsed
   9.614   0.006   2.420
 > system.time(threads(100000,1000,"gcd_try"))
    user  system elapsed
   9.586   0.005   2.409

-- the OMP line (to be placed before the for() loop) is 
#pragma omp parallel for default(shared) private(i) schedule(dynamic)
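
As an illustration of where that line goes, here is a minimal sketch, not the omp_try code itself. The functions omp_work, omp_fill and serial_fill and the workload are hypothetical. The pragma only takes effect when the compiler is given its OpenMP flag (-fopenmp for gcc); otherwise it is ignored and the loop runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-element workload: m rounds of arithmetic for element i. */
static double omp_work(int i, int m)
{
    double s = 0.0;
    for (int j = 1; j <= m; j++)
        s += 1.0 / (double)(i + j);
    return s;
}

/* Parallel version: the only change from the serial loop is the pragma. */
void omp_fill(double *x, int n, int m)
{
    int i;
    assert(x != NULL);
#pragma omp parallel for default(shared) private(i) schedule(dynamic)
    for (i = 0; i < n; i++)
        x[i] = omp_work(i, m);
}

/* Serial reference, identical apart from the missing pragma. */
void serial_fill(double *x, int n, int m)
{
    int i;
    assert(x != NULL);
    for (i = 0; i < n; i++)
        x[i] = omp_work(i, m);
}
```

Each iteration writes only its own x[i], so sharing x, n and m between threads is safe, and both versions produce the same vector.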

Cheers,
Simon
#
on my system (2 x 2.93 quad core Nehalem
with hyper-threading, so 16 threads max, 16GB RAM,
10.6.1, 64bit kernel, 64bit R)

 > system.time(threads(100000,1000,"omp"))
    user  system elapsed
  10.249   0.009   0.662
 > system.time(threads(100000,1000,"gcd"))
    user  system elapsed
  10.208   0.008   0.668
 > system.time(threads(100000,1000,"dcg"))
    user  system elapsed
   8.731   0.005   8.738

so omp == gcd, but for more complicated tasks the
tighter integration may favor gcd

comparing harpertown and nehalem --> surprising
difference (kernel ? hyper-threading ?)

i have no idea how the open-sourced gcd works on
non-mac hardware

code is downloadable using webdav from
public.me.com/jdeleeuw/software/threads
or using afp://gifi.stat.ucla.edu from
the deleeuw public directory
On Sep 17, 2009, at 12:35 , Simon Urbanek wrote:

            
#
Jan,
On Sep 17, 2009, at 16:16 , Jan de Leeuw wrote:

            
Interesting but consistent with my observations so far - Nehalems are  
not any faster than equally clocked Harpertowns (see dcg time). The  
only gains are in HT as seen in your example - my Harpertown has 4  
logical cpus, yours has 16. My 2.26GHz Nehalem is running Leopard  
(because it's the build machine ;)) but the results are similar:

 > system.time(threads(100000,1000,"omp_try"))
    user  system elapsed
  12.924   0.031   0.852
 > system.time(threads(100000,1000,"dcg_try"))
    user  system elapsed
  11.595   0.009  11.608

Again, the sequential time is about the same as on equally clocked  
Harpertown, but the HT helps with a factor of over 13. That explains  
where the alleged performance boost on Nehalems comes from ...
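
To make the arithmetic behind that factor explicit: speedup here is taken as sequential elapsed time divided by parallel elapsed time. A small sketch, with a hypothetical helper named speedup, using the elapsed times quoted in this thread:

```c
#include <assert.h>

/* Speedup = elapsed time of the sequential run / elapsed time of the
   parallel run (hypothetical helper, for illustration only). */
double speedup(double serial_elapsed, double parallel_elapsed)
{
    assert(parallel_elapsed > 0.0);
    return serial_elapsed / parallel_elapsed;
}
```

With the 2.26GHz Nehalem numbers, speedup(11.608, 0.852) is about 13.6; with the Harpertown numbers, speedup(9.788, 2.410) is about 4.1.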

It would be interesting to run OMP pnmath with schedule(dynamic) on an
8-core Nehalem and compare that with a stock R ... (pnmath will need a
bit of tweaking because it attempts to be too smart about the number of
threads). Clearly, on many short operations it may cause a hit, but
the gain on long vectors is up to 16 which is impressive ...

Cheers,
Simon