FAQ? Mac distributed/multiple processor solutions?

8 messages · Rob Forsyth, Richard Pearson, Kasper Daniel Hansen +4 more

#
I felt this must be an FAQ but I don't see it anywhere: my apologies  
if I've missed it.

I have what I believe is known as an "embarrassingly parallel"  
problem comprising a large number of repetitions of a single  
(lengthy) calculation that generates a boolean result and I am simply  
interested in the final proportion of true to false runs. This  
obviously lends itself to parallel computation and I'd appreciate Mac- 
specific pointers to both simple distributed and multiple-processor  
options here. I am working on an iMac G5 and could access (at home -  
i.e. not on a LAN) another G5 and a G4. I've come across the R/MPI  
package but would appreciate advice as to how easy this is to set up  
(would it actually be simpler to divide the job "manually"?).  
Alternatively I have an option to acquire a MacPro for this work and  
would appreciate guidance as to whether it's possible to leverage  
multiple processors?  I'm aware R itself is not currently  
multithreaded (whilst having only a lay understanding of what that  
means).
#
Hi Rob

You might want to look at the snow package. This can be used either with 
Rmpi or without (using socket connections). I've successfully used this 
to speed things up on multi-node clusters, and also on a single 
multi-core Mac. I've included some brief instructions on getting things 
working in chapter 6 of the user guide for my (Bioconductor) package 
puma - hope this helps!
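
For illustration, a minimal sketch of the socket-based route with snow on a single multi-core machine. The function one_run() is a hypothetical stand-in for the lengthy calculation, and the worker count, seeds and number of repetitions are arbitrary assumptions:

```r
library(snow)

# Hypothetical stand-in for the lengthy calculation; returns TRUE or FALSE.
one_run <- function(i) mean(rnorm(50)) > 0

# Start two local R worker processes over sockets (no MPI needed).
cl <- makeCluster(2, type = "SOCK")

# Give each worker its own seed (crude; this does not guarantee
# independent random-number streams).
clusterApply(cl, c(101, 202), set.seed)

# Spread 1000 repetitions over the workers and collect the logical results.
results <- parSapply(cl, 1:1000, one_run)

stopCluster(cl)

# Proportion of TRUE runs.
prop_true <- mean(results)
print(prop_true)
```

To use several machines, makeCluster() also accepts a character vector of host names in place of the worker count, provided R and snow are installed on each host and they can be reached over ssh.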

Richard.
Rob Forsyth wrote:
#
On Dec 4, 2007, at 12:52 AM, Rob Forsyth wrote:
One important thing here: while R is not multithreaded, on Mac OS X,  
R uses a special BLAS which is multithreaded. So anything involving  
linear algebra (which for some problems is a major part of the  
computational load), will benefit from having multiple CPUs in the  
same machine. Depending on your problem, this may indeed speed  
things up.
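
One rough way to check whether the BLAS in use is running on several cores is to time a large matrix product. This is only a sketch: the matrix size is arbitrary, and the interpretation assumes a Unix-like system where user CPU time is summed across threads.

```r
set.seed(1)
n <- 1000
m <- matrix(rnorm(n * n), n, n)

# crossprod(m) computes t(m) %*% m, which R hands to the BLAS.
timing <- system.time(xtx <- crossprod(m))
print(timing)
```

If the "user" figure is clearly larger than "elapsed", more than one core was probably busy during the call. Watching Activity Monitor while it runs is another check.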

Kasper
#
On Tue, 4 Dec 2007, Rob Forsyth wrote:
With only three computers it would be easiest to divide the job manually.

 	-thomas
#
For a one-time run this is true, but if you find yourself doing this  
(or similar things) often, it can be a nuisance to break it up every  
time. The snow package is fairly painless to install and works great.

I found, however, that if you have things (e.g. R, LAM/MPI, ...)  
installed outside of the default Mac OS X path, they won't run when  
you connect with ssh unless you add (or uncomment?)  
"PermitUserEnvironment yes" to /private/etc/sshd_config so that the  
modified path (set in .bash_profile) is used on the other OS X  
machines. Perhaps there is a better way, but that is how I got it  
working. I think this is the default setting on Leopard, though, so it  
would only be an issue for earlier versions.

Best,
Randy
On Dec 4, 2007, at 1:57 PM, Thomas Lumley wrote:

            
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Randall C Johnson
Bioinformatics Analyst
SAIC-Frederick, Inc (Contractor)
Laboratory of Genomic Diversity
NCI-Frederick, P.O. Box B
Bldg 560, Rm 11-85
Frederick, MD 21702
Phone: (301) 846-1304
Fax: (301) 846-1686
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 day later
#
On 05/12/2007, at 5:57 AM, Thomas Lumley wrote:

            
Much easier, and if you have access to a machine with multiple  
processors, simply duplicate the R process to have the same number as  
the number of processors, and then run them simultaneously. Not as  
elegant and maybe not as efficient as other methods, but effective.

Ken
#
On Thu, 6 Dec 2007, Ken Beath wrote:

            
But those processes need to do different things (and record the results in 
different files), which is what Thomas means by 'divide the job manually'.
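
A minimal sketch of what dividing the job manually could look like, assuming each R process is given its own chunk number. The names one_run(), run_chunk() and combine_chunks(), the file naming and the seeds are all hypothetical choices for illustration:

```r
# Hypothetical stand-in for the lengthy calculation; returns TRUE or FALSE.
one_run <- function() mean(rnorm(50)) > 0

# Run one chunk of the job and write "n_true n_runs" to its own file.
run_chunk <- function(chunk, n_runs, out_dir = ".") {
  set.seed(1000 + chunk)  # a different seed per chunk (crude but simple)
  n_true <- sum(replicate(n_runs, one_run()))
  out <- file.path(out_dir, sprintf("chunk_%02d.txt", chunk))
  cat(n_true, n_runs, "\n", file = out)
  invisible(out)
}

# Read every chunk file back and pool the counts into one proportion.
combine_chunks <- function(out_dir = ".") {
  files <- list.files(out_dir, pattern = "^chunk_[0-9]+\\.txt$",
                      full.names = TRUE)
  counts <- vapply(files, function(f) scan(f, quiet = TRUE), numeric(2))
  sum(counts[1, ]) / sum(counts[2, ])
}
```

Each R process would then call, say, run_chunk(1, 2500), run_chunk(2, 2500) and so on, and once all have finished a single call to combine_chunks() gives the overall proportion of TRUE runs. Pooling the raw counts rather than averaging per-chunk proportions keeps the result correct when the chunks differ in size.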

Incidentally, I find it useful to run slightly more R processes than the 
number of processors, to ensure full CPU usage when one of the processes 
is in an I/O wait or hits a swapping trap.  (This assumes you have ample RAM; 
otherwise you will get additional swapping.)

Even with many processors it may be easiest to do this manually.  Our 
geneticists do simulation-based inference by running separate simulation 
runs on up to 100s of processors simultaneously: the scheduler works 
better with independent jobs.
#
On 06/12/2007, at 9:13 PM, Prof Brian Ripley wrote:

            
I meant duplicate the R application, using the Finder. I thought it  
was unnecessary to mention that it will require a different set of  
commands to be run in each copy of R.

Ken