HPC with standard R functions

2 messages · Simone Ruzza, Brian G. Peterson

#
Dear list,

apologies for the total beginner's question, but I am very new to HPC.
I am confronted with a large data analysis job that requires using
functions from contributed packages that I did not write myself.
I would like to speed up the analysis and I am considering
parallel computing or a cluster. As far as I understand, it is not
always possible to parallelize R code to be executed on a
cluster; this depends on the computing task, i.e. whether it is
iterative. My question is: is it possible to speed up the execution
time of a function (e.g. some model fitting function) that itself calls
low-level functions? I am not looking for the solutions that I have
already found on the web, which show, for example, how to use the
snowfall package (e.g. sfLapply) to perform an iterative task. In
my case it appears that I would have to rewrite a large amount of
code myself, which seems to me equivalent to reinventing the
wheel. Apologies for the generality of my question, which is due to my
ignorance of the subject. Any help would be greatly appreciated!

Best wishes,

Simone
#
On 09/28/2013 01:19 PM, Simone Ruzza wrote:
I'm not sure you've told us enough to answer you.

If your task is repetitive (such as Monte Carlo analysis), then the 
answer is most likely yes.
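
As a rough sketch of the repetitive case, using the 'parallel' package: 
the function one_replicate, the sample sizes, and the two local workers 
are all made up for illustration, and lm() merely stands in for whatever 
model fitting function you are actually calling. The point is that the 
fitting function itself is not rewritten; only the loop around it is 
distributed.

```r
library(parallel)

# Hypothetical stand-in for one Monte Carlo replicate: simulate data,
# fit a model with an unmodified function (here stats::lm), return the slope.
one_replicate <- function(i) {
  x <- rnorm(200)
  y <- 2 * x + rnorm(200)
  unname(coef(lm(y ~ x))["x"])
}

cl <- makeCluster(2)                 # two local worker processes (an assumption)
clusterSetRNGStream(cl, iseed = 42)  # independent, reproducible RNG streams
slopes <- parSapply(cl, 1:100, one_replicate)
stopCluster(cl)

mean(slopes)  # should be close to the true slope of 2
```

If the function comes from a contributed package, the workers need it 
loaded as well, e.g. via clusterEvalQ(cl, library(...)).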

If your data can be partitioned, and your model can be fit on the 
partitions, then the answer is most likely yes, you can parallelize it.
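
A minimal sketch of the partitioned case, again with 'parallel'. The 
data frame, the grouping column, and fit_part are invented for the 
example, and I am assuming that fitting the model separately on each 
partition is statistically meaningful for your problem, which only you 
can judge.

```r
library(parallel)

# Illustrative data: a grouping column defines the partitions (an assumption).
set.seed(1)
dat <- data.frame(group = rep(c("a", "b", "c", "d"), each = 50),
                  x = rnorm(200))
dat$y <- 1 + 3 * dat$x + rnorm(200)

parts <- split(dat, dat$group)  # one data frame per partition

# Fit the same model on each partition; lm() stands in for any package function.
fit_part <- function(d) coef(lm(y ~ x, data = d))

cl <- makeCluster(2)            # two local worker processes (an assumption)
coefs <- parLapply(cl, parts, fit_part)
stopCluster(cl)

do.call(rbind, coefs)           # one row of coefficients per partition
```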

If your model can be partitioned, so that some or all of the 
sub-functions from other packages that you mention can be called in 
parallel on your large data, then the answer is most likely yes.

In terms of technology to use, at this point you'd have to tell us about 
the cluster you want to run it on. That would then help us decide 
whether you should be looking at 'parallel', now part of base R; 
'foreach', which has what I believe to be the very nice property of 
letting you write code that can use any or no parallel backend without 
changes; or something very specific like Rmpi, because the cluster you 
hope to use uses that as its parallel backend. (There are other possible 
endpoints too, but these seem to be the most popular.)
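
To illustrate the 'foreach' property mentioned above, here is a small 
sketch. It assumes the foreach package is installed, and the loop body 
is a toy computation rather than anything from your analysis. The loop 
is written once with %dopar%; which backend runs it is decided 
separately by whatever has been registered.

```r
library(foreach)

# Register the sequential backend explicitly, so the loop below runs
# without any parallel infrastructure.
registerDoSEQ()

res <- foreach(i = 1:4, .combine = c) %dopar% {
  sqrt(i)
}

# To run the same loop in parallel, only the registration changes, for
# example (requires the doParallel package; shown commented out):
# cl <- parallel::makeCluster(2)
# doParallel::registerDoParallel(cl)
```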

But from what I read above, you haven't given us enough detail about 
what you need to do for me at least to say anything definitive.

Regards,

Brian