Distributed computing

10 messages · Anders Sjögren, Byron Ellis, Tony Plate +5 more

#
Dear all,

Does anyone know if there exists an effort to bring some kind of 
distributed computing to R? The simplest functionality I'm after is 
to be able to explicitly perform a task on a computing server. Sorry if 
this is an uninformed newbie question...

Best regards

Anders Sjögren

PhD Student
Dept. of Mathematical Statistics
Chalmers University of Technology
Gothenburg, Sweden
#
http://www.analytics.washington.edu/statcomp/

You probably want to look at either Rpvm or the R Statistical Server 
(Python/SOAP)
On Mar 22, 2004, at 4:21 AM, Anders Sjögren wrote:
---
Byron Ellis (bellis@hsph.harvard.edu)
"Oook" -- The Librarian
#
The R-help list is probably better for this sort of question.  In any case, 
the R newsletter is a great source of accumulated (& distilled!) wisdom, 
and the following might be relevant.


@Article{Rnews:Li+Rossini:2001,
   author       = {Michael Na Li and A.J. Rossini},
   title        = {{RPVM}: Cluster Statistical Computing in {R}},
   journal      = {R News},
   year         = 2001,
   volume       = 1,
   number       = 3,
   pages        = {4--7},
   month        = {September},
   url          = {http://CRAN.R-project.org/doc/Rnews/}
}

@Article{Rnews:Yu:2002,
   author       = {Hao Yu},
   title        = {Rmpi: Parallel Statistical Computing in R},
   journal      = {R News},
   year         = 2002,
   volume       = 2,
   number       = 2,
   pages        = {10--14},
   month        = {June},
   url          = {http://CRAN.R-project.org/doc/Rnews/}
}


@Article{Rnews:Carson+Murison+Mason:2003,
   author       = {Brett Carson and Robert Murison and Ian A. Mason},
   title        = {Computational Gains Using {RPVM} on a Beowulf Cluster},
   journal      = {R News},
   year         = 2003,
   volume       = 3,
   number       = 1,
   pages        = {21--26},
   month        = {June},
   url          = {http://CRAN.R-project.org/doc/Rnews/}
}


hope this helps,

Tony Plate
At Monday 02:21 AM 3/22/2004, Anders Sjögren wrote:

1 day later
#
As an alternative to the PVM/MPI interfaces mentioned by other people, I am
working on a (very soon to be released) project for using the ScaLAPACK library
[1] through a simple R interface.  If the tasks that you want to run on a
computing server are simple (LAPACK) functions (solve, svd, etc.) and not whole
R scripts, then this may be useful.


David Bauer

[1] http://www.netlib.org/scalapack/
#
gte810u@mail.gatech.edu writes:
A number of folks have commented on having this in progress (esp. a
group at Vanderbilt).  It's intriguing, but how did you plan on
replacing the standard system-level library calls?  (Or did you just
provide new interfaces at the user (R command) level?)

best,
-tony
#
Unfortunately, there is a major complication to this approach:  the distribution
of data.  ScaLAPACK (and PLAPACK) requires the data to be distributed in a
special way before calculation functions can be called.  Given a generic R
matrix, we have to distribute the data before we can call ScaLAPACK functions on
it.  We then have to collect the answer before we can return it to R.  Because
of this serious overhead, replacing all LAPACK calls with ScaLAPACK calls would
not be recommended.  Future versions of our package [1] may include some type of
automatic benchmarking to decide when problems are large enough to be worth
sending to ScaLAPACK.
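
The size-threshold idea might be sketched as follows (purely illustrative:
`par.solve` and the threshold value are hypothetical, and `sla.solve` stands
in for whatever ScaLAPACK-backed solver the package exports):

```r
## Hypothetical sketch of size-based dispatch: only pay the cost of
## distributing the matrix when the problem is large enough to
## amortize the communication overhead.
par.solve <- function(a, b, threshold = 1000) {
  if (nrow(a) >= threshold) {
    sla.solve(a, b)  # hypothetical ScaLAPACK-backed solver
  } else {
    solve(a, b)      # small problem: serial LAPACK avoids the overhead
  }
}
```

The threshold itself would presumably come from the automatic benchmarking
mentioned above, rather than being a fixed constant.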


David Bauer

[1] http://www.aspect-sdm.org/Parallel-R/
#
My inclination would be to, whenever possible, replace the core scalar
libraries with compatible parallel versions (LAPACK -> ScaLAPACK),
rather than make it an add-on package. If the R client code is general
enough, and the makefile can automatically find the parallel version,
then it's a simple matter of compiling with the parallel libs. (I don't
know if this is possible at run-time.) No rewriting of (high-level) R code
at all. I tried to contact the PLAPACK folks here at UT about
integrating with R, but it appears the project is no longer active.
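
The build-time substitution described above corresponds to R's configure
options for an external BLAS/LAPACK; a sketch (the `-l` flag is a placeholder
for whatever parallel implementation the build finds):

```shell
# Point R's build at an external BLAS (and a matching external LAPACK)
# instead of the bundled reference versions.  The library name is a
# placeholder for a parallel implementation on the local system.
./configure --with-blas="-lblas" --with-lapack
make
```

Whether a given parallel library is link-compatible enough to be dropped in
this way is exactly the open question, of course.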

Tim
On Tue, 2004-03-23 at 13:32, A.J. Rossini wrote:
#
Sorry for posing the question to the wrong list. Apart from the public 
replies I also got tips on the packages snow and Rmpi 
(http://www.stats.uwo.ca/faculty/yu/Rmpi/).
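
For reference, the explicit "perform a task on a computing server" usage asked
about at the start of the thread looks roughly like this with snow (a minimal
sketch: the host names are placeholders, and it assumes snow is installed and
R is available on the worker machines):

```r
library(snow)  # assumes the snow package is installed

## Start R worker processes on two (placeholder) machines over sockets
cl <- makeSOCKcluster(c("server1", "server2"))

## Explicitly run a task on each worker and collect the results
res <- clusterApply(cl, 1:2, function(i) i^2)

stopCluster(cl)
```

clusterApply returns a list with one element per task, so res here would
hold the two squared values.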

Thanks for the pieces of advice.

-> Anders

---
On Mar 22, 2004, at 6:00 PM, Tony Plate wrote:
#
Fei Chen implemented distribution of data and ScaLAPACK as part of his 
DPhil thesis, with a high-level R interface.  Moving data around is often 
the major limiting factor on large-scale model fitting (he was 
experimenting with glm's).

There are two brief papers at

http://www.isi-2003.de/guest/3427.pdf?MItabObj=pcoabstract&MIcolObj=uploadpaper&MInamObj=id&MIvalObj=3427&MItypeObj=application/pdf

and in the DSC2003 proceedings (but the ci.tuwien server is currently not 
available, at least from here).

Now that Fei's process is complete, perhaps he will make the thesis available 
online.
On Tue, 23 Mar 2004 gte810u@mail.gatech.edu wrote:
Quoting someone unnamed! --
#
Thanks Brian for pointing this out...

Yes indeed my thesis involved distributed computing and R. It consisted of
two parts: a distributed scoping feature for limiting data movement, and
a parallel computing interface for speeding up computations. The former
used CORBA and the latter PVM (plus embedded R-s and ScaLAPACK).

There are three documents available describing this in more detail

http://www.stats.ox.ac.uk/~feic/Rs/thesis.pdf
my thesis

http://www.stats.ox.ac.uk/~feic/Rs/shorter.pdf
a shorter summary

http://www.stats.ox.ac.uk/~feic/Rs/DSC2003.pdf
the DSC document Brian pointed out.

I haven't publicized this mainly because the distributed scoping piece
involved modifying internal R code, most notably the R_eval() function,
which is a bit non-portable... But if there's interest in how I did things
I can certainly clean up my code and make it available. The parallel
engine part uses standard R so it should be easier to set up.

Cheers,

fei
On Wed, 24 Mar 2004, Prof Brian Ripley wrote: