Dear all,

Does anyone know if there exists an effort to bring some kind of distributed computing to R? The simplest functionality I'm after is to be able to explicitly perform a task on a computing server. Sorry if this is an uninformed newbie question...

Best regards,
Anders Sjögren
PhD Student, Dept. of Mathematical Statistics
Chalmers University of Technology, Gothenburg, Sweden
Distributed computing
You probably want to look at either Rpvm or the R Statistical Server (Python/SOAP): http://www.analytics.washington.edu/statcomp/
On Mar 22, 2004, at 4:21 AM, Anders Sjögren wrote:
______________________________________________ R-devel@stat.math.ethz.ch mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-devel
--- Byron Ellis (bellis@hsph.harvard.edu) "Oook" -- The Librarian
The R-help list is probably better for this sort of question. In any case,
the R newsletter is a great source of accumulated (& distilled!) wisdom,
and the following might be relevant.
@Article{Rnews:Li+Rossini:2001,
author = {Michael Na Li and A.J. Rossini},
title = {{RPVM}: Cluster Statistical Computing in {R}},
journal = {R News},
year = 2001,
volume = 1,
number = 3,
pages = {4--7},
month = {September},
url = {http://CRAN.R-project.org/doc/Rnews/}
}
@Article{Rnews:Yu:2002,
author = {Hao Yu},
title = {{Rmpi}: Parallel Statistical Computing in {R}},
journal = {R News},
year = 2002,
volume = 2,
number = 2,
pages = {10--14},
month = {June},
url = {http://CRAN.R-project.org/doc/Rnews/}
}
@Article{Rnews:Carson+Murison+Mason:2003,
author = {Brett Carson and Robert Murison and Ian A. Mason},
title = {Computational Gains Using {RPVM} on a Beowulf Cluster},
journal = {R News},
year = 2003,
volume = 3,
number = 1,
pages = {21--26},
month = {June},
url = {http://CRAN.R-project.org/doc/Rnews/}
}
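The RPVM and Rmpi articles cited above describe a master/slave style of cluster computing. As a rough illustration of the Rmpi flavor, here is a minimal sketch of farming a task out to worker R processes; the function names follow the Rmpi package as described in the article, but a working MPI installation is assumed, so treat this as illustrative rather than definitive:

```r
# Hedged sketch: running a task on worker R processes with Rmpi.
# Assumes MPI and the Rmpi package are installed and configured.
library(Rmpi)

mpi.spawn.Rslaves(nslaves = 2)       # start two worker R processes

# Ship a function to the workers, then execute it remotely
simulate <- function(n) mean(rnorm(n))
mpi.bcast.Robj2slave(simulate)
results <- mpi.remote.exec(simulate(1e6))  # one result per slave
print(results)

mpi.close.Rslaves()                  # shut the workers down
mpi.quit()
```

This is the "explicitly perform a task on a computing server" pattern the original question asked about, with MPI doing the transport.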
hope this helps,
Tony Plate
At Monday 02:21 AM 3/22/2004, Anders Sjögren wrote:
As an alternative to the PVM/MPI interfaces mentioned by other people, I am working on a (very soon to be released) project for using the ScaLAPACK library [1] through a simple R interface. If the tasks that you want to run on a computing server are simple (LAPACK) functions (solve, svd, etc.) and not whole R scripts, then this may be useful.

David Bauer

[1] http://www.netlib.org/scalapack/
gte810u@mail.gatech.edu writes:
A number of folks have commented on having this in progress (especially a group at Vanderbilt). It's intriguing, but how did you plan to replace the standard system-level library calls? Or did you just provide new interfaces at the user (R command) level?

best,
-tony
rossini@u.washington.edu http://www.analytics.washington.edu/
Biomedical and Health Informatics, University of Washington
Biostatistics, SCHARP/HVTN, Fred Hutchinson Cancer Research Center
UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable
FHCRC (M/W): 206-667-7025 FAX=206-667-4812 | use Email
My inclination would be to, whenever possible, replace the core scalar libraries with compatible parallel versions (LAPACK -> ScaLAPACK), rather than make it an add-on package. If the R client code is general enough, and the makefile can automatically find the parallel version, then it's a simple matter of compiling with the parallel libs. (I don't know whether this is possible at run time.) No rewriting of (high-level) R code at all. I tried to contact the PLAPACK folks here at UT about integrating with R, but it appears the project is no longer active.
Unfortunately, there is a major complication to this approach: the distribution of data. ScaLAPACK (and PLAPACK) requires the data to be distributed in a special way before calculation functions can be called. Given a generic R matrix, we have to distribute the data before we can call ScaLAPACK functions on it. We then have to collect the answer before we can return it to R. Because of this serious overhead, replacing all LAPACK calls with ScaLAPACK calls would not be recommended. Future versions of our package [1] may include some type of automatic benchmarking to decide when problems are large enough to be worth sending to ScaLAPACK.

David Bauer

[1] http://www.aspect-sdm.org/Parallel-R/
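The distribute/compute/collect overhead described above suggests dispatching on problem size. A minimal sketch of that idea in plain R, where `scalapack.solve` is a made-up placeholder name (not a real function from the package under discussion) and the threshold is an arbitrary assumption that real benchmarking would replace:

```r
# Hedged sketch of size-based dispatch: route only large systems to a
# hypothetical distributed solver; small ones stay with R's built-in
# LAPACK-backed solve(), avoiding the data-distribution overhead.
parallel_solve <- function(a, b, threshold = 2000L) {
  if (nrow(a) >= threshold && exists("scalapack.solve")) {
    scalapack.solve(a, b)  # distribute, compute, collect (costly for small a)
  } else {
    solve(a, b)            # serial LAPACK path
  }
}
```

The crossover point depends on the cluster's interconnect, which is why the package authors mention automatic benchmarking rather than a fixed cutoff.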
Tim

Timothy H. Keitt
Section of Integrative Biology
University of Texas at Austin
http://www.keittlab.org/
Sorry for posing the question to the wrong list. Apart from the public replies I also got tips on the packages snow and Rmpi (http://www.stats.uwo.ca/faculty/yu/Rmpi/). Thanks for the advice.

Anders
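For readers landing here later: of the packages mentioned in this thread, snow is the simplest to try, since its socket clusters need no PVM or MPI installation. A hedged sketch (hostnames and worker count are placeholders; substitute the names of real compute servers):

```r
# Hedged sketch with the snow package: a socket cluster pushes tasks
# onto other machines (or local processes) without PVM/MPI.
library(snow)

cl <- makeCluster(2, type = "SOCK")            # or c("server1", "server2")
res <- clusterApply(cl, 1:4, function(x) x^2)  # one task per element
stopCluster(cl)
# res should be a list of the squares: 1, 4, 9, 16
```

This directly answers the original question: `clusterApply` is the "explicitly perform a task on a computing server" primitive.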
Fei Chen implemented distribution of data and ScaLAPACK as part of his DPhil thesis, with a high-level R interface. Moving data around is often the major limiting factor on large-scale model fitting (he was experimenting with GLMs). There are two brief papers, at http://www.isi-2003.de/guest/3427.pdf?MItabObj=pcoabstract&MIcolObj=uploadpaper&MInamObj=id&MIvalObj=3427&MItypeObj=application/pdf and in the DSC 2003 proceedings (but the ci.tuwien server is currently not available, at least from here). Now that Fei's process is complete, perhaps he will make the thesis available online.
On Tue, 23 Mar 2004 gte810u@mail.gatech.edu wrote:
Quoting someone unnamed --
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
Thanks, Brian, for pointing this out. Yes indeed, my thesis involved distributed computing and R. It consisted of two parts: a distributed scoping feature for limiting data movement, and a parallel computing interface for speeding up computations. The former used CORBA and the latter PVM (plus embedded R processes and ScaLAPACK).

There are three documents describing this in more detail:
http://www.stats.ox.ac.uk/~feic/Rs/thesis.pdf (my thesis)
http://www.stats.ox.ac.uk/~feic/Rs/shorter.pdf (a shorter summary)
http://www.stats.ox.ac.uk/~feic/Rs/DSC2003.pdf (the DSC document Brian pointed out)

I haven't publicized this mainly because the distributed scoping piece involved modifying internal R code, most notably the R_eval() function, which is a bit non-portable. But if there's interest in how I did things I can certainly clean up my code and make it available. The parallel engine part uses standard R, so it should be easier to set up.

Cheers,
fei
On Wed, 24 Mar 2004, Prof Brian Ripley wrote: