distributed R on EC2, designing the software stack
5 messages · Stephen J. Barr, Whit Armstrong, Saptarshi Guha +2 more

Greetings,

I am trying to get into distributed computing with R, but do not have access to a cluster. Therefore, I am trying to get distributed R running on Amazon's EC2 ( http://aws.amazon.com/ec2/ ). For those of you who don't know, EC2 lets you instantiate large numbers of computers, bundled with whatever OS and software configuration you want.

From my survey, there are a lot of different options available for distributed computing. For my needs, I would just like to run simple Monte Carlo simulations and other things that don't require a ton of inter-node communication. What I would like to do is put together a public AMI and a howto guide, so that it would be very easy for anyone to instantiate an N-node cluster and start with parallel computing. I would like to have a discussion/brainstorm over what the exact software stack should be. My initial thoughts were:

1) R 2.9.0 + Open MPI + Rmpi + snowfall/sfCluster
   - will Amazon's network work with Open MPI? Perhaps it would be better
     to use PVM or something that is more tolerant of a non-optimal network.
2) R 2.9.0 + "socket based communication" + snowfall/sfCluster
   - is this scalable?
3) R 2.9.0 + twisted + NetWorkSpaces
   - not sure if Amazon's network supports broadcast mode, which is
     required by twisted.
4) Biocep-R
   - this looks like it has the functionality to do what I want, but a lot
     of other stuff as well.
5) RHIPE
   - Hadoop is well supported on EC2, so perhaps this is the way to go.
     Seems like a very new package :)

What are people's thoughts on what would be a good software stack, with the constraint that it should be simple and run on EC2?

Thanks,
-stephen
==========================================
Stephen J. Barr
University of Washington
WEB: www.econsteve.com
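For the kind of embarrassingly parallel Monte Carlo Stephen describes, option 2 can be sketched in a few lines. This is a hypothetical example, not code from the thread: the host names are placeholders for EC2 instance addresses, and it assumes the snowfall package (and rlecuyer, for the RNG setup) is installed on every node.

```r
# Sketch: socket-based Monte Carlo with snowfall (option 2).
# "localhost" entries stand in for real EC2 hostnames.
library(snowfall)

sfInit(parallel = TRUE, cpus = 4, type = "SOCK",
       socketHosts = rep("localhost", 4))   # replace with EC2 hosts

# One replicate: estimate pi from n uniform draws in the unit square
pi_hat <- function(n) {
  x <- runif(n); y <- runif(n)
  4 * mean(x * x + y * y < 1)
}

sfClusterSetupRNG()                          # independent RNG streams
est <- mean(sfSapply(rep(1e5, 40), pi_hat))  # 40 replicates, farmed out
sfStop()
print(est)
```

Because each replicate is independent, the only network traffic is the initial scatter and the final gather, which is exactly the low-communication pattern that should tolerate EC2's network.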
you should contact Robert Grossman who just gave a presentation on this topic at R/Finance in Chicago. link: http://rinfinance.quantmod.com/speakers/ -Whit
On Wed, Apr 29, 2009 at 3:06 PM, Stephen J. Barr <stephenjbarr at gmail.com> wrote:
_______________________________________________ R-sig-hpc mailing list R-sig-hpc at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
Hello,

Yes, I was playing with EC2 and RHIPE last night. I just got permission to increase my instances to 100!

The details (what I know): RHIPE is based on Hadoop and R. Cloudera has a very easy-to-use AMI for small and large (32/64-bit) instances. It is easy enough to install the Cloudera AMI; however, it does not come with R. Last night, I modified their scripts to yum-install R (using yum we get R 2.6) on each machine. As such, this results in ~21 MB of downloads per machine [1], which is not expensive but is not the best way to do things. Once booted, each machine installs R and Rserve, and one machine (the master) installs RHIPE. I tried it with 1 master and 1 tasktracker, and RHIPE worked. I intend to check with 30+ instances to see how things scale.

I have emailed Cloudera asking them to bundle R with their Hadoop AMI, so that users incur a minimal expense. I will be posting EC2 instructions for using RHIPE this week. Given the reasonable cost of EC2, it would be a great way for users to test out distributed computing with R. Maybe as part of the R community we could host a Linux AMI? Again, cost is the issue here (I would rather not pay for users downloading things).

Regards,
Saptarshi Guha

[1] Not quite sure how the AMIs work - if 10 AMIs belong to one group, does EC2 boot up one and replicate the booted instance? If so, there is only one download; if not, each machine downloads separately.

On Wed, Apr 29, 2009 at 3:24 PM, Whit Armstrong
<armstrong.whit at gmail.com> wrote:
On 29 April 2009 at 12:06, Stephen J. Barr wrote:
| 1) R 2.9.0 + OpenMPI + RMpi + Snowfall/sfCluster
|    - will Amazon's network work with OpenMPI. Perhaps it would be
|      better to use PVM or something that is more tolerant to
|      non-optimal network

If you can use standard snow rather than snowfall/sfCluster, then (I believe) you are done. As per some emails on the Open MPI list from last fall or summer, you can get Debian / Ubuntu instances where all this is just an 'apt-get install' or two away, given the set of packages I maintain for Debian. Plus you get slurm to control it.

| 2) R 2.9.0 + "socket based communication" + Snowfall/sfCluster
|    - is this scalable

Likewise, snow with sockets works as-is on Debian / Ubuntu.

| 3) R 2.9.0 + twisted + NetWorkSpaces
|    - not sure of Amazon's network supports broadcast mode, which is
|      required by twisted

Should also work out of the box via the r-cran-nws and python-nwsserver packages I maintain.

| 4) Biocep-R
|    - this looks like it has the functionality to do what I want, but a
|      lot of other stuff as well.

Yep, but I haven't had a chance to look more closely.

| 5) RHIPE
|    - Hadoop is well supported by EC2. Perhaps this is the way to go.
|      Seems like a very new package :)

Yes, and there is more Hadoop stuff cooking on R-Forge.

| What are people's thoughts on what would be a good software stack with
| the constraint that it should be simple and run on EC2?

I use the computers hanging around the house. If you have a desktop and a laptop, you are ready to go. Or if you have enough RAM, you can try virtualized approaches as well. Last time I tried (for my HPC tutorials) the networking was not fully 'see-through' yet, though I hear that VirtualBox has improved there.

Let us know what you come up with.

Dirk
Three out of two people have difficulties with fractions.
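Dirk's suggestion (plain snow on top of Rmpi / Open MPI) can be sketched as follows. This is an illustrative example, not from the thread: it assumes the Debian/Ubuntu packages he mentions (r-cran-snow, r-cran-rmpi, openmpi-bin) are installed on every instance, and that the script is launched under mpirun so MPI can spawn the workers.

```r
# Sketch of option 1: snow over MPI, launched e.g. as
#   mpirun -np 1 R --no-save -f this_script.R
library(snow)

cl <- makeCluster(8, type = "MPI")     # spawn 8 MPI worker processes
clusterSetupRNG(cl)                    # independent RNG streams per worker

# Load-balanced apply: good fit for EC2, where instances may be
# heterogeneous and finish tasks at different speeds
res <- clusterApplyLB(cl, rep(1e5, 80),
                      function(n) mean(rnorm(n)))

stopCluster(cl)
Rmpi::mpi.quit()
```

The load-balanced `clusterApplyLB` hands out the next task as each worker finishes, so a slow EC2 instance does not stall the whole run the way a static split would.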
On Wed, Apr 29, 2009 at 3:39 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
On 29 April 2009 at 12:06, Stephen J. Barr wrote:
| 3) R 2.9.0 + twisted + NetWorkSpaces
|    - not sure of Amazon's network supports broadcast mode, which is
|      required by twisted

Should also work out of the box via the r-cran-nws and python-nwsserver packages I maintain.
Note that the way NetWorkSpaces uses twisted does not require any special broadcast mode.
Steve Weston REvolution Computing One Century Tower | 265 Church Street, Suite 1006 New Haven, CT 06510 P: 203-777-7442 x266 | www.revolution-computing.com
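Steve's point about NetWorkSpaces needing only ordinary point-to-point connections can be illustrated with a minimal sketch. This is an assumed example, not from the thread: it presumes an nws server (python-nwsserver) is already running, and "master.example.com" is a placeholder hostname.

```r
# Sketch: coordination through a NetWorkSpace -- every operation is a
# plain TCP round-trip to the nws server, no broadcast involved.
library(nws)

ws <- netWorkSpace('demo', serverHost = 'master.example.com')

nwsStore(ws, 'task', 1e5)             # master posts a task
n <- nwsFetch(ws, 'task')             # a worker fetches (and removes) it
nwsStore(ws, 'result', mean(runif(n)))
r <- nwsFetch(ws, 'result')           # master collects the result
```

Workers block on `nwsFetch` until a value appears, so the same pattern scales to many workers pulling tasks from a shared variable, which suits EC2's unicast-only network.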