distributed R on EC2, designing the software stack

Hello,
Yes, I was playing with EC2 and RHIPE last night. Just got permission
to increase my instances to 100!
The details (what I know):
RHIPE is based on Hadoop and R. Cloudera has a very easy-to-use AMI
for small and large (32/64-bit) instances. It is easy enough to install
the Cloudera AMI.
However, it does not come with R.

Last night, I modified their scripts to yum install R (using yum we
get R-2.6) on each machine. As such, this results in ~21MB of downloads
on the machines[1], which is not expensive but is not the best way to
do things.
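
To put a number on that, here is a rough back-of-the-envelope sketch in
Python. It assumes the ~21MB figure is per machine; the function name and
the `replicated` flag are made up for illustration, with the flag standing
for the case in footnote [1] where only one machine actually downloads.

```python
# Rough estimate of total download volume when R is installed at boot.
# Assumption: ~21 MB is downloaded per machine (figure from this message).
MB_PER_MACHINE = 21


def total_download_mb(machines, replicated=False):
    """Return the total MB downloaded across the cluster.

    replicated=True models the footnote [1] case, where one booted
    instance is copied and only a single download happens.
    """
    if machines <= 0:
        return 0
    if replicated:
        return MB_PER_MACHINE
    return MB_PER_MACHINE * machines


if __name__ == "__main__":
    for n in (2, 30, 100):
        print(n, "machines:", total_download_mb(n), "MB")
```

With 100 instances and no replication this comes to about 2.1GB per
cluster start, which is why bundling R into the AMI would be preferable.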

Once booted, each machine installs R and Rserve, and one machine (the
master) also installs RHIPE.
I did it with 1 master and 1 tasktracker and RHIPE worked. I intend to
check with 30+ instances to see how things scale.
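
The per-role install plan described above can be sketched roughly as
follows. This is only an illustration in Python, not the modified Cloudera
boot scripts themselves; the function name and the role strings are
assumptions, and the actual install commands are deliberately left out.

```python
# Which software each machine installs at boot, by role.
# Package names are taken from this message; everything else is illustrative.
def packages_for(role):
    """Return the list of packages a machine with the given role installs."""
    base = ["R", "Rserve"]          # installed on every machine
    if role == "master":
        return base + ["RHIPE"]     # only the master installs RHIPE
    return base


if __name__ == "__main__":
    # The configuration tested so far: 1 master and 1 tasktracker.
    for role in ("master", "tasktracker"):
        print(role, "->", packages_for(role))
```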

I have emailed Cloudera asking them to bundle R with their Hadoop AMI,
so that users incur a minimal expense.

I will be posting EC2 instructions for using RHIPE later this week.
Given the reasonable cost of EC2, it would be a great way for users to
test out distributed computing with R. Maybe as part of the R
community we could host a Linux AMI? Again, cost is the issue
here (I would rather not pay for users downloading things).

Regards
Saptarshi Guha
[1] Not quite sure how the AMIs work: if 10 AMIs belong to one
group, does EC2 boot up one and replicate the booted instance? If so,
then there is only one download; if not, each machine downloads.


On Wed, Apr 29, 2009 at 3:24 PM, Whit Armstrong
<armstrong.whit at gmail.com> wrote: