request for input on a new parallel R package using Amazon Web Services
I've received a lot of good private feedback on this package. Thank you all so much!

One note pointed out a bug that kept my example from running. Sorry about that. I've patched the code and updated the tarball on the Google Code site: http://code.google.com/p/segue/

Please note that I have never once run this code from a Windows or Mac local machine. I wrote the code intending it to be cross-platform, but my main machine runs Ubuntu Linux, so what little testing has been done has been done on Linux.

Thanks again for all your support and helpful comments.

-J
On Wed, Dec 22, 2010 at 9:16 PM, James Long <jdlong at gmail.com> wrote:
Dear R-HPC list:

About 7 months ago I presented at the Chicago Hadoop User Group an example of using Amazon's Elastic Map Reduce (EMR) as a method of running R in parallel. If you're curious about such things, here's the video from my presentation: http://www.vcasmo.com/video/drewconway/8468

Since then I've been working to grossly simplify the use of Amazon Web Services as an R parallel engine. Toward that end I've created an abstraction on top of AWS which I've named "Segue." It includes an lapply()-type function called emrlapply() which runs an lapply across an array of Amazon machines.

I'm not a professional developer and am actually somewhat new to parallel computing. But this project grew out of my own need to parallelize R for Monte Carlo modeling, and I don't have access to an MPI cluster. So this is dog food which I've been busy eating. I'd really appreciate some input from all of you who have been doing this type of thing a lot longer.

Please keep in mind that this package is VERY alpha. I've run a few tests and things work. But the wheels might pop off and odd things might happen. If you use it, you may end up with temp directories in your S3 account. Be sure to double-check that your EC2 instances have really shut down, or else Amazon will bill you for the running machines.

Please keep in mind that the use case for this package is people who, like myself, don't have access to their own cluster and would like to easily rent one from Amazon (emphasis on _easily_) for their CPU-bound tasks. This is not a "big data" package, because at each run of emrlapply() the list is serialized and uploaded to S3. The list must be in memory on the local machine, naturally, and thus is limited to objects that fit in your desktop memory. This package uses Amazon's Elastic Map Reduce framework, which is "Hadoop billed by the drink," but this is not a map/reduce system. The reduce step is, literally, cat. But the mapper step is harnessed as a "grid engine" of sorts.
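To make the two points above a bit more concrete, here is a rough local sketch in base R. It is not Segue's code and makes no AWS calls; payloadBytes() and chunkedLapply() are hypothetical helper names used only for illustration. The first estimates how large a list is once serialized (roughly what would have to be uploaded on each run), and the second mimics the "mapper as grid engine, reduce as cat" pattern by splitting a list into chunks, applying the function to each chunk independently, and concatenating the results in order.

```r
# Hypothetical helpers for illustration only; not part of the segue package.

payloadBytes <- function(x) {
  # serialize() with connection = NULL returns a raw vector, so its length
  # is the size in bytes of the serialized object.
  length(serialize(x, connection = NULL))
}

chunkedLapply <- function(X, FUN, ..., numWorkers = 2) {
  if (length(X) == 0) return(list())
  # Assign each element to one of numWorkers contiguous chunks.
  chunkId <- ceiling(seq_along(X) * numWorkers / length(X))
  idx <- split(seq_along(X), chunkId)
  # "Map" step: every chunk is processed independently of the others.
  mapped <- lapply(idx, function(i, ...) lapply(X[i], FUN, ...), ...)
  # "Reduce" step: plain concatenation in chunk order (names are dropped).
  unlist(mapped, recursive = FALSE, use.names = FALSE)
}

set.seed(1)
myList <- lapply(1:10, function(i) c(rnorm(999), NA))

payloadBytes(myList)  # bytes that would need to be shipped for this list

chunked <- chunkedLapply(myList, mean, na.rm = TRUE, numWorkers = 5)
stopifnot(isTRUE(all.equal(chunked, lapply(myList, mean, na.rm = TRUE))))
```

Because the "reduce" is only concatenation, the result matches a plain lapply() as long as chunk order is preserved, which is why the pattern suits CPU-bound work on lists that already fit in local memory.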
A Segue grid takes a little less than 10 minutes to start, but is then able to start individual jobs in under a minute (depending on the size of the list you are applying across). So there is significant latency, naturally.

Running Segue grids requires Amazon Web Services credentials, which are stored only on your local machine. You will be billed by Amazon for your machine time. The default machine size is "small," which has 1.7 GB of RAM and costs $0.085 per hour of run time. But if you are interested in testing this package and feel financially constrained, Amazon has been nice enough to give me some coupons for AWS run time. Just shoot me a note and I'll be happy to share these with you.

You can find the Segue repo here: http://code.google.com/p/segue/

If you install the package you can run a simple test like this:

    require(segue)

    ## requires your AWS access Key and Secret Key
    setCredentials("yourKey", "yourSecretKey", setEnvironmentVariables=TRUE)

    myCluster <- createCluster(numInstances=5)

    myList <- list()
    set.seed(1)
    for (i in 1:10){
      a <- c(rnorm(999), NA)
      myList[[i]] <- a
    }

    outputEmr   <- emrlapply(myCluster, myList, mean, na.rm=TRUE)
    outputLocal <- lapply(myList, mean, na.rm=TRUE)
    all.equal(outputEmr, outputLocal)

    stopCluster(myCluster)

This email is the very first time I've shared this code publicly. Please feel free to email me directly or file issue reports on the Google Code site. Any and all feedback is appreciated. And, yes, it's on my road map for Segue to be a 'foreach' backend. I just want to get all the kinks worked out of the basic code first.

Thanks in advance,

James "JD" Long