
Cluster R "environment" trouble. Using Rmpi

2 messages · Paul Johnson, Hao Yu

Hi, everybody.

A user came in with a problem on our Rocks Linux cluster. His function
runs fine in an interactive session, but when he sends it to
compute nodes with Rmpi, the jobs never return. I'd not seen that before.
We are sending out a few big tasks to a few nodes.

So I took his code, which is hundreds of lines long, spread across 4
files, and I've been staring at it for hours.  It makes me wonder ...

Question 1. How do auxiliary functions find their way onto compute nodes?

On the master, this sends "SimJob" to the compute nodes. SimJob is
inside "SimJob.R", as is "pars".  But if SimJob calls other functions,
how does the compute node find them?

############################################
library(Rmpi)
mpi.spawn.Rslaves(nslaves=4)

source("SimJob.R")
pars

ExitStatus <- mpi.parApply(pars, MARGIN=1, fun=SimJob)
cat("\n",table(ExitStatus),"\n")

mpi.close.Rslaves()
mpi.quit()
############################################

SimJob.R does a lot of things: it creates the object "pars" and
defines many other functions and objects.

"SimJob.R" has some interlinked functions, like this:

pre1 <- function(i) { whatever; source("someFile.R") }

pre2 <- function(j, something) { whatever(something); source("someOtherFile.R") }

pre3 <- function(i) { whatever }

SimJob <- function(x, i, j) {
    result1 <- pre1(i)
    result2 <- pre2(j, result1)
    result3 <- someRFunction(result1, result2)
    result3
}

someRFunction is from an R package, say lm() or something like that.

How does a compute node get the functions "pre1" and "pre2", and the
files they source?

What if the implementation of pre2 calls some function pre3?

We ARE on an NFS system with home folder available on all compute
nodes.  But the compute nodes don't inherit the working directory of
the master, do they?
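
Maybe the slaves could just be asked? Something like this might show it
(an untested sketch; the setwd() path is hypothetical):

############################################
# Ask each spawned slave for its working directory:
mpi.remote.exec(getwd())
# If they differ from the master, point them all at one NFS path:
mpi.bcast.cmd(setwd("/home/pauljohn/project"))  # hypothetical path
############################################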

Here's the frustrating part: I can run interactively on the master,
but the whole job won't run on the compute nodes.

Question 2. Suppose a function that we send to a node tries to write a result.
It has save(whatever, file="blha.Rda") in it. Where does that file
go? What is the "current working directory" on the compute node?

I think that we have to re-write this so we return the information to
the master node and save it there.
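
Roughly like this, I suppose (a sketch, assuming SimJob is changed to
return its result object rather than calling save() on the node):

############################################
# SimJob returns its result; the master collects everything and saves it.
Results <- mpi.parApply(pars, MARGIN=1, fun=SimJob)
save(Results, file="Results.Rda")  # runs on the master, in the master's cwd
############################################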


Question 3. Is there a way I can find out what is going on "over there" on a
compute node while it is working?

I wish I could put in a bunch of print statements so I could track the
thing's progress, but I don't know how to monitor them.

When this program runs interactively, it spits out some messages to
stdout. On a compute node, where do those go?
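
One idea might be to redirect each slave's output to its own file with
sink() (an untested sketch, using mpi.comm.rank() to build a per-slave
file name):

############################################
# Each slave writes its print()/cat() output to slave-<rank>.log:
mpi.bcast.cmd(sink(sprintf("slave-%d.log", mpi.comm.rank())))
# ... run the job ...
mpi.bcast.cmd(sink())  # restore each slave's stdout afterward
############################################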

I've used the web program "ganglia" to see that nodes are actually
being used.  They are, using lots of CPU.


I've re-worked this code so that it is all in one file (no more use
of source()). Still the same thing.

I can run SimJob() interactively on the master, but it never runs on the slaves.

Well, so long. I would appreciate your ideas.
4 days later
Hi Paul,

Just got back from two conferences.

First of all, when R slaves are spawned, they are "naked", meaning they
are started with only the basic R functions/libraries, even if they are in
the same directory as the master. You have to tell the slaves explicitly to
get all necessary objects or to load libraries. There are a few ways to do so.

Use mpi.bcast.Robj2slave(anRobj) to send "anRobj" from the master to all
slaves. If a function to be executed on the slaves depends on many other
functions or data objects, those must be sent to the slaves first.
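
For your example, roughly like this (a sketch; pre1, pre2, pre3, pars, and
SimJob are the names from your SimJob.R):

############################################
library(Rmpi)
mpi.spawn.Rslaves(nslaves=4)
source("SimJob.R")  # defines pars, pre1, pre2, pre3, SimJob on the master only

# Push every object SimJob needs; the slaves see none of them otherwise.
mpi.bcast.Robj2slave(pre1)
mpi.bcast.Robj2slave(pre2)
mpi.bcast.Robj2slave(pre3)
mpi.bcast.Robj2slave(SimJob)

ExitStatus <- mpi.parApply(pars, MARGIN=1, fun=SimJob)
############################################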

Use mpi.bcast.cmd(cmd()) to tell the slaves to run cmd(), e.g.
source("SimJob.R") (make sure to remove any execution commands from
SimJob.R). I don't know whether a race condition will be an issue, since
the slaves are all competing for the same file.
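
That is, something like this (assuming SimJob.R contains only definitions
and sits at the same NFS path on every node; the path here is hypothetical):

############################################
mpi.bcast.cmd(setwd("/home/pauljohn/project"))  # hypothetical shared NFS path
mpi.bcast.cmd(source("SimJob.R"))  # every slave reads the definitions itself
############################################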

mpi.scatter.Robj/mpi.gather.Robj can also be used to send/receive objects
between the master and the slaves.
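
A rough sketch (check the help pages for the exact calling conventions;
the slicing of pars is just for illustration):

############################################
nslaves <- mpi.comm.size() - 1
# Split the rows of pars into one slice per slave:
idx    <- split(seq_len(nrow(pars)), rep(1:nslaves, length.out = nrow(pars)))
slices <- lapply(idx, function(i) pars[i, , drop = FALSE])
mpi.scatter.Robj2slave(slices)  # each slave receives one element as "slices"
# Slaves apply SimJob to their slice and send the results back:
mpi.bcast.cmd(mpi.gather.Robj(apply(slices, 1, SimJob)))
results <- mpi.gather.Robj()    # list: master's entry first, then one per slave
############################################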

Hao