
snow errors: cannot run slavehostinfo on slaves

4 messages · Chris Berthiaume, Stephen Weston

#
I'm getting an error when I try to create an MPI cluster with more
than one slave node using snow.  Hopefully somebody on the list has
encountered this before.  Creating a cluster with two slaves gives:

  2 slaves are spawned successfully. 0 failed.
  Error in slave.hostinfo(1) : cannot run slavehostinfo on slaves
  [compute-0-0.local:22932] [[48203,0],0]-[[48203,2],0] mca_oob_tcp_msg_recv: readv failed: Connection reset by peer (104)
  [compute-0-0.local:22930] [[48203,1],0] routed:binomial: Connection to lifeline [[48203,0],0] lost

At this point R exits.  The server this was run on is
compute-0-0.local.  If I run makeMPIcluster(1), the single slave node
is created successfully and can be used; for example, clusterCall
works.  I've also tried using mpirun to start the master and worker
processes, but this doesn't get me much further.

  $ mpirun RMPISNOW  # <-- mpirun gets processor count of 2 from Torque
  master (rank 0, comm 1) of size 2 is running on: compute-0-0
  slave1 (rank 1, comm 1) of size 2 is running on: compute-0-0
  > library(snow)
  > cl <- getMPIcluster()
  > cl
  NULL
  > cl <- makeCluster()
  > clusterCall(cl, function() Sys.info()[c("nodename")])
  ...hangs...

So getMPIcluster() returns NULL, and using the cluster object
returned by makeCluster() causes R to hang.

Other possibly helpful information:

- I can run MPI C code OK across multiple nodes
- I can use Rmpi to create and use slave nodes OK
- Using CentOS 5 x86_64
- Using Rmpi 0.5-9
- Using snow 0.3-8
- Using R 2.12.1
- Using OpenMPI 1.4.4

Thanks for any help with this error,
-Chris
#
Hi Chris,

I'm not an MPI expert, but I've seen some problems running
snow/Rmpi scripts interactively from R.  I suggest that you first
get the non-interactive case working using mpirun.  Try running a
simple snow/Rmpi script, such as the following, which I'll call
mpi.R:

  library(snow)
  library(Rmpi)
  # spawn one worker per MPI slot, keeping one slot for the master
  cl <- makeMPIcluster(mpi.universe.size() - 1)
  # evaluate an expression on every worker to prove the cluster works
  r <- clusterEvalQ(cl, R.version.string)
  print(unlist(r))
  stopCluster(cl)
  mpi.quit()

Note that you have to specify the number of workers to
makeMPIcluster when spawning the workers.  The script gets that
number from mpi.universe.size(), which will return four in this
case (one per allocated slot), so subtracting one results in
three spawned workers.
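
If you want to sanity-check that arithmetic on its own, here's a
minimal sketch (run it under mpirun just like mpi.R; it spawns
nothing and only prints the counts):

  library(Rmpi)
  # one slot is reserved for the master; the rest can become workers
  n <- mpi.universe.size() - 1
  cat("universe size:", mpi.universe.size(), "-> would spawn", n, "workers\n")
  mpi.quit()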

Now run the script using mpirun:

  $ mpirun -n 1 R --slave -f mpi.R

Notice that I used '-n 1' because I only want mpirun to start
one process, which will be the master.  The rest of the
processes (the cluster workers) will be spawned by MPI
when the master calls makeMPIcluster.
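
If spawning succeeds, you should see output along these lines (the
spawn message comes from snow and its exact wording may vary
between versions; I've elided the version dates):

  3 slaves are spawned successfully. 0 failed.
  [1] "R version 2.12.1 (...)" "R version 2.12.1 (...)" "R version 2.12.1 (...)"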

If that doesn't work, it's possible that there's a problem with
spawning workers in your MPI installation.  In that case, use the
following script, which doesn't use spawning: all of the processes
are started directly by mpirun.  I'll call it mpi2.R:

  library(snow)
  library(Rmpi)
  # every rank except 0 becomes a worker: silence its output, run the
  # worker loop until the master shuts the cluster down, then quit
  if (mpi.comm.rank(0) > 0) {
    sink(file="/dev/null")
    slaveLoop(makeMPImaster())
    mpi.quit()
  }
  # only rank 0 (the master) gets here; note: no worker count is given
  cl <- makeMPIcluster()
  r <- clusterEvalQ(cl, R.version.string)
  print(unlist(r))
  stopCluster(cl)
  mpi.quit()

Some extra code makes every process except rank 0 execute the
slaveLoop() function.  Only rank 0, which we call the master,
actually calls makeMPIcluster(), and this time without a worker
count, since no workers need to be spawned.
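
If you want to see which role each process ends up with, a
hypothetical one-liner at the top of mpi2.R would print every
process's rank (in Rmpi, comm 0 refers to MPI_COMM_WORLD):

  cat("rank", mpi.comm.rank(0), "of", mpi.comm.size(0), "\n")  # rank 0 is the master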

This time, you don't use the mpirun -n option, so that mpirun
will start four processes in this case.  Rank 0 will become the
master and the rest will be workers:

  $ mpirun R --slave -f mpi2.R

Hopefully one of these two approaches will work for you.

Good luck,

- Steve
On Tue, Dec 13, 2011 at 2:16 PM, Chris Berthiaume <chrisbee at uw.edu> wrote:
#
I had meant to say that these examples assume that you're
working from an interactive Torque job, which seemed to
be your situation.  If you started the job with a command
such as:

  $ qsub -I -l nodes=4 -q devel

then you should get four slots allocated, and mpirun
will default to starting four processes.  That's why you
need to use '-n 1' for the spawn case, but don't need
to use -n for the non-spawn case.
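
If you want to confirm the slot count from inside the job, Torque
writes one line per allocated slot to the file named by the
PBS_NODEFILE environment variable, so counting its lines from R
should print 4 here:

  length(readLines(Sys.getenv("PBS_NODEFILE")))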

Sorry for any confusion,

- Steve


On Tue, Dec 13, 2011 at 3:48 PM, Stephen Weston
<stephen.b.weston at gmail.com> wrote:
#
Yes, I should have mentioned that I was running an interactive Torque
job, but it looks like you sussed that out from my mpirun comment.

  $ qsub -lwalltime=01:00:00,nodes=2:ppn=2 -I

I'll try running the non-interactive mpirun examples you suggested and
see what I get.  Thanks for your help.

-Chris

On Tue, Dec 13, 2011 at 1:00 PM, Stephen Weston
<stephen.b.weston at gmail.com> wrote: