We have a similar cluster to yours, and I am able to spawn workers on multiple nodes using the procedure that you describe (except that I don't use the qsub "-V" option). I'm using R 3.0.2, Rmpi 0.6.3, and Open MPI 1.6.5 on a RHEL 6.2 cluster. However, we didn't use the nopsm option when building Open MPI. (Note that I eventually installed Rmpi using the "--no-test-load" option to avoid the "error obtaining unique transport key" problem.)

We configured Open MPI 1.6.5 using the options:

    --enable-shared --enable-static --with-tm --with-openib --with-hwloc=internal

Since you appear to be using a PBS-derived system, you might want to try using "--with-tm" (if you're not already) to see if that makes a difference. That option does relate to remote execution, so it seems worth trying.

In any case, I'd be very interested to hear if and how you solve the problem.

- Steve

On Thu, Jun 26, 2014 at 10:30 AM, Russell Pierce <russell.s.pierce at gmail.com> wrote:
I am having difficulty getting Rmpi to spawn across nodes. My system administrator is knowledgeable, but unfamiliar with R. Other jobs are able to run across nodes on the cluster without difficulty.

The system I am working on has multiple nodes running R 3.0.2 on x86_64-redhat-linux-gnu (64-bit) with Rmpi_0.6-3 and openmpi version 1.6.5 compiled with a nopsm option. nopsm was set while tracking down another error message, on the basis of a post elsewhere (http://www.open-mpi.org/community/lists/users/2011/10/17660.php), and it seemed to help get Rmpi to compile and run on the remote node. Rmpi was specifically R CMD INSTALLed against this nopsm version of openmpi.

What I'd like to be able to do, as a proof of concept, is run R interactively with access to the multiple nodes on the cluster. Here is my minimal example...
From the login node I can run:
qsub -I -V -l nodes=2:ppn=12
I am transferred to one of the computation nodes, and I can tell that
I've been assigned two nodes to work on using the 'mynodes' command in
bash. When I 'cat $PBS_NODEFILE' I get a list of each node name
repeated 16 times. Therefore, I am reasonably sure I was actually
assigned distinct nodes.
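A quick way to confirm what the node file contains is to count slots per host instead of reading the raw list. The following is only a sketch: the function names summarize_nodefile and count_unique_hosts are made up for illustration, and it assumes the node file has one hostname per line, as PBS/Torque normally writes it.

```shell
# Hypothetical helpers (not part of PBS or Open MPI) for inspecting a node file.
# Each takes an optional path; by default it reads $PBS_NODEFILE.

# Print one line per distinct host, followed by the number of slots listed for it.
summarize_nodefile() {
    nodefile="${1:-${PBS_NODEFILE:-}}"
    sort "$nodefile" | uniq -c | awk '{ print $2, $1 }'
}

# Print the number of distinct hosts in the allocation.
count_unique_hosts() {
    nodefile="${1:-${PBS_NODEFILE:-}}"
    sort -u "$nodefile" | wc -l | tr -d ' '
}
```

Inside the interactive job, running summarize_nodefile with no arguments should list each assigned node once with its slot count, and count_unique_hosts should print 2 for a two-node allocation.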
I launch R with the bash command:
mpirun -np 1 -hostfile $PBS_NODEFILE R --interactive --vanilla
I've also tried using the -n option rather than -np as I've seen in
some other sample scripts with similar results.
Within R on one of the computation nodes I type the following commands:
library(Rmpi)
mpi.spawn.Rslaves()
mpi.remote.exec(paste(Sys.info()[c("nodename")],"checking in as",mpi.comm.rank(),"of",mpi.comm.size()))
... the results of these commands indicate that all of the slaves
started on the same node.
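One way to narrow this down might be to take R out of the picture and ask mpirun itself where it places processes. The sketch below is an assumption-laden diagnostic, not a fix: check_mpirun_placement is a hypothetical helper name, and it relies on Open MPI starting one process per slot listed in the host file when no -np value is given.

```shell
# Hypothetical helper: launch `hostname` under mpirun and tally the results.
# Takes an optional host file path; by default it uses $PBS_NODEFILE.
check_mpirun_placement() {
    nodefile="${1:-${PBS_NODEFILE:-}}"
    if [ -z "$nodefile" ]; then
        echo "no host file given and PBS_NODEFILE is not set" >&2
        return 1
    fi
    if ! command -v mpirun >/dev/null 2>&1; then
        echo "mpirun not found on PATH" >&2
        return 1
    fi
    # Print each reporting host followed by the number of processes it ran.
    mpirun -hostfile "$nodefile" hostname | sort | uniq -c | awk '{ print $2, $1 }'
}
```

If this lists both nodes, remote launching works at the MPI level and the problem is more likely in how R or Rmpi is started. If it lists only one node, the Open MPI build or its launcher configuration is the more likely place to look.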
I saw the "Rmpi spawning across nodes" topic from March of 2012.
"Snow Not Distributing" from 2012 demonstrates a similar problem. I
tried Ex60-HelloWorldSnow from that source, but all results indicate
that they were generated from the same node.
Is what I am aiming to do possible? If so, is there something I am
doing incorrectly or that I need to check/report to help diagnose the
problem?
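One thing that could be checked and reported is whether the Open MPI build includes PBS/Torque (tm) support, which is what the "--with-tm" configure option in the reply above enables. The sketch below assumes that ompi_info lists components in its usual "MCA <framework>: <component>" form; has_tm_component is a hypothetical helper name, not an Open MPI command.

```shell
# Hypothetical helper: reads `ompi_info` output on stdin and succeeds
# if a tm component is listed for the ras (allocation) or plm (launch) framework.
has_tm_component() {
    grep -Eq 'MCA (ras|plm): tm( |$)'
}

# Example use inside the job (left commented out, since it needs Open MPI installed):
#   if ompi_info | has_tm_component; then
#       echo "tm components found"
#   else
#       echo "no tm components found"
#   fi
```

If no tm components are found, Open MPI cannot read the node allocation directly from PBS, so it depends entirely on the host file and on whatever remote launcher (typically ssh) is available between nodes.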
_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc