Help with doMPI on multiple cores on a cluster
As a note, it is safe to start all of the MPI processes using 'mpirun -n 32' with the doMPI package, because startMPIcluster doesn't spawn any workers if mpi.comm.size(0) > 1, at least by default. In that case, startMPIcluster calls the worker loop function if the rank is greater than zero, so that only rank 0 actually returns from startMPIcluster to execute the rest of the R script.

I usually let mpirun start all of the MPI processes when using doMPI because broadcasting appears to be more efficient in that case, at least when using Open MPI. It also makes it easy to initialize the workers in an SPMD style.

- Steve
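For reference, the non-spawning setup described above might look roughly like the following. This is a minimal sketch, not the original poster's actual script: the loop body and the output file name are placeholders (assumptions), and it assumes the job is launched with something like 'mpirun -n 32 R --slave -f ParallelAnalysis.R', so that mpirun rather than R starts all 32 processes.

```r
# Sketch of a doMPI script for non-spawn mode. Assumed launch command:
#   mpirun -n 32 R --slave -f ParallelAnalysis.R
library(Rmpi)
library(foreach)
library(doMPI)

# No 'count' argument: when mpirun has already started several ranks,
# startMPIcluster() sends ranks > 0 into the worker loop and returns
# only on rank 0, which runs the rest of this script as the master.
cl <- startMPIcluster()
registerDoMPI(cl)

# Placeholder loop body (hypothetical): each iteration returns one row,
# and the rows are combined with rbind.
x <- foreach(i = 1:100, .combine = "rbind") %dopar% {
  c(iteration = i, value = sqrt(i))
}

write.table(x, "output.tsv")  # placeholder output file name

# Shut the workers down and finalize MPI before exiting.
closeCluster(cl)
mpi.quit()
```

Ending with closeCluster(cl) followed by mpi.quit() lets the workers leave their loop and finalizes MPI cleanly, which helps avoid hangs or abort messages when the job ends.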
On Mon, Oct 21, 2013 at 11:56 AM, Lockwood, Glenn <glock at sdsc.edu> wrote:
I'd like to echo Steve's advice: try using Open MPI instead. I've had innumerable problems trying to get mvapich2 (on which Intel MPI is based) and Rmpi to work, and officially, Rmpi only supports Open MPI and MPICH. It's an uphill battle.

Also, you should be running mpirun with -n 1 (as you are already doing) if you are calling R directly. Doing anything else will cause multiple master scripts to run, each spawning its own set of MPI ranks and leaving you with a lot more MPI ranks than you want. Some libraries provide special wrappers that let you call them directly using mpirun -np 32 (e.g., snow provides the RMPISNOW command), but these are unique to each library, whereas using mpirun -n 1 is universal.

Glenn

On Oct 21, 2013, at 6:55 AM, Stephen Weston <stephen.b.weston at gmail.com> wrote:
Hi Srihari,

I suspect it's an MPI issue. Are you able to run any other simple MPI programs successfully, and specifically, any using R with Rmpi?

From the error message, it appears that you're using Intel MPI, which I've never used. I believe Rmpi is primarily tested with Open MPI, which is what I've always used with doMPI. It would be interesting to see if you can run successfully using Open MPI, if that is possible for you.

You'll probably need to look for help on an Intel MPI forum, although you may need to reduce the problem to something that doesn't use R. Here is a similar issue that I found on an Intel MPI forum:

http://software.intel.com/en-us/forums/topic/329053

You could also try running without spawning, since that may be a problem for Intel MPI. To do that, change the R script to use:

    cl <- startMPIcluster()

Also change the mpirun command in the PBS script to use '-n 32', or don't specify the -n option at all. In that case, mpirun will start all of the workers as well as the master, which may work better.

Regards,
Steve Weston

On Mon, Oct 21, 2013 at 9:42 AM, Srihari Radhakrishnan <srihari at iastate.edu> wrote:
Hi,
I've been trying to use doMPI to run the iterations of a for loop in
parallel (using the foreach package) on a cluster. However, I've been
running into issues - I think it's the way I am running the R script, but I
could be wrong. Here's the description of the problem.
We use a PBS scheduler to submit jobs; my script uses 2 nodes (32 cores)
for now. I run one instance of the R interpreter, which internally spawns 31
workers using R's MPI libraries. Below are the PBS script, the relevant bits
of the R code, and the error.
***Begin PBS Script***
#!/bin/bash
#PBS -o BATCH_OUTPUT
#PBS -e BATCH_ERRORS
#PBS -lnodes=2:ppn=16:compute,walltime=12:00:00
# Change to directory from which qsub command was issued
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE
# Call mpirun with 1 copy of the R interpreter. This will spawn 31 workers
# inside the R script.
time mpirun -n 1 R --slave -f ParallelAnalysis.R
***End PBS script***
***Begin R Script***
source("http://bioconductor.org/biocLite.R")
#MPI stuff initialization
library(Rmpi)
library(foreach)
library(doMPI)
cl <- startMPIcluster(count=31) #call 31 clusterworkers/slaves
registerDoMPI(cl)
library(MEDIPS)
library(BSgenome)
.
.
*more R code; variable assignments etc; no mpi stuff here*
.
.
# The following code runs 100 parallel iterations using the doMPI library
# loaded above and stores the results in the variable x. x is a table that
# stores the result of each iteration as a row.
x <-foreach(i=1:100,.combine='rbind') %dopar% {
*stuff to do inside loop*
}
write.table(x, "output.tsv") #write x into file.
***End R script***
The execution halts as soon as the libraries are loaded - I get the
following error message repeatedly from both nodes (node 203 and node 202)
[2:node203] unexpected disconnect completion event from dynamic process with rank=0 pg_id=kvs_17890_0 0x1fce600
Assertion failed in file ../../dapl_conn_rc.c at line 1128: 0
I am not sure if this is an issue with the compilers or the script itself.
The script runs successfully without using mpi (using only 1 node). Any
help would be highly appreciated.
Thanks in advance,
Srihari
--
Srihari Radhakrishnan
Ph.D candidate
Valenzuela Lab
Iowa State University
_______________________________________________ R-sig-hpc mailing list R-sig-hpc at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-hpc