
mpi_comm_spawn error with Rmpi and snow on SGI Altix

7 messages · Stefan Theussl, Gad Abraham, Ei-ji Nakama

#
Hi,

I'm running R-2.8.1 on SuSE linux (don't know which version, kernel 
2.6.16.60-0.27-default) on an SGI Altix cluster.

Rmpi 0.5-6 and snow 0.3-3 both installed OK, Rmpi is linked correctly AFAIK:
 > R CMD ldd Rmpi.so
linux-gate.so.1 =>  (0xa000000000000000)
libmpi.so => /usr/lib/libmpi.so (0x200000080008c000)
...

I'm trying to call the script testsnow.R:

library(snow)
cl <- makeCluster(4, "MPI")

fun <- function(i) {
    x <- matrix(rnorm(3e6), ncol=3)
    replicate(1000, crossprod(x))
}

system.time({
    r <- parLapply(cl, 1:4, fun)
})


I call it with another script submitted to qsub:
#!/bin/sh
#$ -cwd
#$ -pe mpi 2
mpirun -np 1 /home/gabraham/bin/R --vanilla < testsnow.R


This fails with the error:

Error calling job_getjid(): No such file or directory
Error in .Call("mpi_comm_spawn", as.character(slave), 
as.character(slavearg),  :
   C symbol name "mpi_comm_spawn" not in DLL for package "Rmpi"
Calls: makeCluster ... switch -> makeMPIcluster -> mpi.comm.spawn -> .Call
Execution halted
MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
MPI: aborting job

Any idea what's happening?

Thanks,
Gad
#
Hi Gad,

Presumably you have an old implementation of MPI installed. The 
underlying MPI_Comm_spawn has only been part of the standard since 
MPI-2, as far as I remember. Can you please tell us which 
implementation/version of MPI you use? (In the case of LAM, you could 
send us the output of `laminfo'.)

Best,
Stefan
2 days later
#
Hi Stefan & Martin,

I'm merging my offline conversation with Martin.

Details of the system:
SGI Altix 3700Bx2, SUSE enterprise server 10 SP1
/usr/lib/libmpi.so comes from the SGI package sgi-mpt-1.21-sgi601r1 
which according to its release notes supports some MPI2 features like 
MPI_Comm_spawn.

Martin suggested checking whether the symbols are in the library, and 
whether Rmpi is perhaps not configuring itself correctly with -DMPI2:

 > nm /usr/lib/libmpi.so | grep comm_spawn
0000000000099180 W mpi_comm_spawn_
0000000000099180 W mpi_comm_spawn__
000000000009b390 W mpi_comm_spawn_multiple_
000000000009b390 W mpi_comm_spawn_multiple__
0000000000164dc0 T MPI_SGI_comm_spawn_request
0000000000099180 T pmpi_comm_spawn_
0000000000099180 W pmpi_comm_spawn__
000000000009b390 T pmpi_comm_spawn_multiple_
000000000009b390 W pmpi_comm_spawn_multiple__
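
Martin's check can also be applied to Rmpi's own shared library: the 
spawn wrapper is only compiled into Rmpi.so when -DMPI2 is set, so its 
absence there would match the "not in DLL" error. A sketch, assuming an 
install path (the real one can be printed from R with 
system.file("libs", package="Rmpi")):

```shell
# Check whether the installed Rmpi.so exports the spawn wrapper.
# The library path is a guess for this system; substitute your own.
nm /home/gabraham/R/library/Rmpi/libs/Rmpi.so 2>/dev/null \
  | grep mpi_comm_spawn \
  || echo "mpi_comm_spawn missing: Rmpi was built without -DMPI2"
```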


Here's sample output from R CMD INSTALL Rmpi; -DMPI2 is indeed not set:

gcc -std=gnu99 -I/home/gabraham/dmf/Software/R-2.8.1/include 
-DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" 
-DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\" -I/usr/include  -DUNKNOWN 
-fPIC -I/usr/local/include    -fpic  -g -O2 -c conversion.c -o conversion.o


If I recompile Rmpi either with MPI_DEPS="-DMPI2" R CMD INSTALL Rmpi, or 
by setting MPI_DEPS="-DMPI2" in Rmpi/configure.ac and re-running 
autoconf before R CMD INSTALL, as Martin suggested, then -DMPI2 is set:

gcc -std=gnu99 -I/home/gabraham/dmf/Software/R-2.8.1/include 
-DPACKAGE_NAME=\"\" -DPACKAGE_TARNAME=\"\" -DPACKAGE_VERSION=\"\" 
-DPACKAGE_STRING=\"\" -DPACKAGE_BUGREPORT=\"\"  -I/usr/include -DMPI2 
-DUNKNOWN -fPIC -I/usr/local/include    -fpic  -g -O2 -c conversion.c -o 
conversion.o
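
For reference, the first of those two routes as a single command (a 
sketch; it assumes the Rmpi sources are unpacked in ./Rmpi):

```shell
# Rebuild Rmpi with the MPI-2 code path enabled; without -DMPI2 the
# mpi_comm_spawn wrapper is never compiled into Rmpi.so.
MPI_DEPS="-DMPI2" R CMD INSTALL Rmpi
```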


but I get a different error when I run the example code:

Error calling job_getjid(): No such file or directory
Error in mpi.comm.spawn(slave = mpitask, slavearg = args, nslaves = 
count,  :
   Error during spawn request
Calls: makeCluster ... switch -> makeMPIcluster -> mpi.comm.spawn -> .Call
Execution halted
MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
MPI: aborting job


Thanks,
Gad
#
Hi,

Did you specify the universe size?

http://www.techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Developer/books/MPT_UG/sgi_html/ch03.html#id5188254


2009/2/6 Gad Abraham <gabraham at csse.unimelb.edu.au>:

2 days later
#
Hi Ei-ji,

I added -up to the script called by qsub:
#!/bin/sh
#$ -cwd
#$ -pe mpi 2
mpirun -up 2 -np 1 /home/gabraham/bin/R --vanilla < testsnow.R

but I still get the same error:
 > cat testsnow.sh.e394520
Error calling job_getjid(): No such file or directory
Error in mpi.comm.spawn(slave = mpitask, slavearg = args, nslaves = 
count,  :
   Error during spawn request
Calls: makeCluster ... switch -> makeMPIcluster -> mpi.comm.spawn -> .Call
Execution halted
MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
MPI: aborting job

G.
#
Hi, Gad.
Is your qsub from OpenPBS?
If it is OpenPBS, something like the following may work:

#!/bin/bash
#PBS -N testsnow
#PBS -q <<name of a queue you can use on your machine>>
#PBS -l ncpus=3
#PBS -o testsnow.out
#PBS -e testsnow.err
mpirun -up 2 -np 1 /home/gabraham/bin/R CMD BATCH --vanilla testsnow.R

Try `man qsub'.

1 day later
#
Ei-ji,

qsub here is from Sun Grid Engine, not OpenPBS.

Gad