doSNOW + foreach = embarrassingly frustrating computation
On 12/21/2010 10:59 AM, Marius Hofert wrote:
Hi all,

Martin Morgan responded off-list and pointed out that I might have used the wrong bsub command. He suggested:

bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --vanilla -f minimal.R

Since my installed packages were not found (due to --no-environ being part of --vanilla), I used:
I would check that neither your nor the site's R environment file is doing anything unusual; I'm surprised that you need this set. More below...
bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --no-save -q -f minimal.R

Below, please find all the outputs [run under the same setup as before], with comments. It seems like (2) and (6) almost solve the problem. But what does this "finalize" mean?

Cheers,

Marius

(1) First trial (check if MPI runs): minimal example as given on http://math.acadiau.ca/ACMMaC/Rmpi/sample.html

## ==== output (1) start ====
Sender: LSF System <lsfadmin at a6231>
Subject: Job 192910: <mpirun -n 1 R --no-save -q -f m01.R> Done
Job <mpirun -n 1 R --no-save -q -f m01.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:44:03 2010
Results reported at Tue Dec 21 19:44:19 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m01.R
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 24.90 sec.
Max Memory : 3 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
## from http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
+ library("Rmpi")
+ }
# Spawn as many slaves as possible
mpi.spawn.Rslaves()
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: a6231
slave1 (rank 1, comm 1) of size 5 is running on: a6231
slave2 (rank 2, comm 1) of size 5 is running on: a6231
slave3 (rank 3, comm 1) of size 5 is running on: a6231
slave4 (rank 4, comm 1) of size 5 is running on: a6231
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
+ if (is.loaded("mpi_initialize")){
+ if (mpi.comm.size(1) > 0){
+ print("Please use mpi.close.Rslaves() to close slaves.")
+ mpi.close.Rslaves()
+ }
+ print("Please use mpi.quit() to quit R")
+ .Call("mpi_finalize")
+ }
+ }
This part of the 'minimal' example doesn't seem minimal; I'd remove it, but follow its advice and conclude your scripts with

mpi.close.Rslaves()
mpi.quit()
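Put together, a minimal Rmpi script ending as suggested might look like the sketch below (this assumes the script is launched via something like 'mpirun -n 1 R --no-save -q -f minimal.R' on a host with Rmpi and a working MPI runtime; it is an illustration, not a tested submission script):

```r
## minimal.R -- sketch of an Rmpi script that shuts down cleanly
library(Rmpi)

mpi.spawn.Rslaves()          # spawn one slave per slot granted by mpirun

## have each slave identify itself
print(mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size())))

mpi.close.Rslaves()          # shut the slaves down first
mpi.quit()                   # finalize MPI and exit R
```

Ending with mpi.quit() (rather than letting R exit on its own) is what lets mpirun see a proper MPI finalize.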
# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
$slave1
[1] "I am 1 of 5"

$slave2
[1] "I am 2 of 5"

$slave3
[1] "I am 3 of 5"

$slave4
[1] "I am 4 of 5"
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          a6231.hpc-net.ethz.ch (PID 8966)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[1] 1
mpi.quit()
## ==== output (1) end ====
=> now there is no error anymore (only the warning (?))
(2) Second trial
## ==== output (2) start ====
Sender: LSF System <lsfadmin at a6231>
Subject: Job 193052: <mpirun -n 1 R --no-save -q -f m02.R> Exited
Job <mpirun -n 1 R --no-save -q -f m02.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:39 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m02.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 7.20 sec.
Max Memory : 3 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- makeCluster(3, type = "MPI") # create cluster object with the given number of slaves
3 slaves are spawned successfully. 0 failed.
clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
[1] "RNGstream"
registerDoSNOW(cl) # register the cluster object with foreach
## start the work
x <- foreach(i = 1:3) %dopar% {
+ sqrt(i)
+ }
x
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
stopCluster(cl) # properly shut down the cluster
[1] 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9048 on
node a6231.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
here I think you are being told to end your script with

mpi.quit()
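In other words, trial (2) rewritten with that ending would look roughly like this (a sketch under the same assumptions as the original m02.R: doSNOW, Rmpi, and rlecuyer installed, launched under mpirun):

```r
## m02.R, ended with mpi.quit() so mpirun sees a proper finalize
library(doSNOW)
library(Rmpi)
library(rlecuyer)

cl <- makeCluster(3, type = "MPI")    # create cluster with 3 MPI slaves
clusterSetupRNG(cl, seed = rep(1, 6)) # L'Ecuyer RNG streams on the slaves
registerDoSNOW(cl)                    # register the cluster with foreach

x <- foreach(i = 1:3) %dopar% sqrt(i) # the parallel work
print(x)

stopCluster(cl)                       # shut the slaves down
mpi.quit()                            # finalize MPI instead of a plain exit
```

The only change from the script that produced output (2) is the final mpi.quit(), which should make the "exiting without calling finalize" message (and the exit code 1) go away.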
## ==== output (2) end ====
=> okay, a first ray of hope: the calculations were done. But why the "exit code 1"/finalize problem?
(3) Third trial
## ==== output (3) start ====
Sender: LSF System <lsfadmin at a6204>
Subject: Job 193053: <mpirun -n 1 R --no-save -q -f m03.R> Exited
Job <mpirun -n 1 R --no-save -q -f m03.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6204>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:36 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m03.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 0.93 sec.
Max Memory : 3 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- makeCluster() # create cluster object
Error in makeMPIcluster(spec, ...) : no nodes available.
Calls: makeCluster -> makeMPIcluster
Execution halted
here snow is determining the size of the cluster with mpi.comm.size() (which returns 0), whereas I think you want to do something like

n = mpi.universe.size()
cl = makeCluster(n, type="MPI")

and likewise below. In some cases mpi.universe.size() uses a system call to 'lamnodes', which will fail on systems without a lamnodes command; the cheap workaround is to create an executable file called lamnodes that does nothing and is on your PATH.

Martin
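A sketch of trial (3) sized that way (an illustration only; note that on some installations mpi.universe.size() counts the master's slot too, so you may need to subtract 1 to leave room for the master process):

```r
## m03.R, sized from the MPI universe instead of calling makeCluster()
## with no node count (which falls back to mpi.comm.size() and fails)
library(doSNOW)
library(Rmpi)
library(rlecuyer)

n  <- mpi.universe.size()          # slots granted by mpirun/bsub
cl <- makeCluster(n, type = "MPI") # explicit cluster size

clusterSetupRNG(cl, seed = rep(1, 6))
registerDoSNOW(cl)

x <- foreach(i = 1:3) %dopar% sqrt(i)
print(x)

stopCluster(cl)
mpi.quit()                         # finalize MPI before exiting
```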
-------------------------------------------------------------------------- mpirun has exited due to process rank 0 with PID 9530 on node a6204.hpc-net.ethz.ch exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). --------------------------------------------------------------------------
## ==== output (3) end ====
(4) Fourth trial
## ==== output (4) start ====
Sender: LSF System <lsfadmin at a6278>
Subject: Job 193056: <mpirun -n 1 R --no-save -q -f m04.R> Exited
Job <mpirun -n 1 R --no-save -q -f m04.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6278>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:37 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m04.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 1.01 sec.
Max Memory : 3 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- makeMPIcluster() # create cluster object
Error in makeMPIcluster() : no nodes available.
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9778 on
node a6278.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
## ==== output (4) end ====
=> now (3) and (4) run and stop, but with errors.
(5) Fifth trial
## ==== output (5) start ====
Sender: LSF System <lsfadmin at a6244>
Subject: Job 193057: <mpirun -n 1 R --no-save -q -f m05.R> Exited
Job <mpirun -n 1 R --no-save -q -f m05.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6244>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:37 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m05.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 0.98 sec.
Max Memory : 4 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- getMPIcluster() # get the MPI cluster
clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
Error in checkCluster(cl) : not a valid cluster
Calls: clusterSetupRNG ... clusterSetupRNGstream -> clusterApply -> staticClusterApply -> checkCluster
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9571 on
node a6244.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
## ==== output (5) end ====
(6) Sixth trial
## ==== output (6) start ====
Sender: LSF System <lsfadmin at a6266>
Subject: Job 193058: <mpirun -n 1 R --no-save -q -f m06.R> Exited
Job <mpirun -n 1 R --no-save -q -f m06.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6266>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:41 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m06.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 3.69 sec.
Max Memory : 4 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- makeMPIcluster(3) # create cluster object with the given number of slaves
3 slaves are spawned successfully. 0 failed.
clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
[1] "RNGstream"
registerDoSNOW(cl) # register the cluster object with foreach
## start the work
x <- foreach(i = 1:3) %dopar% {
+ sqrt(i)
+ }
x
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
stopCluster(cl) # properly shut down the cluster
[1] 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 24975 on
node a6266.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
## ==== output (6) end ====

=> similar to (2)
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024
Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793