doSNOW + foreach = embarrassingly frustrating computation
On 12/21/2010 10:59 AM, Marius Hofert wrote:
Hi all,

Martin Morgan responded off-list and pointed out that I might have used the wrong bsub command. He suggested:

bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --vanilla -f minimal.R

Since my installed packages were not found (due to --no-environ being part of --vanilla), I used:
I would check that neither your nor the site's R environment file is doing anything unusual; I'm surprised that you need this set. More below...
bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --no-save -q -f minimal.R

Below, please find all the outputs [run under the same setup as before], with comments. It seems like (2) and (6) almost solve the problem. But what does this "finalize" mean?

Cheers,

Marius

(1) First trial (check if MPI runs): minimal example as given on http://math.acadiau.ca/ACMMaC/Rmpi/sample.html

## ==== output (1) start ====
Sender: LSF System <lsfadmin at a6231>
Subject: Job 192910: <mpirun -n 1 R --no-save -q -f m01.R> Done
Job <mpirun -n 1 R --no-save -q -f m01.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:44:03 2010
Results reported at Tue Dec 21 19:44:19 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m01.R
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time : 24.90 sec.
Max Memory : 3 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
## from http://math.acadiau.ca/ACMMaC/Rmpi/sample.html
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
+ library("Rmpi")
+ }
# Spawn as many slaves as possible
mpi.spawn.Rslaves()
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: a6231
slave1 (rank 1, comm 1) of size 5 is running on: a6231
slave2 (rank 2, comm 1) of size 5 is running on: a6231
slave3 (rank 3, comm 1) of size 5 is running on: a6231
slave4 (rank 4, comm 1) of size 5 is running on: a6231
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
+ if (is.loaded("mpi_initialize")){
+ if (mpi.comm.size(1) > 0){
+ print("Please use mpi.close.Rslaves() to close slaves.")
+ mpi.close.Rslaves()
+ }
+ print("Please use mpi.quit() to quit R")
+ .Call("mpi_finalize")
+ }
+ }
This part of the 'minimal' example doesn't seem minimal; I'd remove it, but follow its advice and conclude your scripts with

mpi.close.Rslaves()
mpi.quit()
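Put together, a minimal Rmpi script ending as suggested might look like the sketch below (this assumes the script is launched via something like 'mpirun -n 1 R --no-save -q -f minimal.R' on a host with Rmpi and a working MPI runtime; it is an illustration, not a tested submission script):

```r
## minimal.R -- sketch of an Rmpi script that shuts down cleanly
library(Rmpi)

mpi.spawn.Rslaves()          # spawn one slave per slot granted by mpirun

## have each slave identify itself
print(mpi.remote.exec(paste("I am", mpi.comm.rank(), "of", mpi.comm.size())))

mpi.close.Rslaves()          # shut the slaves down first
mpi.quit()                   # finalize MPI and exit R
```

Ending with mpi.quit() (rather than letting R exit on its own) is what lets mpirun see a proper MPI finalize.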
# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
$slave1
[1] "I am 1 of 5"

$slave2
[1] "I am 2 of 5"

$slave3
[1] "I am 3 of 5"

$slave4
[1] "I am 4 of 5"
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process. Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption. The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          a6231.hpc-net.ethz.ch (PID 8966)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[1] 1
mpi.quit()
## ==== output (1) end ====
=> now there is no error anymore (only the warning (?))
(2) Second trial
## ==== output (2) start ====
Sender: LSF System <lsfadmin at a6231>
Subject: Job 193052: <mpirun -n 1 R --no-save -q -f m02.R> Exited
Job <mpirun -n 1 R --no-save -q -f m02.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:39 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m02.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 7.20 sec.
Max Memory : 3 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- makeCluster(3, type = "MPI") # create cluster object with the given number of slaves
3 slaves are spawned successfully. 0 failed.
clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
[1] "RNGstream"
registerDoSNOW(cl) # register the cluster object with foreach
## start the work
x <- foreach(i = 1:3) %dopar% {
+ sqrt(i)
+ }
x
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
stopCluster(cl) # properly shut down the cluster
[1] 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9048 on
node a6231.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
here I think you are being told to end your script with

mpi.quit()
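In other words, trial (2) rewritten with that ending would look roughly like this (a sketch under the same assumptions as the original m02.R: doSNOW, Rmpi, and rlecuyer installed, launched under mpirun):

```r
## m02.R, ended with mpi.quit() so mpirun sees a proper finalize
library(doSNOW)
library(Rmpi)
library(rlecuyer)

cl <- makeCluster(3, type = "MPI")    # create cluster with 3 MPI slaves
clusterSetupRNG(cl, seed = rep(1, 6)) # L'Ecuyer RNG streams on the slaves
registerDoSNOW(cl)                    # register the cluster with foreach

x <- foreach(i = 1:3) %dopar% sqrt(i) # the parallel work
print(x)

stopCluster(cl)                       # shut the slaves down
mpi.quit()                            # finalize MPI instead of a plain exit
```

The only change from the script that produced output (2) is the final mpi.quit(), which should make the "exiting without calling finalize" message (and the exit code 1) go away.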
## ==== output (2) end ====
=> okay, a first ray of hope: the calculations were done. But why the "exit code 1"/finalize problem?
(3) Third trial
## ==== output (3) start ====
Sender: LSF System <lsfadmin at a6204>
Subject: Job 193053: <mpirun -n 1 R --no-save -q -f m03.R> Exited
Job <mpirun -n 1 R --no-save -q -f m03.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6204>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:36 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m03.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 0.93 sec.
Max Memory : 3 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- makeCluster() # create cluster object
Error in makeMPIcluster(spec, ...) : no nodes available.
Calls: makeCluster -> makeMPIcluster
Execution halted
here snow is determining the size of the cluster with mpi.comm.size() (which returns 0), whereas I think you want to do something like

n = mpi.universe.size()
cl = makeCluster(n, type="MPI")

and likewise below. In some cases mpi.universe.size() uses a system call to 'lamnodes', which will fail on systems without a lamnodes command; the cheap workaround is to create an executable file called lamnodes that does nothing and is on your PATH.

Martin
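A sketch of trial (3) sized that way (an illustration only; note that on some installations mpi.universe.size() counts the master's slot too, so you may need to subtract 1 to leave room for the master process):

```r
## m03.R, sized from the MPI universe instead of calling makeCluster()
## with no node count (which falls back to mpi.comm.size() and fails)
library(doSNOW)
library(Rmpi)
library(rlecuyer)

n  <- mpi.universe.size()          # slots granted by mpirun/bsub
cl <- makeCluster(n, type = "MPI") # explicit cluster size

clusterSetupRNG(cl, seed = rep(1, 6))
registerDoSNOW(cl)

x <- foreach(i = 1:3) %dopar% sqrt(i)
print(x)

stopCluster(cl)
mpi.quit()                         # finalize MPI before exiting
```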
-------------------------------------------------------------------------- mpirun has exited due to process rank 0 with PID 9530 on node a6204.hpc-net.ethz.ch exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here). --------------------------------------------------------------------------
## ==== output (3) end ====
(4) Fourth trial
## ==== output (4) start ====
Sender: LSF System <lsfadmin at a6278>
Subject: Job 193056: <mpirun -n 1 R --no-save -q -f m04.R> Exited
Job <mpirun -n 1 R --no-save -q -f m04.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6278>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:37 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m04.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 1.01 sec.
Max Memory : 3 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- makeMPIcluster() # create cluster object
Error in makeMPIcluster() : no nodes available.
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9778 on
node a6278.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
## ==== output (4) end ====
=> now (3) and (4) run and stop, but with errors.
(5) Fifth trial
## ==== output (5) start ====
Sender: LSF System <lsfadmin at a6244>
Subject: Job 193057: <mpirun -n 1 R --no-save -q -f m05.R> Exited
Job <mpirun -n 1 R --no-save -q -f m05.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6244>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:37 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m05.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 0.98 sec.
Max Memory : 4 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- getMPIcluster() # get the MPI cluster
clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
Error in checkCluster(cl) : not a valid cluster
Calls: clusterSetupRNG ... clusterSetupRNGstream -> clusterApply -> staticClusterApply -> checkCluster
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9571 on
node a6244.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
## ==== output (5) end ====
(6) Sixth trial
## ==== output (6) start ====
Sender: LSF System <lsfadmin at a6266>
Subject: Job 193058: <mpirun -n 1 R --no-save -q -f m06.R> Exited
Job <mpirun -n 1 R --no-save -q -f m06.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6266>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:41 2010
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m06.R
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 3.69 sec.
Max Memory : 4 MB
Max Swap : 29 MB
Max Processes : 1
Max Threads : 1
The output (if any) follows:
library(doSNOW)
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
library(Rmpi)
library(rlecuyer)
cl <- makeMPIcluster(3) # create cluster object with the given number of slaves
3 slaves are spawned successfully. 0 failed.
clusterSetupRNG(cl, seed = rep(1,6)) # initialize uniform rng streams in a SNOW cluster (L'Ecuyer)
[1] "RNGstream"
registerDoSNOW(cl) # register the cluster object with foreach
## start the work
x <- foreach(i = 1:3) %dopar% {
+ sqrt(i)
+ }
x
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
stopCluster(cl) # properly shut down the cluster
[1] 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 24975 on
node a6266.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
## ==== output (6) end ====

=> similar to (2)
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024
Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793