doSNOW + foreach = embarrassingly frustrating computation

3 messages · Marius Hofert, Martin Morgan

#
Hi all,

Martin Morgan responded off-list and pointed out that I might have used the wrong bsub-command. He suggested:
bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --vanilla -f minimal.R 
Since my installed packages were not found (due to --no-environ as part of --vanilla), I used:
bsub -n 4 -R "select[model==Opteron8380]" mpirun -n 1 R --no-save -q -f minimal.R 
Below, please find all the outputs [ran under the same setup as before], with comments.
It seems like (2) and (6) almost solve the problem. But what does this "finalize" message mean?

Cheers,

Marius


(1) First trial (check if MPI runs):

minimal example as given on http://math.acadiau.ca/ACMMaC/Rmpi/sample.html 

## ==== output (1) start ====

Sender: LSF System <lsfadmin at a6231>
Subject: Job 192910: <mpirun -n 1 R --no-save -q -f m01.R> Done

Job <mpirun -n 1 R --no-save -q -f m01.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:44:03 2010
Results reported at Tue Dec 21 19:44:19 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m01.R
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :     24.90 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:
+     library("Rmpi")
+     }
4 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 5 is running on: a6231 
slave1 (rank 1, comm 1) of size 5 is running on: a6231 
slave2 (rank 2, comm 1) of size 5 is running on: a6231 
slave3 (rank 3, comm 1) of size 5 is running on: a6231 
slave4 (rank 4, comm 1) of size 5 is running on: a6231
+     if (is.loaded("mpi_initialize")){
+         if (mpi.comm.size(1) > 0){
+             print("Please use mpi.close.Rslaves() to close slaves.")
+             mpi.close.Rslaves()
+         }
+         print("Please use mpi.quit() to quit R")
+         .Call("mpi_finalize")
+     }
+ }
$slave1
[1] "I am 1 of 5"

$slave2
[1] "I am 2 of 5"

$slave3
[1] "I am 3 of 5"

$slave4
[1] "I am 4 of 5"
--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          a6231.hpc-net.ethz.ch (PID 8966)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------
[1] 1
## ==== output (1) end ====

=> now there is no longer an error, only the fork() warning (?)

(2) Second trial 

## ==== output (2) start ====

Sender: LSF System <lsfadmin at a6231>
Subject: Job 193052: <mpirun -n 1 R --no-save -q -f m02.R> Exited

Job <mpirun -n 1 R --no-save -q -f m02.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6231>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:39 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m02.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      7.20 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
3 slaves are spawned successfully. 0 failed.
[1] "RNGstream"
+    sqrt(i)
+ }
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
[1] 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9048 on
node a6231.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (2) end ====

=> okay, a first glimmer of hope: the calculations completed. But why the exit code 1 and the "finalize" problem?

(3) Third trial 

## ==== output (3) start ====

Sender: LSF System <lsfadmin at a6204>
Subject: Job 193053: <mpirun -n 1 R --no-save -q -f m03.R> Exited

Job <mpirun -n 1 R --no-save -q -f m03.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6204>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:36 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m03.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      0.93 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
Error in makeMPIcluster(spec, ...) : no nodes available.
Calls: makeCluster -> makeMPIcluster
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9530 on
node a6204.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (3) end ====

(4) Fourth trial 

## ==== output (4) start ====

Sender: LSF System <lsfadmin at a6278>
Subject: Job 193056: <mpirun -n 1 R --no-save -q -f m04.R> Exited

Job <mpirun -n 1 R --no-save -q -f m04.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6278>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:37 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m04.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      1.01 sec.
    Max Memory :         3 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
Error in makeMPIcluster() : no nodes available.
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9778 on
node a6278.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (4) end ====

=> (3) and (4) now run and terminate, but with errors.

(5) Fifth trial 

## ==== output (5) start ====

Sender: LSF System <lsfadmin at a6244>
Subject: Job 193057: <mpirun -n 1 R --no-save -q -f m05.R> Exited

Job <mpirun -n 1 R --no-save -q -f m05.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6244>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:37 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m05.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      0.98 sec.
    Max Memory :         4 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
Error in checkCluster(cl) : not a valid cluster
Calls: clusterSetupRNG ... clusterSetupRNGstream -> clusterApply -> staticClusterApply -> checkCluster
Execution halted
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 9571 on
node a6244.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (5) end ====

(6) Sixth trial

## ==== output (6) start ====

Sender: LSF System <lsfadmin at a6266>
Subject: Job 193058: <mpirun -n 1 R --no-save -q -f m06.R> Exited

Job <mpirun -n 1 R --no-save -q -f m06.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6266>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 19:49:28 2010
Results reported at Tue Dec 21 19:49:41 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m06.R
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      3.69 sec.
    Max Memory :         4 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
3 slaves are spawned successfully. 0 failed.
[1] "RNGstream"
+    sqrt(i)
+ }
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
[1] 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 24975 on
node a6266.hpc-net.ethz.ch exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

## ==== output (6) end ====

=> similar to (2)
#
On 12/21/2010 10:59 AM, Marius Hofert wrote:
I would confirm that your or the site's R environment file is not doing
anything unusual; I'm surprised that you need this set. More below...
This part of the 'minimal' example doesn't seem minimal, I'd remove it,
but follow its advice and conclude your scripts with

  mpi.close.Rslaves()
  mpi.quit()
here I think you are being told to end your script with

  mpi.quit()
here snow is determining the size of the cluster with mpi.comm.size()
(which returns 0) whereas I think you want to do something like

   n = mpi.universe.size()
   cl = makeCluster(n, type="MPI")

likewise below. In some cases mpi.universe.size() uses a system call to
'lamnodes', which will fail on systems without a lamnodes command; the
cheap workaround is to create an executable file called lamnodes that
does nothing and is on your PATH.
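
Putting the pieces of this advice together, a minimal snow-over-MPI script might look like the following. This is only a sketch (not run on this cluster) assuming Rmpi, snow, and rlecuyer are installed; it sizes the cluster with mpi.universe.size() as suggested above and ends with mpi.quit() so MPI_Finalize is called:

```r
## Minimal snow/MPI sketch following the advice above (untested here;
## assumes Rmpi, snow, and rlecuyer are installed on the cluster).
library(Rmpi)
library(snow)

n <- mpi.universe.size()             # number of available MPI slots
cl <- makeCluster(n, type = "MPI")   # spawn n R slaves via Rmpi

clusterSetupRNG(cl, seed = rep(1, 6))  # independent L'Ecuyer RNG streams

res <- clusterApply(cl, 1:3, sqrt)   # trivial example computation
print(res)

stopCluster(cl)  # shut the slaves down cleanly
mpi.quit()       # calls MPI_Finalize and quits R
```

Ending with mpi.quit() rather than a plain q() is what should make the "exiting without calling finalize" message from mpirun go away.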

Martin

#
Okay, I ran (2) and (3) again, now with the lines

mpi.close.Rslaves()
mpi.quit()

(as suggested) at the end. Both programs stopped with:
Error in mpi.close.Rslaves() : It seems no slaves running on comm 1
Execution halted

I therefore ran (2) and (3) again, but ending only with
mpi.quit()
Below is the output.

So it seems to work! I have the following remaining questions:

(1) With "n <- mpi.universe.size()", program (3) elegantly uses all available CPUs (if I understand this command correctly). Can this always be used, or should one rather think: "I need 3 workers, so I should use makeCluster(3, type = "MPI")"? Obviously, with "n <- mpi.universe.size()" one is not required to think about the number of workers at all.

(2) Brian Peterson pointed out that doMPI might be a better choice. I would like to have a minimal example similar to (3), but with doMPI. Unfortunately, I couldn't find an equivalent of snow's "clusterSetupRNG()" for doMPI. Do you know how to set up rlecuyer with doMPI? I tried the following, which (of course) does not work:

## snippet doMPI start ====

library(doMPI) 
library(foreach)
library(rlecuyer)

cl <- startMPIcluster()
clusterSetupRNG(cl, seed = rep(1,6)) # => only works with doSNOW (otherwise, you'll get 'Error: could not find function "clusterSetupRNG"')
registerDoMPI(cl) # register the cluster object with foreach
## start the work
x <- foreach(i = 1:3) %dopar% { 
   sqrt(i)
}
x 
stopCluster(cl) # properly shut down the cluster
mpi.quit()

## snippet doMPI end ====
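
For what it's worth, doMPI appears to provide its own parallel-RNG mechanism instead of snow's clusterSetupRNG(): according to the doMPI documentation, a seed passed via the .options.mpi argument of foreach() sets up independent L'Ecuyer-type random number streams on the workers. A hedged sketch of what a doMPI analogue of (3) might look like (untested here; the .options.mpi = list(seed = ...) option and closeCluster() are taken from the doMPI docs, not verified on this cluster):

```r
## doMPI analogue of example (3); a sketch based on the doMPI
## documentation, not verified on this cluster.
library(doMPI)

cl <- startMPIcluster()  # by default uses the available MPI slots
registerDoMPI(cl)        # register the cluster with foreach

## supplying a seed via .options.mpi makes doMPI give each worker
## its own independent L'Ecuyer-type RNG stream:
x <- foreach(i = 1:3, .options.mpi = list(seed = 1)) %dopar% {
    sqrt(i)
}
print(x)

closeCluster(cl)  # doMPI's own shutdown function (not stopCluster)
mpi.quit()        # call MPI_Finalize and quit R
```

Note that doMPI clusters are shut down with closeCluster() rather than snow's stopCluster(), and the script again ends with mpi.quit() to avoid the "finalize" complaint from mpirun.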

Cheers,

Marius

## === output of (2) start === 

Sender: LSF System <lsfadmin at a6169>
Subject: Job 195661: <mpirun -n 1 R --no-save -q -f m02.R> Done

Job <mpirun -n 1 R --no-save -q -f m02.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6169>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 21:15:16 2010
Results reported at Tue Dec 21 21:15:27 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m02.R
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :      6.41 sec.
    Max Memory :         4 MB
    Max Swap   :        29 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) follows:
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
3 slaves are spawned successfully. 0 failed.
[1] "RNGstream"
+    sqrt(i)
+ }
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
[1] 1
## === output of (2) end === 

## === output of (3) start ===

Sender: LSF System <lsfadmin at a6169>
Subject: Job 195663: <mpirun -n 1 R --no-save -q -f m03.R> Done

Job <mpirun -n 1 R --no-save -q -f m03.R> was submitted from host <brutus3> by user <hofertj> in cluster <brutus>.
Job was executed on host(s) <4*a6169>, in queue <pub.1h>, as user <hofertj> in cluster <brutus>.
</cluster/home/math/hofertj> was used as the home directory.
</cluster/home/math/hofertj> was used as the working directory.
Started at Tue Dec 21 21:15:16 2010
Results reported at Tue Dec 21 21:15:31 2010

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
mpirun -n 1 R --no-save -q -f m03.R
------------------------------------------------------------

Successfully completed.

Resource usage summary:

    CPU time   :     24.67 sec.
    Max Memory :       297 MB
    Max Swap   :      1934 MB

    Max Processes  :         8
    Max Threads    :        19

The output (if any) follows:
Loading required package: foreach
Loading required package: iterators
Loading required package: codetools
Loading required package: snow
4 slaves are spawned successfully. 0 failed.
[1] "RNGstream"
+    sqrt(i)
+ }
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051
[1] 1
## === output of (3) end ===