R <--> TM <--> Snow <--> Rmpi <--> OpenMPI cluster cleanup

4 messages · Mark Mueller, Dirk Eddelbuettel, Ross Boylan

#
PROBLEM DEFINITION --

Host environment:

- AMD_64, 4xCPU, quad core
- Ubuntu 9.04 64-bit
- OpenMPI 1.3.2 (to avoid the problem in v1.3 where OpenMPI tries to connect
to the localhost via ssh to run local jobs) - manually downloaded source and
compiled
- Rmpi 0.5-7
- TM 0.4
- Snow 0.3-3
- R 2.9.0

When executing the following command on the host:

$ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R

the following results, yet the <some program>.R completes successfully:

"mpirun has exited due to process rank 0 with PID [some pid] on node
[node name here] exiting without calling "finalize". This may have
caused other processes in the application to be terminated by signals
sent by mpirun (as reported here)."

CONFIGURATION STEPS TAKEN --

- The hostfile does not create a situation where the system is
oversubscribed.  In this case, slots=4 and max-slots=5.

- The <some program>.R uses snow::activateCluster() and
snow::deactivateCluster() in the appropriate places.  There are no
other code elements that control MPI in the <some program>.R file.

I suspect that, since the R + TM program completes successfully,
something in the Rmpi/Snow/OpenMPI layer is not cleaning up the MPI
environment properly.  This is problematic because any
shell scripts that issue the mpirun directive will capture an exit
status of 1 (i.e. an "error") from the mpirun command, yet there does
not seem to be anything present in the environment that would cause
mpirun (OpenMPI) to encounter an error condition.  This "clouds" the
successful exit status from the R CMD BATCH command.
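In shell terms the masking looks like this (a sketch; `false` merely stands in for an mpirun that exits 1 even though the R job it launched succeeded):

```shell
#!/bin/sh
# Sketch: the wrapper's exit status reflects mpirun itself, not the R job.
# 'false' stands in for an mpirun that exits 1 after a successful R run.
false
status=$?
echo "wrapper exit status: $status"   # prints 1, hiding the R job's success
```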

Are there any known aspects of these packages that have not fully
implemented a complete cleanup routine for MPI implementations using
OpenMPI?

Any insight or assistance will be greatly appreciated.

Sincerely,
Mark
#
Mark,
On 25 August 2009 at 20:50, Mark Mueller wrote:
| PROBLEM DEFINITION --
| 
| Host environment:
| 
| - AMD_64, 4xCPU, quad core
| - Ubuntu 9.04 64-bit
| - OpenMPI 1.3.2 (to avoid the problem in v1.3 where OpenMPI tries to connect
| to the localhost via ssh to run local jobs) - manually downloaded source and
| compiled
| - Rmpi 0.5-7
| - TM 0.4
| - Snow 0.3-3
| - R 2.9.0
| 
| When executing the following command on the host:
| 
| $ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R
| 
| the following results, yet the <some program>.R completes successfully:
| 
| "mpirun has exited due to process rank 0 with PID [some pid] on node
| [node name here] exiting without calling "finalize". This may have
| caused other processes in the application to be terminated by signals
| sent by mpirun (as reported here)."

As I recall, something changed between OpenMPI 1.2.* and 1.3.* so that it
now prefers jobs to end with a call to mpi.quit().  Witness this quick example:

   edd at ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
   Hello, rank 1 size 2 on ron
   --------------------------------------------------------------------------
   mpirun has exited due to process rank 1 with PID 19867 on
   node ron exiting without calling "finalize". This may
   have caused other processes in the application to be
   terminated by signals sent by mpirun (as reported here).
   --------------------------------------------------------------------------
   Hello, rank 0 size 2 on ron
   
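For reference, a minimal sketch of what such an mpiHelloWorld.r might contain (the exact script is not shown in the thread; the Rmpi calls are the standard ones):

```r
#!/usr/bin/env r
## Minimal Rmpi hello-world sketch -- note there is no mpi.quit() yet
library(Rmpi)                         # loading Rmpi initializes MPI
cat("Hello, rank", mpi.comm.rank(0),
    "size", mpi.comm.size(0),
    "on", mpi.get.processor.name(), "\n")
## without mpi.quit() here, mpirun complains about a missing "finalize"
```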
But if I put mpi.quit() as the last instruction, all is well:
   
   edd at ron:/tmp$ echo "mpi.quit()" >> mpiHelloWorld.r
   edd at ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
   Hello, rank 0 size 2 Hello, rankon ron
    1 size 2 on ron
   edd at ron:/tmp$

As an aside, you may like using littler (sudo apt-get install littler) or
Rscript for your scripts instead of the old-school R CMD BATCH.

| CONFIGURATION STEPS TAKEN --
| 
| - The hostfile does not create a situation where the system is
| oversubscribed.  In this case, slots=4 and max-slots=5.
| 
| - The <some program>.R uses snow::activateCluster() and
| snow::deactivateCluster() in the appropriate places.  There are no
| other code elements that control MPI in the <some program>.R file.
| 
| I suspect that, since the R + TM program completes successfully,
| something in the Rmpi/Snow/OpenMPI layer is not cleaning up the MPI
| environment properly.  This is problematic because any

Good diagnosis -- you almost got to mpi.quit()!

As an aside, I really like running a simple helloWorld.r program just to
ensure that the setup is right.  Small and simple, easier to analyse.

| shell scripts that issue the mpirun directive will capture an exit
| status of 1 (i.e. an "error") from the mpirun command, yet there does
| not seem to be anything present in the environment that would cause
| mpirun (OpenMPI) to encounter an error condition.  This "clouds" the
| successful exit status from the R CMD BATCH command.
| 
| Are there any known aspects of these packages that have not fully
| implemented a complete cleanup routine for MPI implementations using
| OpenMPI?

I can't tell whether tm needs that or whether your calling script needs it --
but try adding the mpi.quit() and see if that helps.
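In script terms, the suggestion amounts to something like this at the end of <some program>.R (a sketch; deactivateCluster() is the TM call mentioned above):

```r
## end of <some program>.R (sketch)
deactivateCluster()   # TM: shuts down the snow cluster workers
mpi.quit()            # Rmpi: calls MPI_Finalize, then quits R cleanly
```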

Cheers, Dirk

| Any insight or assistance will be greatly appreciated.
| 
| Sincerely,
| Mark
| 
| _______________________________________________
| R-sig-hpc mailing list
| R-sig-hpc at r-project.org
| https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
#
On Tue, 2009-08-25 at 20:50 -0500, Mark Mueller wrote:
FWIW, I use stopCluster(getMPIcluster()) on Debian Lenny (OpenMPI 1.2)
and that seems to work.  I have a feeling that might be an rmpi command
rather than a snow command, even though it's a snow session; maybe I
should shift to deactivateCluster.  On the other hand, maybe
deactivateCluster() doesn't quite shut down.

The system I'm using for this is inaccessible right now, and so I can't
easily check the details.

Ross
#
The deactivateCluster() function in the TM package essentially calls
stopCluster(getMPIcluster()) from the SNOW package.

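So at the SNOW level the full cleanup would be roughly this (a sketch; getMPIcluster() and stopCluster() are snow functions, mpi.quit() comes from Rmpi):

```r
library(snow)
cl <- getMPIcluster()               # the cluster snow registered over Rmpi
if (!is.null(cl)) stopCluster(cl)   # what deactivateCluster() boils down to
Rmpi::mpi.quit()                    # the "finalize" step mpirun waits for
```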
Does anyone know whether the authors of the SNOW and Rmpi packages are
on this list?
On Tue, Aug 25, 2009 at 11:29 PM, Ross Boylan<ross at biostat.ucsf.edu> wrote: