R <--> TM <--> Snow <--> Rmpi <--> OpenMPI cluster cleanup
4 messages · Mark Mueller, Dirk Eddelbuettel, Ross Boylan

PROBLEM DEFINITION --

Host environment:

- AMD_64, 4xCPU, quad core
- Ubuntu 9.04 64-bit
- OpenMPI 1.3.2 (to avoid the problem in v1.3 where OpenMPI tries to connect to the localhost via ssh to run local jobs) - manually downloaded source and compiled
- Rmpi 0.5-7
- TM 0.4
- Snow 0.3-3
- R 2.9.0

When executing the following command on the host:

$ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R

the following results, yet the <some program>.R completes successfully:

"mpirun has exited due to process rank 0 with PID [some pid] on node [node name here] exiting without calling "finalize". This may have caused other processes in the application to be terminated by signals sent by mpirun (as reported here)."

CONFIGURATION STEPS TAKEN --

- The hostfile does not create a situation where the system is oversubscribed. In this case, slots=4 and max-slots=5.
- The <some program>.R uses snow::activateCluster() and snow::deactivateCluster() in the appropriate places. There are no other code elements that control MPI in the <some program>.R file.

I am suspicious that since the R + TM program completes successfully, there is something in the Rmpi/Snow/OpenMPI layer that is not cleaning up the MPI environment properly. This is problematic because any shell scripts that issue the mpirun directive will capture an exit status of 1 (i.e. an "error") from the mpirun command, yet there does not seem to be anything present in the environment that would cause mpirun (OpenMPI) to encounter an error condition. This "clouds" the successful exit status from the R CMD BATCH command.

Are there any known aspects of these packages that have not fully implemented a complete cleanup routine for MPI implementations using OpenMPI?

Any insight or assistance will be greatly appreciated.

Sincerely,
Mark
Mark,
On 25 August 2009 at 20:50, Mark Mueller wrote:
| PROBLEM DEFINITION --
|
| Host environment:
|
| - AMD_64, 4xCPU, quad core
| - Ubuntu 9.04 64-bit
| - OpenMPI 1.3.2 (to avoid the problem in v1.3 where OpenMPI tries to connect
| to the localhost via ssh to run local jobs) - manually downloaded source and
| compiled
| - Rmpi 0.5-7
| - TM 0.4
| - Snow 0.3-3
| - R 2.9.0
|
| When executing the following command on the host:
|
| $ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R
|
| the following results, yet the <some program>.R completes successfully:
|
| "mpirun has exited due to process rank 0 with PID [some pid] on node
| [node name here] exiting without calling "finalize". This may have
| caused other processes in the application to be terminated by signals
| sent by mpirun (as reported here)."
As I recall, something changed between OpenMPI 1.2.x and 1.3.x so that it now
prefers jobs to end with a call to mpi.quit(). Witness this quick example:
edd at ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
Hello, rank 1 size 2 on ron
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 19867 on
node ron exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
Hello, rank 0 size 2 on ron
But if I put an mpi.quit() as last instruction in here, all is well:
edd at ron:/tmp$ echo "mpi.quit()" >> mpiHelloWorld.r
edd at ron:/tmp$ mpirun -n 2 ./mpiHelloWorld.r
Hello, rank 0 size 2 Hello, rankon ron
1 size 2 on ron
edd at ron:/tmp$
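The mpiHelloWorld.r script itself is not shown in the thread; a minimal sketch along those lines (a hypothetical reconstruction, assuming Rmpi is installed and the script is run through Rscript via its shebang line) might be:

```r
#!/usr/bin/env Rscript
## Minimal MPI hello-world sketch (a reconstruction for illustration,
## not the actual script from the transcript above).
library(Rmpi)

## Report this process's rank, the communicator size, and the host name.
cat(sprintf("Hello, rank %d size %d on %s\n",
            mpi.comm.rank(0), mpi.comm.size(0),
            Sys.info()[["nodename"]]))

## Finalize MPI cleanly so mpirun does not report the "exiting without
## calling finalize" error and returns exit status 0.
mpi.quit()
```

Without the final mpi.quit(), R exits before MPI_Finalize is called, which is exactly what triggers the mpirun complaint shown above.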
As an aside, you may like using littler (sudo apt-get install littler) or
Rscript for your scripts instead of the old-school R CMD BATCH.
| CONFIGURATION STEPS TAKEN --
|
| - The hostfile does not create a situation where the system is
| oversubscribed. In this case, slots=4 and max-slots=5.
|
| - The <some program>.R uses snow::activateCluster() and
| snow::deactivateCluster() in the appropriate places. There are no
| other code elements that control MPI in the <some program>.R file.
|
| I am suspicious that since the R + TM program completes successfully,
| there is something in the Rmpi/Snow/OpenMPI layer that is not cleaning
| up the MPI environment properly. This is problematic because any
Good diagnosis -- you almost got to mpi.quit()!
As an aside, I really like running simple helloWorld.r programs just to
ensure that the setup is right. Small and simple, easier to analyse.
| shell scripts that issue the mpirun directive will capture an exit
| status of 1 (i.e. an "error") from the mpirun command, yet there does
| not seem to be anything present in the environment that would cause
| mpirun (OpenMPI) to encounter an error condition. This "clouds" the
| successful exit status from the R CMD BATCH command.
|
| Are there any known aspects of these packages that have not fully
| implemented a complete cleanup routine for MPI implementations using
| OpenMPI?
I can't tell whether tm needs that or whether your calling script needs it --
but try adding the mpi.quit() and see if that helps.
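Concretely, that amounts to ending <some program>.R with an explicit MPI shutdown after the cluster is deactivated. A sketch of the suggested structure (the activateCluster()/deactivateCluster() calls are those described in the original post; the body of the script is a placeholder):

```r
## <some program>.R -- sketch of the suggested structure, assuming the
## tm 0.4 distributed-computing interface described in this thread.
library(tm)
library(Rmpi)

activateCluster()      # start the snow/Rmpi cluster

## ... the actual text-mining work goes here ...

deactivateCluster()    # stop the snow cluster workers
mpi.quit()             # finalize MPI so mpirun exits with status 0
```

With mpi.quit() as the last instruction, any wrapper shell script should then see the exit status it expects from mpirun.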
Cheers, Dirk
| Any insight or assistance will be greatly appreciated.
|
| Sincerely,
| Mark
|
| _______________________________________________
| R-sig-hpc mailing list
| R-sig-hpc at r-project.org
| https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
On Tue, 2009-08-25 at 20:50 -0500, Mark Mueller wrote:
FWIW, I use stopCluster(getMPIcluster()) on Debian Lenny (OpenMPI 1.2) and that seems to work. I have a feeling that might be an rmpi command rather than a snow command, even though it's a snow session; maybe I should shift to deactivateCluster. On the other hand, maybe deactivateCluster() doesn't quite shut down. The system I'm using for this is inaccessible right now, and so I can't easily check the details. Ross
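Written out explicitly, the shutdown sequence Ross describes looks like the following sketch (assuming the cluster was created through snow's MPI support, and adding the mpi.quit() discussed earlier in the thread; stopCluster() alone does not finalize MPI):

```r
## Shutdown sketch combining the suggestions in this thread.
library(snow)
library(Rmpi)

cl <- getMPIcluster()   # retrieve the running snow/MPI cluster, if any
if (!is.null(cl))
    stopCluster(cl)     # shut down the snow worker processes
mpi.quit()              # finalize MPI itself before R exits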
The deactivateCluster() function in the tm package essentially calls the stopCluster(getMPIcluster()) function in the snow package. Does anyone know if the authors of the snow and Rmpi packages are part of this list?
On Tue, Aug 25, 2009 at 11:29 PM, Ross Boylan<ross at biostat.ucsf.edu> wrote: