R <--> TM <--> Snow <--> Rmpi <--> OpenMPI cluster cleanup
On Tue, 2009-08-25 at 20:50 -0500, Mark Mueller wrote:
PROBLEM DEFINITION --

Host environment:
- AMD_64, 4xCPU, quad core
- Ubuntu 9.04 64-bit
- OpenMPI 1.3.2, manually downloaded and compiled from source (to avoid the problem in v1.3 where OpenMPI tries to connect to the localhost via ssh to run local jobs)
- Rmpi 0.5-7
- TM 0.4
- Snow 0.3-3
- R 2.9.0

When executing the following command on the host:

  $ mpirun --hostfile <some file> -np 1 R CMD BATCH <some program>.R

the following results, yet <some program>.R completes successfully:

  "mpirun has exited due to process rank 0 with PID [some pid] on node
  [node name here] exiting without calling "finalize". This may have
  caused other processes in the application to be terminated by signals
  sent by mpirun (as reported here)."

CONFIGURATION STEPS TAKEN --

- The hostfile does not create a situation where the system is oversubscribed; in this case, slots=4 and max-slots=5.
- <some program>.R uses snow::activateCluster() and snow::deactivateCluster() in the appropriate places. There are no other code elements that control MPI in <some program>.R.
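If I'm reading the setup right, a minimal <some program>.R along these lines exercises the same path; the makeMPIcluster()/clusterCall() calls and the worker count below are just a sketch standing in for whatever the real script does around activateCluster()/deactivateCluster():

  library(Rmpi)
  library(snow)

  ## spawn workers and build a snow cluster on top of them
  ## (worker count here is purely illustrative)
  cl <- makeMPIcluster(4)

  ## some parallel work, just to give the workers something to do
  res <- clusterCall(cl, function() Sys.info()[["nodename"]])
  print(res)

  ## tear the cluster down, then let Rmpi call MPI_Finalize before R
  ## exits; skipping the finalize step is one way to provoke the
  ## "exiting without calling finalize" warning from mpirun
  stopCluster(cl)
  mpi.quit(save = "no")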
FWIW, I use stopCluster(getMPIcluster()) on Debian Lenny (OpenMPI 1.2) and that seems to work. I have a feeling that might be an Rmpi command rather than a snow command, even though it's a snow session; maybe I should shift to deactivateCluster(). On the other hand, maybe deactivateCluster() doesn't quite shut things down. The system I'm using for this is inaccessible right now, so I can't easily check the details.

Ross
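P.S. The shutdown sequence I mean is roughly the sketch below, assuming the cluster was created in a way that registers it with snow (e.g. via makeMPIcluster() or the RMPISNOW wrapper) so that getMPIcluster() can find it; the mpi.quit() call is my guess at what keeps mpirun happy about finalize, and it's exactly the detail I can't verify at the moment:

  library(snow)
  library(Rmpi)

  cl <- getMPIcluster()              ## the cluster snow already knows about, if any
  if (!is.null(cl)) stopCluster(cl)  ## shut down the workers
  mpi.quit(save = "no")              ## run MPI_Finalize before R exits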