openmpi/rmpi/snow: current puzzles, possible improvements [diagnosis]
I think there were several things wrong.

1) I wasn't exporting R_PROFILE to the remote nodes.
2) R CMD BATCH's output file was the same file for all processes, given NFS.
3) The remote nodes did not have Rmpi installed!

3) is obviously crucial; I'm not sure how significant the other problems are. I diagnosed it by changing the output file to /tmp/foo and running only one job on each node.

Is there a good way to get unique file names per process on the command line? The only way I can think of is to determine the output file inside the batch script invoked by mpirun and using an env variable, if one is available (i.e., OpenMPI 1.3 or 1.2 in some scenarios).

My new invocation looks like this:

    R_PROFILE=/usr/lib/R/site-library/snow/RMPISNOWprofile; export R_PROFILE
    mpirun -np 2 -host n5,n7 -x R_PROFILE /usr/bin/R CMD BATCH silly.R

I think the R CMD BATCH will send output to stdout and mpi will redirect to the invoking terminal. Since I can't actually run because of 3), this is speculative.

Ross
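One possible shape for the wrapper-script idea is sketched below. It assumes Open MPI 1.3 or later, which sets OMPI_COMM_WORLD_RANK in each launched process; when that variable is absent the sketch falls back to the shell's process ID. The script name rbatch-wrapper.sh and the function name unique_outfile are made up for illustration. Note that when no output file is given, R CMD BATCH silly.R writes to silly.Rout in the working directory, so an explicit per-process name is what keeps processes on a shared NFS directory from colliding.

```shell
#!/bin/sh
# rbatch-wrapper.sh (hypothetical name): started by mpirun once per process.
# Builds a per-process output file name so that processes sharing an NFS
# directory do not overwrite each other's R CMD BATCH output.

# Print a file name of the form <base>.<host>.<id>.out
unique_outfile() {
    base=$1
    # Assumption: OMPI_COMM_WORLD_RANK is set (Open MPI 1.3 and later).
    # Otherwise fall back to the shell's process ID.
    id=${OMPI_COMM_WORLD_RANK:-pid$$}
    printf '%s.%s.%s.out\n' "$base" "$(hostname)" "$id"
}

# Launch R only when a script name is given, so the function above can
# also be sourced and tried on its own.
if [ "$#" -ge 1 ]; then
    script=$1
    exec R CMD BATCH "$script" "$(unique_outfile "${script%.R}")"
fi
```

It would be invoked along the lines of: mpirun -np 2 -host n5,n7 -x R_PROFILE ./rbatch-wrapper.sh silly.R (the wrapper has to be reachable at the same path on every node). Including the host name keeps the names distinct even when the process-ID fallback is used.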
On Wed, 2009-05-13 at 21:52 -0700, Ross Boylan wrote:
After reading through the thread around https://stat.ethz.ch/pipermail/r-sig-hpc/2009-February/000105.html, as well as looking at some other things, for ideas about running snow on top of Rmpi on Debian Lenny, I decided to try a shell script:
----------------------------------------------------------------
R_PROFILE=/usr/lib/R/site-library/snow/RMPISNOWprofile; export R_PROFILE
mpirun -np 6 -hostfile hosts R CMD BATCH snowjob.R snowjob.out
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
with this kind of snowjob.R:
-------------------------------------------------------------------
# This will only execute on the head node
cl <- getMPIcluster()
print(mpi.comm.rank(0))
quickinfo <- function() {
  list(rank=mpi.comm.rank(0), machine=Sys.info()) #system("hostname"))
}
print(clusterCall(cl, quickinfo))
stopCluster(cl)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and hosts file
-------------------
n7 slots=3
n5 slots=0 # changing this to 2 didn't help
n4 slots=4
^^^^^^^^^^^^^^^^^^^
I'm on n7.

Two problems.

First, the job shown never terminates. snowjob.out shows the standard R banner, a standard harmless complaint, and then nothing (technically it shows
  [n7:14829] OOB: Connection to HNP lost
but I assume that is after I ^C my shell script). I suspect the problem is that it's having trouble reaching the other nodes.

Second, if I have
  n7 slots=7
the job completes. It shows everything on n7. However, if I use machine=system("hostname") I get back null strings. system("hostname") works fine interactively. Perhaps this is some kind of quoting effect when system("hostname") is exported via clusterCall? Or system() doesn't work under Rmpi?

I'm also not sure why I am not running into a 3rd problem: it looks as if each process should be writing to the same file snowjob.out (via NFS mounts). That doesn't seem to be happening. Perhaps because the slave R's never make it out of the RMPISNOWprofile code?
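The suspicion that mpirun has trouble reaching the other nodes can be checked one host at a time before running the snow job. The sketch below is only an illustration: the script name check-nodes.sh and both function names are made up, and it assumes mpirun and Rscript are on the PATH of every node. For each host in the hostfile it starts a single R process there and prints the node name together with whether Rmpi appears among the installed packages.

```shell
#!/bin/sh
# check-nodes.sh (hypothetical name): probe each host in an Open MPI hostfile.

# Print the host names from a hostfile, skipping blank lines and comments.
hostfile_hosts() {
    sed -e 's/#.*//' "$1" | awk 'NF { print $1 }'
}

# For each host, start one R process there and report the node name and
# whether Rmpi is installed (TRUE or FALSE). MPI is not loaded inside R.
check_hosts() {
    for h in $(hostfile_hosts "$1"); do
        echo "== $h"
        mpirun -np 1 -host "$h" Rscript -e \
            'cat(Sys.info()[["nodename"]], "Rmpi" %in% rownames(installed.packages()), "\n")' \
            || echo "FAILED on $h"
    done
}

# Run the check only when a hostfile is given, so the functions can be
# sourced and tried on their own.
if [ "$#" -ge 1 ]; then
    check_hosts "$1"
fi
```

Running ./check-nodes.sh hosts would then print one section per host. A host that hangs or fails points to a launch or connectivity problem, while a FALSE points to a missing Rmpi on that node.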
If anyone has any thoughts or suggestions, I'd love to hear them.

Ross

P.S. The original problem is that, apparently, makeCluster(n, type="MPI") will not spawn jobs on other nodes--maybe even not more than one job spawned at all. So I'm attempting to bring up snow within an mpi session.

I did notice the docs on MPI_COMM_SPAWN
http://www.mpi-forum.org/docs/mpi21-report-bw/node202.htm#Node202
indicate there is an info argument which could contain system-dependent information. Presumably this could include a hostname; the standard explicitly leaves this to the implementation. I couldn't find anything on the openmpi implementation. I suppose the source would at least indicate what works now.

So, IF openmpi supports it, and if the interface is exposed through Rmpi (which does have mpi.info functions, which might be able to make the right arguments), there would be a possibility of handling this strictly within R.
Ross Boylan                                      wk:  (415) 514-8146
185 Berry St #5700                               ross at biostat.ucsf.edu
Dept of Epidemiology and Biostatistics           fax: (415) 514-8150
University of California, San Francisco
San Francisco, CA 94107-1739                     hm:  (415) 550-1062