difficulty spawning Rslaves
This might not provide any useful info, but just in case: when running Rmpi, a bunch of log files are temporarily created in the current working directory. Sometimes they contain a little more info than "process in local group is dead". And a couple of other checks:

1. After installing release 7.1.2, you of course recompiled Rmpi against the new version?

2. Before running R, do your usual lamboot routine and then:
2.1. lamexec C hostname
2.2. tping C N -c 2 (or any other number after -c)

3. Inside Rmpi, why do you use mpi.comm.free instead of just mpi.close.Rslaves? For me, for instance, the following works reliably:

library(Rmpi)
mpi.spawn.Rslaves(nslaves = 1)
mpi.close.Rslaves()
mpi.spawn.Rslaves(nslaves = 4)

Best,

R.
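A quick way to extend that spawn/close cycle into a sanity check is to ask each slave where it is running — this is a sketch, assuming Rmpi is installed, LAM has been booted with lamboot, and the session is run inside R on the master node:

```r
library(Rmpi)

# Spawn slaves, then ask each one for its hostname to confirm
# they landed on the expected nodes (not all on the master).
mpi.spawn.Rslaves(nslaves = 4)
mpi.remote.exec(Sys.info()["nodename"])

# Shut the slaves down cleanly before respawning or quitting;
# mpi.close.Rslaves() handles the communicator bookkeeping itself.
mpi.close.Rslaves()
mpi.quit()
```

If the hostnames all come back as the master node, the problem is in the LAM boot schema rather than in Rmpi.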
On Tue, Dec 29, 2009 at 3:37 PM, Allan Strand <stranda at cofc.edu> wrote:
Thanks Dirk and Ramon. I tried LAM 7.1.2 and am still seeing the same type of behavior. Still searching for a solution, and will report back. cheers, a. On 12/28/2009 12:28 PM, Ramon Diaz-Uriarte wrote:
More along Dirk's comments: we currently have two clusters using LAM, both Debian systems, one running LAM release 7.1.2 and the other 7.1.1. On a current Ubuntu-based laptop, things are working with release 7.1.2. Best, R. On Mon, Dec 28, 2009 at 5:14 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
Allan,

On 23 December 2009 at 16:05, Allan Strand wrote:
| My setup is on a cluster running 64bit FC. I have recently broken my
| install of Rmpi (and hence snow) by upgrading some very old versions of R,
| lam/mpi, Rmpi, and snow (currently installed versions listed at the
| bottom of this email). No doubt this is a problem with my Rmpi install,
| but I'm having trouble seeing it.
|
| I cannot seem to spawn more than a single slave (which is spawned on the
| master node), e.g.:
|
| > mpi.spawn.Rslaves(comm=1, nslaves=1)
|       1 slaves are spawned successfully. 0 failed.
| master (rank 0, comm 1) of size 2 is running on: node0
| slave1 (rank 1, comm 1) of size 2 is running on: node0
|
| > mpi.comm.free(comm=1)
| [1] 1
|
| > mpi.spawn.Rslaves(comm=1, nslaves=2)
|       2 slaves are spawned successfully. 0 failed.
| Error in mpi.intercomm.merge(intercomm, 0, comm) :
|    MPI_Error_string: process in local group is dead
|
| No doubt the answer is contained in the MPI_Error string, but I'm not
| sure how to interpret it.
|
| Thanks,
| Allan
| ===================================
| Versions (all installed locally in my account with directory-appropriate
| ./configure settings)
|
| R 2.10.1
| LAM 7.1.4/MPI 2 C++/ROMIO - Indiana University
  ^^^^^^^^^^^^^^^^^^^^^^^^^

For what it is worth, a looong time ago (two years? longer?) when I was helping Manuel get the Debian OpenMPI packages into Debian and was transitioning off LAM, I had concluded that the very latest 7.1.x releases of LAM were broken for me. The system was a then-current Ubuntu system with the LAM and OpenMPI packages compiled from Debian sources. Provided I froze LAM at 7.1.2, things would work; the newer ones would not.

So I'd recommend either downgrading to the last LAM that worked for you, or better yet taking the plunge and jumping to Open MPI. The 1.3.* series is already pretty solid, and 1.4.0 is just around the corner.

Just my $0.02. The problem may of course be entirely different.
Dirk

--
Three out of two people have difficulties with fractions.
_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
--
Allan Strand, Biology        http://linum.cofc.edu
College of Charleston        Ph. (843) 953-9189
Charleston, SC 29424         Fax (843) 953-9199
Ramon Diaz-Uriarte
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz
Phone: +34-91-732-8000 ext. 3019