Rmpi with PBSPro and OpenMPI
Thank you. I was using version 0.5-5. Upgrading to version
0.5-7 seems to have worked, mostly. If I try saving my workspace with
mpi.quit("yes"), I get the following:
[n026:09730] *** Process received signal ***
[n026:09730] Signal: Segmentation fault (11)
[n026:09730] Signal code: (128)
[n026:09730] Failing at address: (nil)
[n026:09730] [ 0] /lib64/tls/libc.so.6 [0x2a95c84500]
[n026:09730] [ 1] /lib64/ld-linux-x86-64.so.2 [0x2a9555d334]
[n026:09730] [ 2] /lib64/ld-linux-x86-64.so.2 [0x2a9555d724]
[n026:09730] [ 3] /lib64/ld-linux-x86-64.so.2 [0x2a9556119f]
[n026:09730] [ 4] /lib64/ld-linux-x86-64.so.2 [0x2a95560ef2]
[n026:09730] [ 5] /usr/lib64/libvapi.so(vipul_cleanup+0x50) [0x2a9965a4c0]
[n026:09730] *** End of error message ***
mpirun noticed that job rank 0 with PID 9730 on node n026c exited on
signal 11 (Segmentation fault).
Mark Lyman
-----Original Message-----
From: Hao Yu [mailto:hyu at stats.uwo.ca]
Sent: Tuesday, March 10, 2009 11:47 AM
To: Lyman, Mark
Cc: r-sig-hpc at r-project.org
Subject: Re: [R-sig-hpc] Rmpi with PBSPro and OpenMPI
Hi Mark,
What version of Rmpi are you using? Versions 0.5-5 and older had a
bug in Rprofile, but it has been fixed since 0.5-6.
.Last was never intended to be a way to close R slaves. It is only used when
someone doesn't close the R slaves and master properly. Here is what I
normally do:
{karl:58}orterun -n 4 R --no-save -q
master (rank 0, comm 1) of size 4 is running on: karl
slave1 (rank 1, comm 1) of size 4 is running on: karl
slave2 (rank 2, comm 1) of size 4 is running on: karl
slave3 (rank 3, comm 1) of size 4 is running on: karl
> # real code here ....
> mpi.close.Rslaves()
[1] 1
> mpi.quit()

Please note that the master and slaves are created from one communicator.
They live or die together, unlike spawning, where the master can live on
even if the slaves quit.

Hao
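
A minimal sketch of the shutdown order described above, written as the tail of a master-side script. It assumes Rmpi is already loaded and the slaves were started through the Rmpi .Rprofile, as in the orterun session. The explicit save.image() call and the file name are illustrative assumptions, not something from the thread; the idea is only that the workspace is written before MPI shuts down instead of through mpi.quit("yes"). Whether this avoids the segmentation fault reported above is untested.

```r
# Assumes Rmpi is loaded and the slaves were launched via the Rmpi .Rprofile
# (for example with: orterun -n 4 R --no-save -q).

# ... real code here ...

# Hypothetical file name: write the workspace explicitly rather than
# relying on mpi.quit("yes") to save it during shutdown.
save.image(file = "rmpi-workspace.RData")

# Close the slaves first, then finalize MPI and quit the master
# without asking R to save the workspace again.
mpi.close.Rslaves()
mpi.quit(save = "no")
```

Because the master and slaves share one communicator, the slaves are closed before the master quits, matching the order in the session above.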
Lyman, Mark wrote:
I just recently discovered this list and thought I would ask a question
about a mildly annoying issue. Generally, our setup works great;
however, I had to modify the .Last function in the .Rprofile file that
comes with Rmpi. The function now looks like this:
.Last <- function ()
{
    if (is.loaded("mpi_initialize")) {
        if (mpi.comm.size(1) > 1) {
            mpi.bcast.cmd(q("no"))
        }
    }
}
Without this modification, the R code is run successfully, but when
mpi.quit/mpi.exit/mpi.finalize are run, everything stops. It seems that
the slaves are not being shut down appropriately, and the master never
gets the signal it is waiting for that the slaves have shut down. Has
anyone else had this issue and solved it? Or does anyone know what could
be the cause?

I'm not sure, but I'm afraid that this is related to the following error
that I occasionally get from OpenMPI:

[n087:30298] [0,0,0] mca_oob_tcp_recv_handler: invalid message type: 0
[n039:29963] [0,1,65]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
[n087:30298] [0,0,0] mca_oob_tcp_recv_handler: invalid message type: 0
[n039:29962] [0,1,64]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
[n087:30298] [0,0,0] mca_oob_tcp_recv_handler: invalid message type: 0
[n039:29964] [0,1,66]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104
[n087:30298] [0,0,0] mca_oob_tcp_recv_handler: invalid message type: 0
[n039:29965] [0,1,67]-[0,0,0] mca_oob_tcp_peer_recv_blocking: recv() failed with errno=104

Usually, I am able to kill and retry the job and everything works fine,
but sometimes it can fail repeatedly.

Please let me know if any more information is needed. As you can see, I
am a statistician, and I am very new to HPC.

Mark Lyman, Statistician
Engineering Systems & Integration, ATK
(435) 863-2863

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to
say what the experiment died of.
Sir Ronald Aylmer Fisher
_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
Department of Statistics & Actuarial Sciences    Fax Phone#:(519)-661-3813
The University of Western Ontario                Office Phone#:(519)-661-3622
London, Ontario N6A 5B7
http://www.stats.uwo.ca/faculty/yu