apparent failure of mpi.isend.Robj (Rmpi)
I have a distributed application that includes 2 simulators and 1 assembler. simulators send results to the assembler. Except that they don't. Communication between these and other components mostly works, but not to the assembler (messages from master also don't reach the assembler). Thinking the assembler had some problem, I replaced it with a stub that simply prints a message when it gets a message. Standard scenario: no messages received by assembler. Scenario 2: I replaced the send function in one simulator with a defective function (that caused the process to fail when it tried to send). In this case, messages from the other simulator were received. (I didn't mean to have a broken function). Scenario 3: I captured the objects being sent (that was what the replacement send function was supposed to do) and then tried to send them manually. First I used mpi.send; that worked. Then I used mpi.isend; that failed. Then I did mpi.send and 2 messages were receivd by the assembler (because of some double echoing it's hard to be sure there really were 2, but I think so.) So it appeared as if mpi.send unstuck the previous message. Thinking there might be buffering going on, I did 20 isends. Nothing was received. Then I did a send, and I never got the command prompt back. The objects being sent are a bit involved (lists of lists of reference class instances), but small (the entire log capture of sends was only about 70k on disk). I suspected I might have overwritten the standard mpi.isend.Robj, but can find no evidence of that (it's not in the global namespace, when I print it the display says in namespace:Rmpi, and no active code reassigns it). I have also tried to isolate components outside of the distributed system, but they behave properly when I do so. At this point I think I'll try building a current mpi library and see if that helps. If anyone has any thoughts, I'm all ears. R 3.0.2, Debian squeeze, openmpi-bin 1.4.2-4 (I think that's the standard one with squeeze), Rmpi 0.6.3 built in my personal directory. Transcript: # r is the object the simulator was trying to send
mpi.send.Robj(r, 1, 4)
# ESS is echoing every command mpi.send.Robj(r, 1, 4) NULL # The "Fake Assembler" message is emited by the fake assembler # (at rank 1) when it receives a message. It prints sender, tag, class of object
Fake Assembler: 0 4 list
mpi.isend.Robj(r, 1, 4) mpi.isend.Robj(r, 1, 4) # nothing happened but I thought that might have been because # cursor was not on the command line. So I tried again.
mpi.isend.Robj(r, 1, 4)
mpi.isend.Robj(r, 1, 4) # still no sign message received # try to verify I'm using the real mpi.isend.Robj
mpi.isend.Robj
mpi.isend.Robj
function (obj, dest, tag, comm = 1, request = 0)
mpi.isend(x = serialize(obj, NULL), type = 4, dest = dest, tag = tag,
comm = comm, request = request)
<environment: namespace:Rmpi>
# now back to send
mpi.send.Robj(r, 1, 4)
mpi.send.Robj(r, 1, 4) Fake Assembler: 0 4 list NULL
Fake Assembler: 0 4 list
Fake Assembler: 0 4 list # apparently 2 or 3 messages received # try to fill buffer and force transmission
for (i in 1:20) mpi.isend.Robj(r, 1, 4)
for (i in 1:20) mpi.isend.Robj(r, 1, 4) # no joy. Try send to flush it out
mpi.send.Robj(r, 1, 4)
mpi.send.Robj(r, 1, 4)
# never came back
C-c C-corterun: killing job...
The child processes are called by distributing the following function to
them and then invoking it:
stemcell is the startup function to intialize the child processes.
Though it's far fromo self-contained, it might be illuminating. In
particular, there are some games with the sending and receiving
functions the simulators use. My guess/hope is that when the stemcell
function is exported the "mpi.isend.Robj" becomes a reference to the
function in the Rmpi namespace on the remote machine.
# to the appropriate type
# rSim <integer> ranks of the simulator processes
# rCoef <list>
# rCoef[[m]] is <integer> ranks of coefficient servers for model m <integer>
# rAssembler <integer> rank of Assembler (singleton) (the one not getting messages)
# iParams indices of parameter sets to analyze
stemcell <- function(rSim, rCoef, rAssembler, iParams) {
source("dbox-common.R")
nSet <- length(iParams)
nChunk <- nSet/dbox.option$chunk
r <- mpi.comm.rank()
#if (! r %in% unlist(rCoef)) return(0) # hack for testing
if (r %in% rSim) {
source("dbox-sim.R")
# temporary for debugging
recvfn <- mpi.recv.Robj
sendfn <- mpi.isend.Robj
if (r == 3L) {
recvfn <- makeLoggingReceive()
sendfn <- makeLoggingSend()
}
db <- dbox.sim(rCoef, rAssembler,
list(b=sim.b5.3.gen3, c=sim.c1.gen3, d=sim.d1.gen3),
totalEffect.b5.3,
nInner=10L,
maxExpectedLog = ceiling((2*nChunk+3.2*nSet)/length(rSim) + 15),
fakeReceive=recvfn,
fakeSend=sendfn)
} else if (r %in% rAssembler) {
source("fake-assembler.R")
# assume each result is communicated separately
recvfn <- makeLoggingReceive("recv-assembler.rdata")
db <- fake.assembler("/local/ross/KHC/sunbelt/testsim", ceiling(2.5*nSet),
fakeReceive=recvfn)
} else {
nlog <- function(servers) {
# return estimated log size for each coefficient server
# current design asks for each coefficient set separately
# since servers are allocated randomly to simulators
# we allow (via length(servers)-2) for some unevenness
ceiling(2.2*nSet/max(length(servers-2), 1) + 3)
}
source("dbox-coef.R")
if (r %in% rCoef[[1]])
# args are class source file, data file, coefficients
# we are now calling this b1 because that is the outcome variable
# but the files are called a1
db <- dbox.coef("simulator.b5.3.R", "s.b5.3", nlog(rCoef[[1]]))
else if (r %in% rCoef[[2]])
db <- dbox.coef("simulator.multinomial.R", "s.multi", nlog(rCoef[[2]]))
else if (r %in% rCoef[[3]])
db <- dbox.coef("simulator.d1.R", "s.d1", nlog(rCoef[[3]]))
else
stop(paste("Rank", r, "has no assigned task"))
}
db$go()
}
When dbox.master initializes it does
mpi.bcast.Robj2slave(stemcell)
mpi.bcast.cmd(stemcell, rSim=mySim, rCoef=myCoef, rAssembler=myAssembler, iParams=myParams)
Ross Boyla