
Error installing Rmpi over OpenMPI: Cannot find orted

Hello again list, thanks for your replies.

I've reinstalled OMPI and Rmpi as you suggested
(sudo apt-get install openmpi-bin r-cran-rmpi). I've also installed
openmpi-common and libopenmpi-dev so that OMPI works properly again for C
and Python.

Unfortunately, Rmpi isn't working yet. I've tried different PBS scripts and
R test files, but I'm not sure what I'm doing wrong.

This is my PBS script:
---
#!/bin/bash
#PBS -N R_test
#PBS -l nodes=laicbio:ppn=32+laicbio1:ppn=12+laicbio2:ppn=12+laicbio3:ppn=12+la$
cd $PBS_O_WORKDIR
Rscript --no-save test.R
---
This is the test.R file (found online):
---
# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
        library("Rmpi")
}
# Spawn as many slaves as possible
mpi.spawn.Rslaves()
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function() {
        if (is.loaded("mpi_initialize")) {
                if (mpi.comm.size(1) > 0) {
                        print("Please use mpi.close.Rslaves() to close slaves.")
                        mpi.close.Rslaves()
                }
                print("Please use mpi.quit() to quit R")
                .Call("mpi_finalize")
        }
}
# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()
mpi.quit()
---

It's giving me the following errors:
---
$ cat R_test.e98
[laicbio:67788] [[32125,0],0] ORTE_ERROR_LOG: Not found in file routed_binomial.c at line 386
[laicbio:67788] [[32125,0],0] ORTE_ERROR_LOG: A message is attempting to be sent to a process whose contact information is unknown in file rml_oob_send.c at line 104
[laicbio:67788] [[32125,0],0] could not get route to [[32125,2],0]
---
And the following output:
---
$ cat R_test.o98
    1 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 2 is running on: laicbio
slave1 (rank 1, comm 1) of size 2 is running on: laicbio
$slave1
[1] "I am 1 of 2"

[1] 1
---

If I add mpiexec before Rscript in the PBS script, the job never finishes
and I get lots of empty log files with names like
laicbio3.9740+1.10076.log (laicbio3 is one of the worker nodes).
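
For reference, the mpiexec variant looks roughly like the sketch below. The
"-n 1" flag and the PATH guard are assumptions added for illustration and
are not part of the script above. Since test.R spawns its own slaves with
mpi.spawn.Rslaves(), starting a single master process is the usual pattern;
a bare mpiexec may, depending on how Open MPI was built, start more than one
R master instead.

```shell
#!/bin/bash
#PBS -N R_test
# (same "#PBS -l nodes=..." resource request as in the script above)

# Fall back to the current directory when run outside a PBS job.
cd "${PBS_O_WORKDIR:-.}" || exit 1

if command -v mpiexec >/dev/null 2>&1; then
    # Assumption: launch exactly one R master; Rmpi spawns the slaves itself.
    mpiexec -n 1 Rscript --no-save test.R
else
    echo "mpiexec not found on PATH" >&2
fi
```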

Could you suggest a way to test this so I can track the problem down?
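
Some basic checks that could be run from inside the same PBS job, as a
sketch only: list the hosts PBS actually allocated and confirm that mpiexec,
orted and Rscript are on the PATH of the batch shell. The function names
summarise_nodefile and check_on_path are made up for this example;
PBS_NODEFILE is the standard Torque/PBS variable pointing at the file of
allocated hosts.

```shell
#!/bin/bash

# Print one line per host in a PBS node file, with its slot count.
summarise_nodefile() {
    sort "$1" | uniq -c | awk '{ print $2 ": " $1 " slot(s)" }'
}

# Report where a program lives on PATH, or that it is missing.
check_on_path() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: $(command -v "$1")"
    else
        echo "$1: NOT FOUND on PATH"
    fi
}

if [ -n "${PBS_NODEFILE:-}" ] && [ -r "$PBS_NODEFILE" ]; then
    echo "Hosts allocated by PBS:"
    summarise_nodefile "$PBS_NODEFILE"
else
    echo "PBS_NODEFILE is not set (not running inside a PBS job?)"
fi

for prog in mpiexec orted Rscript; do
    check_on_path "$prog"
done
```

If orted shows up as missing in the batch environment, that would be
consistent with the error in the subject line. If only laicbio appears in
the host list, the other nodes are not reaching the job at all.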

Thanks again.
Alejandro

2014-11-08 10:59 GMT-06:00 Dirk Eddelbuettel <edd at debian.org>: