Dear list,

I am trying to launch an Rmpi job under an SGE queueing system and I am seeing unwanted behaviour: up to a certain number of requested slots the procedure works fine, but beyond that number I get a segmentation fault in Rslaves.sh while the slaves are being spawned.

More specifically, I am using a Rocks Linux cluster with the SGE queueing system. Take the classical example:

  http://math.acadiau.ca/ACMMaC/Rmpi/sample.html

I put it in a file called provaMPI.R, add the first line

  #!/usr/bin/Rscript

and change the spawn call to

  mpi.spawn.Rslaves(20)

Then I write the following bash script (script.sh) to submit with qsub:

  #!/bin/sh
  # Run using bash
  #$ -S /bin/bash
  #$ -N provaMPI.R
  #$ -pe mpi 21
  #$ -cwd
  /opt/openmpi/bin/orterun -np 1 provaMPI.R

and finally submit it:

  shell$ qsub script.sh

The cluster is set up to run up to 12 processes on each node. I expect the queueing system to fill these slots greedily on the selected nodes, since that is the allocation policy ($fill_up) in all three parallel environments configured on the cluster (mpi, mpich, lam). This is indeed what happens if I use, for example, 12 slaves and request -pe mpi 13: the run executes fine and all output is as expected. But with 20 slaves I get the segmentation fault.

I report the error messages below, together with the three parallel environment configurations. I experience the same problem with all of them (with lam, the executables lamhalt and lamboot are additionally not found). I have not been able to establish a link between the number of slots requested and the crash. I did see one successful run when requesting 12 slaves and 13 slots, but I find it hard to reproduce. Among the successful runs I also observed cases in which slots were allocated on more than one node. Overall the behaviour seems quite random, with a strong bias towards the unsuccessful cases :(

Any suggestion on how to resolve this issue would be very much appreciated.
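(For reference, a sketch of an alternative submit script that derives the slave count from the slots SGE actually granted rather than hard-coding 20. $NSLOTS and $PE_HOSTFILE are set by SGE inside the job; the hand-off to the R script at the end is only an illustration, left commented out, and assumes provaMPI.R would read the count via commandArgs().)

```shell
#!/bin/sh
#$ -S /bin/bash
#$ -N provaMPI.R
#$ -pe mpi 21
#$ -cwd

# NSLOTS is set by SGE inside the job: the number of slots actually granted.
# Reserve one slot for the master and spawn the rest as slaves (default to 2
# slots here so the script can also be exercised outside SGE).
nslaves=$(( ${NSLOTS:-2} - 1 ))
echo "spawning $nslaves slaves"

# PE_HOSTFILE points at the granted host list; counting repeated hostnames
# shows whether $fill_up really packed the slots onto as few nodes as possible.
if [ -n "${PE_HOSTFILE:-}" ]; then
    sort "$PE_HOSTFILE" | uniq -c
fi

# The count could then be handed to the R script instead of the hard-coded
# mpi.spawn.Rslaves(20):
# /opt/openmpi/bin/orterun -np 1 ./provaMPI.R "$nslaves"
```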
Thank you,
Best regards,

Marco

$ qconf -sp lam
pe_name            lam
slots              128
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startlam.sh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stoplam.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

[stuetzle at submit-1-0 ACOTSP]$ qconf -sp mpi
pe_name            mpi
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

[stuetzle at submit-1-0 ACOTSP]$ qconf -sp mpich
pe_name            mpich
slots              9999
user_lists         NONE
xuser_lists        NONE
start_proc_args    /opt/gridengine/mpi/startmpi.sh -catch_rsh $pe_hostfile
stop_proc_args     /opt/gridengine/mpi/stopmpi.sh
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

In provaMPI.R.######:

/usr/lib/R/library/Rmpi/Rslaves.sh: line 20: 19442 Segmentation fault (core dumped)
$R_HOME/bin/R --no-init-file --slave --no-save < $1 > $hn.$2.$$.log 2>&1
--------------------------------------------------------------------------
orterun has exited due to process rank 8 with PID 19435 on node
compute-1-2.local exiting without calling "finalize". This may have caused
other processes in the application to be terminated by signals sent by
orterun (as reported here).
--------------------------------------------------------------------------
rm: cannot remove `/tmp/2840522.1.medium1/rsh': No such file or directory

In /opt/gridengine/default/spool/compute-1-14/active_jobs/2840522.1/pe_hostfile:

compute-1-14
compute-1-14
compute-1-14
compute-1-14
compute-1-14

In the log files named after each node:
*** caught segfault ***
address 0x9ff44d4, cause 'memory not mapped'

Traceback:
 1: .Call("mpi_initialize", PACKAGE = "Rmpi")
 2: f(libname, pkgname)
 3: firstlib(which.lib.loc, package)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: tryCatchList(expr, classes, parentenv, handlers)
 7: tryCatch(expr, error = function(e) {
        call <- conditionCall(e)
        if (!is.null(call)) {
            if (identical(call[[1L]], quote(doTryCatch)))
                call <- sys.call(-4L)
            dcall <- deparse(call)[1L]
            prefix <- paste("Error in", dcall, ": ")
            LONG <- 75L
            msg <- conditionMessage(e)
            sm <- strsplit(msg, "\n")[[1L]]
            w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")
            if (is.na(w))
                w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L], type = "b")
            if (w > LONG)
                prefix <- paste(prefix, "\n  ", sep = "")
        }
        else prefix <- "Error : "
        msg <- paste(prefix, conditionMessage(e), "\n", sep = "")
        .Internal(seterrmessage(msg[1L]))
        if (!silent && identical(getOption("show.error.messages"), TRUE)) {
            cat(msg, file = stderr())
            .Internal(printDeferredWarnings())
        }
        invisible(structure(msg, class = "try-error"))
    })
 8: try(firstlib(which.lib.loc, package))
 9: library(Rmpi, logical.return = TRUE)
aborting ...

Then on all other nodes:

Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal
Error in f(libname, pkgname) : ignoring SIGPIPE signal

--
Marco Chiarandini, PhD
Department of Mathematics and Computer Science,
University of Southern Denmark
Campusvej 55, DK-5230 Odense M, Denmark
marco at imada.sdu.dk, http://www.imada.sdu.dk/~marco
Phone: +45 6550 4031, Fax: +45 6550 2325
Subject: Problems with Rmpi, openMPI and SGE in Rocks Linux