
simple question on R/Rmpi/snow/slurm configuration

9 messages · Martin Morgan, Whit Armstrong, Dirk Eddelbuettel +2 more

#
I'm attempting to get Dirk's example from the "intro to HPC with R"
talk working (http://dirk.eddelbuettel.com/papers/bocDec2008introHPCwithR.pdf).

I have slurm working correctly (all the trivial hostname examples
complete successfully).

I fire up an R session with the following command

salloc orterun -n 7 R --vanilla

and then run
suppressMessages(library(Rmpi))

but my console never returns control.

It just freezes until I <control-c> out of it, at which point I get
this message:
[linuxsvr.kls.corp:05875] mca: base: component_find: unable to open
osc pt2pt: file not found (ignored)
orterun: killing job...

orterun noticed that job rank 0 with PID 5875 on node node0 exited on
signal 15 (Terminated).
salloc: Relinquishing job allocation 70
[warmstrong at linuxsvr ~]$

meanwhile squeue shows:

[warmstrong at linuxsvr ~]$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
     71      prod  orterun warmstro   R       0:31      1 node0
[warmstrong at linuxsvr ~]$


Have I missed something crucial?  Should I only be running these
examples in batch mode or with littler?

Thanks in advance,
Whit
#
"Whit Armstrong" <armstrong.whit at gmail.com> writes:
I think you want to salloc your universe, and then run R on one node
of the universe

salloc -n 7 orterun -np 1 R --vanilla

then mpi.universe.size() will report 7.

Martin

  
    
#
Thanks, Martin.

I am able to load the Rmpi package when I run the command you suggest.
However, when I call getMPIcluster, the object returned is always
NULL:

[warmstrong at linuxsvr ~]$ salloc -n 8 orterun -np 1 R --vanilla
salloc: Granted job allocation 84

R version 2.8.0 (2008-10-20)
Copyright (C) 2008 The R Foundation for Statistical Computing
...
...
> library(Rmpi)
[linuxsvr.kls.corp:09097] mca: base: component_find: unable to open
osc pt2pt: file not found (ignored)
> library(snow)
> cl <- getMPIcluster()
> cl
NULL
> mpi.universe.size()
[1] 8
Any suggestions?

Thanks,
Whit
On Mon, Jan 5, 2009 at 3:46 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
#
Whit Armstrong wrote:
Provide makeMPIcluster with an argument 'count' to indicate how many
nodes to launch: makeMPIcluster(7).

I think makeMPIcluster() is looking at mpi.comm.size to determine how 
many nodes to launch, instead of mpi.universe.size().
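A minimal sketch of that, assuming a universe allocated with salloc as
above (the -1 reserves one rank for the master):

```r
library(Rmpi)
library(snow)

## launch one worker per allocated slot, keeping one rank for the master
nworkers <- mpi.universe.size() - 1
cl <- makeMPIcluster(nworkers)

## sanity check: each worker reports its hostname
print(clusterCall(cl, function() Sys.info()["nodename"]))

stopCluster(cl)
```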

A caveat, maybe others will chime in -- I don't usually use slurm or 
snow, so don't have a lot of experience with the specifics of this setup.

Martin

  
    
#
On 5 January 2009 at 16:04, Whit Armstrong wrote:
| > library(Rmpi)
| [linuxsvr.kls.corp:09097] mca: base: component_find: unable to open
| osc pt2pt: file not found (ignored)
| > library(snow)
| > cl <- getMPIcluster()
| > cl

I don't think that works.  You need to be explicit in the creation of the
cluster.  The best trick I found was re-factoring / abstracting-out what
snow does in its internal scripts. I showed that in the UseR! talk (as opposed
to the tutorial) and picked it up in last month's presentation. It goes as
follows:

-----------------------------------------------------------------------------
#!/usr/bin/env r

suppressMessages(library(Rmpi))
suppressMessages(library(snow))

#mpirank <- mpi.comm.rank(0)    # just FYI
ndsvpid <- Sys.getenv("OMPI_MCA_ns_nds_vpid")
if (ndsvpid == "0") {                   # are we master ?
    #cat("Launching master (OMPI_MCA_ns_nds_vpid=", ndsvpid, " mpi rank=",     mpirank, ")\n")
    makeMPIcluster()
} else {                                # or are we a slave ?
    #cat("Launching slave with (OMPI_MCA_ns_nds_vpid=", ndsvpid, " mpi rank=", mpirank, ")\n")
    sink(file="/dev/null")
    slaveLoop(makeMPImaster())
    q()
}

## a trivial main body, but note how getMPIcluster() learns from the
## launched cluster how many nodes are available
cl <- getMPIcluster()
clusterEvalQ(cl, options("digits.secs"=3))   ## use millisecond granularity
res <- clusterCall(cl, function() paste(Sys.info()["nodename"], format(Sys.time())))
print(do.call(rbind, res))
stopCluster(cl)
-----------------------------------------------------------------------------

which you can launch via salloc, as Martin suggested, to create a slurm
allocation; orterun then actually uses the allocation by calling your
script. I tend to wrap things into a littler script, i.e. something like

      $ salloc -w host[1-32] orterun -n 8 nameOfTheScriptAbove.r

where you should then see 7 hosts (as one acts as the dispatching controller,
so you get N-1 working out of N assigned by orterun).

This has the advantage of never hard-coding how many nodes you use; it is
all driven from the command line.  If you always have the same fixed nodes,
then it is easier to just use the default snow cluster creation with
hard-wired nodes.
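For the fixed-node case, a sketch of the plain snow route (the hostnames
are illustrative; this starts socket workers over ssh rather than MPI):

```r
library(snow)

## hard-wired worker list: one entry per worker process you want
cl <- makeSOCKcluster(c("node1", "node2", "node3", "node4"))

## sanity check: each worker reports its hostname
print(clusterCall(cl, function() Sys.info()["nodename"]))

stopCluster(cl)
```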

Hth,  Dirk
#
Thanks, Dirk.

I can run your example, but I'm confused about two things.

1) I can only get the jobs to run on node0 (the controller node), no
matter what number I use for -n or -w.

2) I don't understand how to use this example in the context of the
parLapply function.  It's possible that I don't understand your
script, but it seems to me that orterun is simply sending this script
out to all the nodes to be executed.  What I really want to do is load
my data into a list, then do a parLapply on the list such that each
execution of the applied function is farmed out to a different node.
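A sketch of that pattern, assuming the launcher script has already created
the cluster (the data list and per-chunk function below are hypothetical):

```r
library(snow)

## assumes the launcher script already ran, so this returns a live
## cluster object rather than NULL
cl <- getMPIcluster()

## hypothetical: split a data set into a list of chunks
my.chunks <- split(iris, iris$Species)

## hypothetical per-chunk worker function
fit.one <- function(df) mean(df$Sepal.Length)

## each list element is dispatched to an available worker node
res <- parLapply(cl, my.chunks, fit.one)
print(unlist(res))

stopCluster(cl)
```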

Sorry that I need so much instruction with this.

Here is the output from running: salloc orterun -n 100 test.mpi.r
(that's your example script).
      [,1]
  [1,] "linuxsvr.kls.corp 2009-01-05 17:25:56.250"
  [2,] "linuxsvr.kls.corp 2009-01-05 17:25:56.257"
  [3,] "linuxsvr.kls.corp 2009-01-05 17:25:56.260"
  [4,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
  [5,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
  [6,] "linuxsvr.kls.corp 2009-01-05 17:25:56.258"
...
and so on.  The hostname in all cases is linuxsvr (the controller node).

When I try with the -w option, the job just hangs:

[warmstrong at linuxsvr ~]$ salloc -w node[0-4] orterun -n 100 test.mpi.r
salloc: Granted job allocation 118


and the following may prove helpful:

[warmstrong at linuxsvr ~]$ squeue
  JOBID PARTITION     NAME     USER  ST       TIME  NODES NODELIST(REASON)
    118      prod  orterun warmstro   R       1:03      5 node[0-4]

[warmstrong at linuxsvr ~]$ scontrol show nodes
NodeName=node0 State=ALLOCATED CPUs=8 AllocCPUs=8 RealMemory=64000 TmpDisk=0
   Sockets=2 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
NodeName=node1 State=ALLOCATED CPUs=1 AllocCPUs=1 RealMemory=2000 TmpDisk=0
   Sockets=1 Cores=1 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
NodeName=node2 State=ALLOCATED CPUs=4 AllocCPUs=4 RealMemory=2000 TmpDisk=0
   Sockets=1 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
NodeName=node3 State=ALLOCATED CPUs=2 AllocCPUs=2 RealMemory=2000 TmpDisk=0
   Sockets=1 Cores=2 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
NodeName=node4 State=ALLOCATED CPUs=4 AllocCPUs=4 RealMemory=2000 TmpDisk=0
   Sockets=1 Cores=4 Threads=1 Weight=1 Features=(null) Reason=(null)
   Arch=x86_64 OS=Linux
[warmstrong at linuxsvr ~]$


[warmstrong at linuxsvr ~]$ scontrol show job 118
JobId=118 UserId=warmstrong(11122) GroupId=domain users(10513)
   Name=orterun
   Priority=4294901641 Partition=prod BatchFlag=0
   AllocNode:Sid=linuxsvr:8453 TimeLimit=UNLIMITED ExitCode=0:0
   JobState=RUNNING StartTime=01/05-17:27:38 EndTime=NONE
   NodeList=node[0-4] NodeListIndices=0-4
   AllocCPUs=8,1,4,2,4
   ReqProcs=5 ReqNodes=5 ReqS:C:T=1-64.00K:1-64.00K:1-64.00K
   Shared=0 Contiguous=0 CPUs/task=0 Licenses=(null)
   MinProcs=1 MinSockets=1 MinCores=1 MinThreads=1
   MinMemoryNode=0 MinTmpDisk=0 Features=(null)
   Dependency=(null) Account=(null) Requeue=1
   Reason=None Network=(null)
   ReqNodeList=node[0-4] ReqNodeListIndices=0-4
   ExcNodeList=(null) ExcNodeListIndices=
   SubmitTime=01/05-17:27:38 SuspendTime=None PreSusTime=0

[warmstrong at linuxsvr ~]$


Thanks,
Whit
On Mon, Jan 5, 2009 at 4:40 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
1 day later
#
Thanks to everyone for helping me sort out these issues. I finally
have our cluster up and running on all my nodes.

Per Dirk's suggestion, below is a short checklist for anyone setting
up a slurm/Rmpi/snow cluster.

1) ensure that UIDs and GIDs are identical across all nodes.

We are using Windows authentication on our Linux servers, so we had to
remove the local slurm and munge UIDs and GIDs from /etc/passwd and
create Windows users and groups for slurm and munge to ensure
consistency across all nodes.  Alternatively, you can copy /etc/passwd
to all the remote nodes, but that is a bit of a maintenance nightmare.

2) make sure all your nodes have the same munge.key.

See, "Creating a Secret Key" on this page:
http://home.gna.org/munge/install_guide.html

3) make sure all nodes have the same slurm.key and slurm.conf.

See: "Create OpenSSL keys" on this page:
https://computing.llnl.gov/linux/slurm/quickstart_admin.html

4)  make sure you can ssh to the compute nodes with no password.

Here is a good site:
http://wiki.freaks-unidos.net/ssh%20without%20password
Our setup has /home mounted on all nodes, so just storing the keys in
/home/username/.ssh works.  If remote nodes do not have /home mounted,
then you will need a different setup. This must be done separately for
all users who will use the cluster.
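The key setup itself is the usual ssh-keygen/authorized_keys dance; a
sketch (the key file name is illustrative, and with /home mounted on all
nodes the single append authorizes the key everywhere at once):

```shell
# create a passphrase-less key pair and authorize it for this account
mkdir -p "$HOME/.ssh" && chmod 700 "$HOME/.ssh"
ssh-keygen -t rsa -N "" -q -f "$HOME/.ssh/id_rsa_cluster"
cat "$HOME/.ssh/id_rsa_cluster.pub" >> "$HOME/.ssh/authorized_keys"
chmod 600 "$HOME/.ssh/authorized_keys"
```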

5) try very hard to use the same Linux distribution across all nodes.

Unfortunately, for us, this is not the case.  Our main server is
RHEL5, and all our nodes are Ubuntu.  I had to manually
compile/install Open MPI on the RHEL server (as I was very unhappy
with their packaged version).  My issue yesterday was due to orterun
being installed in /usr/local/bin on the controller node (RHEL) but
in /usr/bin on the compute nodes (Ubuntu).  Open MPI seems to assume
that orterun is in the same location on all machines, which resulted
in the following error in slurmd.log:
[Jan 05 14:05:00] [57.0] execve(): /usr/local/bin/orterun: No such
file or directory

Recompiling Open MPI on the RHEL server so that the orterun binary
sits in the same location as on the compute nodes finally fixed the
problem.

6) in addition to rebooting nodes, also use "sudo scontrol reconfigure"
to make sure that the slurm.conf file is reloaded on the compute nodes.

We kept getting jobs stuck in the completing state due to a UID/GID
problem, which showed the following error:
[Dec 31 12:58:22] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[Dec 31 12:58:22] debug:  _rpc_terminate_job, uid = 11667
[Dec 31 12:58:22] error: Security violation: kill_job(2) from uid 11667

This problem was finally resolved by rebooting all the compute nodes
and running sudo scontrol reconfigure on all of them.

7) verify each component independently. Per Dirk: basic MPI with one
of the hello-world examples, then Rmpi, then snow, then slurm.

This allowed me to find the ssh problem with MPI, since slurm/munge
are happy to authenticate with their shared keys rather than using
ssh.
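In command form, that bottom-up check looks something like this (all
commands illustrative; test.mpi.r is the launcher script from Dirk's
message):

```shell
# 1) plain MPI: should print one hostname per rank, across machines
orterun -n 4 hostname

# 2) Rmpi: every rank can load the library and see the communicator
orterun -n 4 R --vanilla -e 'library(Rmpi); print(mpi.comm.size(0)); mpi.quit()'

# 3) snow: run the launcher script directly under orterun
orterun -n 4 test.mpi.r

# 4) slurm: the same, but inside an allocation
salloc -n 4 orterun -n 4 test.mpi.r
```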


I hope this checklist can serve as a useful guide for anyone who faces
the harrowing task of setting up a cluster.  Now that the hard part is
done we are seeing close to linear speedups on our simulations, so the
end result is worth the pain.

The next chore for me is node maintenance.  Dirk has suggested dsh
(dancer's shell):
http://www.netfort.gr.jp/~dancer/software/dsh.html.en and Moe at LLNL
has suggested pdsh: https://sourceforge.net/projects/pdsh/.  If anyone
has additional suggestions, I would love to hear them.

Cheers,
Whit
#
Hi Whit,

Regarding 4), my slurm setup actually disables it, so users cannot log
in or exec remotely on any compute node. slurm/munge seem to take care
of authentication and remote execution.

Hao

PS: in /etc/pam.d/common-auth, the following line was added
account    required     /lib/security/pam_slurm.so
Whit Armstrong wrote:

  
    
#
I'd strongly encourage you to adopt puppet for managing cluster nodes;
you want to be able to "promise" that an up-to-date cluster node is
just like any of your other cluster nodes: same software installed,
same keys distributed, same permutation of bugs.

It may initially seem like a large investment of time to learn
puppet's configuration language and to figure out how to automate many
of your sysadmin processes, but the long-term payoff in reliability
and the decrease in your management overhead will *more than* repay
it, and let you get back to doing research, or whatever it is you need
the nodes for.  :-)

--elijah