simple question on R/Rmpi/snow/slurm configuration
Hi Whit,

Regarding 4), my slurm setup actually disables password-less ssh, so users cannot log in to or exec on any remote node directly. slurm/munge take care of authentication and remote execution.

Hao

PS: in /etc/pam.d/common-auth, the following line was added:

account required /lib/security/pam_slurm.so
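For reference, pam_slurm denies access to a node unless the user has a job currently running there. A minimal sketch of that stanza in context (the exact file path and the surrounding lines are assumptions and vary by distribution):

```
# /etc/pam.d/common-auth (sketch; exact contents vary by distribution)
# ... existing auth lines ...
# Deny access unless the user has a running SLURM job on this node:
account  required  /lib/security/pam_slurm.so
```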
Whit Armstrong wrote:
Thanks to everyone for helping me sort out these issues. I finally have our cluster up and running on all my nodes. Per Dirk's suggestion, below is a short checklist for anyone setting up a slurm/Rmpi/snow cluster.

1) Ensure that UIDs and GIDs are identical across all nodes. We use Windows authentication on our Linux servers, so we had to remove the local slurm and munge UIDs and GIDs from /etc/passwd and create Windows users and groups for slurm and munge to ensure consistency across all nodes. Alternatively, you can copy /etc/passwd to all the remote nodes, but that is a bit of a maintenance nightmare.

2) Make sure all your nodes have the same munge.key. See "Creating a Secret Key" on this page: http://home.gna.org/munge/install_guide.html

3) Make sure all nodes have the same slurm.key and slurm.conf. See "Create OpenSSL keys" on this page: https://computing.llnl.gov/linux/slurm/quickstart_admin.html

4) Make sure you can ssh to the compute nodes without a password. Here is a good site: http://wiki.freaks-unidos.net/ssh%20without%20password Our setup has /home mounted on all nodes, so simply storing the keys in /home/username/.ssh works. If the remote nodes do not have /home mounted, you will need a different setup. This must be done separately for every user who will use the cluster.

5) Try very hard to use the same Linux distribution on all nodes. Unfortunately, that is not the case for us: our main server is RHEL5 and all our nodes are Ubuntu. I had to compile and install openMPI manually on the Red Hat server (as I was very unhappy with their packaged version). My issue yesterday was that orterun was installed in /usr/local/bin on the controller node (Red Hat) but in /usr/bin on the compute nodes (Ubuntu); openMPI seems to assume that orterun is in the same location on all machines.
Which resulted in the following error in slurmd.log:

[Jan 05 14:05:00] [57.0] execve(): /usr/local/bin/orterun: No such file or directory

Recompiling openMPI on the RHEL server so that the orterun binary sits in the same location as on the compute nodes finally fixed the problem.

6) In addition to rebooting nodes, use "sudo scontrol reconfigure" to make sure that slurm.conf is reloaded on the compute nodes. We kept getting jobs stuck in the completing state due to a UID/GID problem, which showed the following errors:

[Dec 31 12:58:22] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[Dec 31 12:58:22] debug: _rpc_terminate_job, uid = 11667
[Dec 31 12:58:22] error: Security violation: kill_job(2) from uid 11667

This was finally resolved by rebooting all the compute nodes and running sudo scontrol reconfigure on each of them.

7) Verify each component independently. Per Dirk: basic MPI with one of the helloWorld examples, then Rmpi, then snow, then slurm. This allowed me to find the ssh problem with MPI, since slurm/munge are happy to authenticate with their shared keys rather than using ssh.
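Items 1)-3) all boil down to "this value must be identical on every node", which is easy to script. A minimal sketch (the NODES list and the ssh loops in the comments are hypothetical; adjust for your site):

```shell
# all_same: succeed only if every line on stdin is identical.
# Useful for comparing UIDs, GIDs, key checksums, etc. gathered from all nodes.
all_same() {
  [ "$(sort -u | wc -l | tr -d ' ')" -le 1 ]
}

# Hypothetical usage from the controller node:
#   NODES="node1 node2 node3"
#   for n in $NODES; do ssh "$n" id -u slurm; done | all_same \
#     || echo "slurm UID differs across nodes"
#   for n in $NODES; do ssh "$n" md5sum /etc/munge/munge.key; done | all_same \
#     || echo "munge.key differs across nodes"
```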
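The "verify each component independently" step in 7) might look like the following, run from the controller. The host names and slave counts are hypothetical examples, and these commands need a live cluster, so treat this as a sketch rather than a definitive recipe:

```shell
# 1) Plain MPI across two nodes: should print both hostnames
mpirun -H node1,node2 -np 2 hostname

# 2) Rmpi on top of MPI: spawn slaves and ask where they are running
R --no-save -e 'library(Rmpi); mpi.spawn.Rslaves(nslaves=2); print(mpi.remote.exec(Sys.info()[["nodename"]])); mpi.close.Rslaves(); mpi.quit()'

# 3) snow on top of Rmpi: same check through the cluster abstraction
R --no-save -e 'library(snow); cl <- makeCluster(2, type="MPI"); print(clusterCall(cl, function() Sys.info()[["nodename"]])); stopCluster(cl)'

# 4) slurm itself: run a trivial job on two nodes
srun -N2 hostname
```

Testing in this order isolates failures: if step 1 breaks, the problem is MPI or ssh, not R; if only step 4 breaks, it is slurm/munge.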
I hope this checklist can serve as a useful guide for anyone who faces the harrowing task of setting up a cluster. Now that the hard part is done, we are seeing close to linear speedups on our simulations, so the end result is worth the pain.

The next chore for me is node maintenance. Dirk has suggested dsh (dancer's shell): http://www.netfort.gr.jp/~dancer/software/dsh.html.en and Moe at LLNL has suggested pdsh: https://sourceforge.net/projects/pdsh/ If anyone has additional suggestions, I would love to hear them.

Cheers,
Whit

_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
Department of Statistics & Actuarial Sciences   Office Phone: (519) 661-3622
The University of Western Ontario               Fax: (519) 661-3813
London, Ontario N6A 5B7
http://www.stats.uwo.ca/faculty/yu