Rmpi and cpu usage on slaves

6 messages · Dirk Eddelbuettel, Sean Davis, Hao Yu +1 more

On 21 April 2009 at 16:40, Sean Davis wrote:
| I am running sge6.2, openmpi 1.3.1, and Rmpi 0.5.7 on openSUSE linux.  I can
| start up an arbitrarily-sized cluster using sge, see the appropriate
| universe.size using Rmpi, and start a cluster using mpi.spawn.Rslaves().
| However, it appears that all the slaves then run at 100% cpu on all nodes.
| Even using Rmpi under openmpi with a simple hostfile produces the same
| result.  Any suggestions to figure out what is going on on the slaves?

There is a known issue with Open MPI busy-waiting in blocking calls, which you
may be hitting here.  Upstream Open MPI considers it a feature, but as this has
come up a few times on their mailing list as well, I believe the last word was
that it will go away in a future release.

Hth, Dirk

| Thanks,
| Sean
| 
| 
| > library(Rmpi)
| > mpi.universe.size()
| [1] 24
| > mpi.spawn.Rslaves()
|         24 slaves are spawned successfully. 0 failed.
| master  (rank 0 , comm 1) of size 25 is running on: Mahfouz
| slave1  (rank 1 , comm 1) of size 25 is running on: Mahfouz
| slave2  (rank 2 , comm 1) of size 25 is running on: Mahfouz
| slave3  (rank 3 , comm 1) of size 25 is running on: Mahfouz
| slave4  (rank 4 , comm 1) of size 25 is running on: Mahfouz
| slave5  (rank 5 , comm 1) of size 25 is running on: Mahfouz
| slave6  (rank 6 , comm 1) of size 25 is running on: Mahfouz
| slave7  (rank 7 , comm 1) of size 25 is running on: Mahfouz
| slave8  (rank 8 , comm 1) of size 25 is running on: Grass
| slave9  (rank 9 , comm 1) of size 25 is running on: Grass
| slave10 (rank 10, comm 1) of size 25 is running on: Grass
| slave11 (rank 11, comm 1) of size 25 is running on: Grass
| slave12 (rank 12, comm 1) of size 25 is running on: Grass
| slave13 (rank 13, comm 1) of size 25 is running on: Grass
| slave14 (rank 14, comm 1) of size 25 is running on: Grass
| slave15 (rank 15, comm 1) of size 25 is running on: Grass
| slave16 (rank 16, comm 1) of size 25 is running on: shakespeare
| slave17 (rank 17, comm 1) of size 25 is running on: shakespeare
| slave18 (rank 18, comm 1) of size 25 is running on: shakespeare
| slave19 (rank 19, comm 1) of size 25 is running on: shakespeare
| slave20 (rank 20, comm 1) of size 25 is running on: shakespeare
| slave21 (rank 21, comm 1) of size 25 is running on: shakespeare
| slave22 (rank 22, comm 1) of size 25 is running on: shakespeare
| slave23 (rank 23, comm 1) of size 25 is running on: shakespeare
| slave24 (rank 24, comm 1) of size 25 is running on: Mahfouz
| > mpi.close.Rslaves()
| [1] 1
| 
| > sessionInfo()    # on the master
| R version 2.9.0 Under development (unstable) (2009-02-21 r47969)
| x86_64-unknown-linux-gnu
| 
| locale:
| LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
| 
| attached base packages:
| [1] stats     graphics  grDevices utils     datasets  methods   base
| 
| other attached packages:
| [1] Rmpi_0.5-7
| 
| 
| _______________________________________________
| R-sig-hpc mailing list
| R-sig-hpc at r-project.org
| https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
As Dirk said, it is a feature of Open MPI; LAM-MPI doesn't have this issue.
I don't think there is a solution on the slave side, since mpi.bcast is a
blocking call. It might be possible to use nonblocking point-to-point
calls such as mpi.irecv together with Sys.sleep, but the whole slave
communication layer would have to be rewritten. If Dirk is correct, a future
release of Open MPI will remove this behaviour. This is why I did not try to
work out a solution, at least on the slave side. In real computation, all
slaves are supposed to use up all of their assigned CPU cycles anyway.
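[A slave-side polling loop of the kind mentioned above might look roughly like
this. This is only an untested sketch: it assumes an already-spawned Rmpi
slave, uses mpi.iprobe (rather than a blocking receive) to check for pending
work, and the "done" sentinel message is an invented convention, not part of
Rmpi. It cannot run outside an MPI environment.]

```r
library(Rmpi)

# Sketch of a slave loop that yields the CPU while idle, instead of
# busy-waiting inside a blocking mpi.recv/mpi.bcast call.
repeat {
  # Poll for a pending message from the master (rank 0) without blocking.
  while (!mpi.iprobe(source = 0, tag = mpi.any.tag(), comm = 1)) {
    Sys.sleep(0.01)   # sleep between polls so the slave doesn't spin at 100%
  }
  msg <- mpi.recv.Robj(source = 0, tag = mpi.any.tag(), comm = 1)
  if (identical(msg, "done")) break   # hypothetical shutdown sentinel
  # ... perform the requested work and send the result back to rank 0 ...
}
```

The trade-off is latency: the sleep interval bounds how quickly a slave notices
new work, which is why the stock blocking implementation was left alone.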

The same issue applies to the master as well if any of the parallel apply
functions are used. In Rmpi 0.4-7, several nonblocking parallel apply
functions were added so that the master does not consume 100% CPU while waiting.

So far LAM-MPI is still the best environment for programming, debugging and
testing.

Hao
On Wed, 2009-04-22 at 12:55 -0400, Sean Davis wrote:
As someone who just ran into this (the thread was linked earlier), I looked
into why Open MPI works that way.  I think a fairer characterization is that
responsiveness while running Open MPI is the feature; the 100% CPU usage is
just a side effect.  Fixing it is on their todo list, but it's not a high
priority because the fix is a bit tricky, and the usage scenario that strikes
the developers as standard reserves the CPUs for the job anyway.

I don't happen to fit that scenario, but I suspect the developers are
right in their judgement of typical use.

If you're concerned, a number of work-arounds or brittle fixes are in
the openmpi archives; the responses to my query
(http://www.open-mpi.org/community/lists/users/2009/04/9016.php) have
pointers to a couple of them.
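[One workaround that comes up in those archives is the mpi_yield_when_idle
MCA parameter, which tells idle ranks to call sched_yield() between polls
instead of spinning flat out. Whether it helps in your exact 1.3.1 setup is an
assumption; the process still polls, so top may still show high CPU, but other
processes get scheduled ahead of it. The program name below is a placeholder.]

```shell
# Run in "degraded" progression mode: idle ranks yield the CPU between polls.
mpirun --mca mpi_yield_when_idle 1 -np 4 ./my_program

# Or set it for all runs of this shell via the corresponding environment variable:
export OMPI_MCA_mpi_yield_when_idle=1
```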

Ross