Rmpi and cpu usage on slaves
6 messages · Dirk Eddelbuettel, Sean Davis, Hao Yu +1 more
On 21 April 2009 at 16:40, Sean Davis wrote:
| I am running sge6.2, openmpi 1.3.1, and Rmpi 0.5.7 on openSUSE linux. I can
| start up an arbitrarily-sized cluster using sge, see the appropriate
| universe.size using Rmpi, and start a cluster using mpi.spawn.Rslaves().
| However, it appears that all the slaves then run at 100% cpu on all nodes.
| Even using Rmpi under openmpi with a simple hostfile produces the same
| result. Any suggestions to figure out what is going on on the slaves?

There is a known issue with Open MPI and blocking which you may be hitting
here. Upstream Open MPI considers it a feature. But as this has come up a
few times on their mailing list as well, I believe the last word was that
it will go away in a future release.

Hth, Dirk

| Thanks,
| Sean
|
| > library(Rmpi)
| > mpi.universe.size()
| [1] 24
| > mpi.spawn.Rslaves()
| 24 slaves are spawned successfully. 0 failed.
| master  (rank 0 , comm 1) of size 25 is running on: Mahfouz
| slave1  (rank 1 , comm 1) of size 25 is running on: Mahfouz
| slave2  (rank 2 , comm 1) of size 25 is running on: Mahfouz
| slave3  (rank 3 , comm 1) of size 25 is running on: Mahfouz
| slave4  (rank 4 , comm 1) of size 25 is running on: Mahfouz
| slave5  (rank 5 , comm 1) of size 25 is running on: Mahfouz
| slave6  (rank 6 , comm 1) of size 25 is running on: Mahfouz
| slave7  (rank 7 , comm 1) of size 25 is running on: Mahfouz
| slave8  (rank 8 , comm 1) of size 25 is running on: Grass
| slave9  (rank 9 , comm 1) of size 25 is running on: Grass
| slave10 (rank 10, comm 1) of size 25 is running on: Grass
| slave11 (rank 11, comm 1) of size 25 is running on: Grass
| slave12 (rank 12, comm 1) of size 25 is running on: Grass
| slave13 (rank 13, comm 1) of size 25 is running on: Grass
| slave14 (rank 14, comm 1) of size 25 is running on: Grass
| slave15 (rank 15, comm 1) of size 25 is running on: Grass
| slave16 (rank 16, comm 1) of size 25 is running on: shakespeare
| slave17 (rank 17, comm 1) of size 25 is running on: shakespeare
| slave18 (rank 18, comm 1) of size 25 is running on: shakespeare
| slave19 (rank 19, comm 1) of size 25 is running on: shakespeare
| slave20 (rank 20, comm 1) of size 25 is running on: shakespeare
| slave21 (rank 21, comm 1) of size 25 is running on: shakespeare
| slave22 (rank 22, comm 1) of size 25 is running on: shakespeare
| slave23 (rank 23, comm 1) of size 25 is running on: shakespeare
| slave24 (rank 24, comm 1) of size 25 is running on: Mahfouz
| > mpi.close.Rslaves()
| [1] 1
|
| > sessionInfo() # on the master
| R version 2.9.0 Under development (unstable) (2009-02-21 r47969)
| x86_64-unknown-linux-gnu
|
| locale:
| LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
|
| attached base packages:
| [1] stats graphics grDevices utils datasets methods base
|
| other attached packages:
| [1] Rmpi_0.5-7
|
| _______________________________________________
| R-sig-hpc mailing list
| R-sig-hpc at r-project.org
| https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
Three out of two people have difficulties with fractions.
As Dirk said, it is a feature of Open MPI; LAM-MPI doesn't have this issue. I don't think there is a solution on the slave side, since mpi.bcast is a blocking call. It might be possible to use nonblocking point-to-point calls such as mpi.irecv together with Sys.sleep, but the whole slave communication layer would have to be rewritten. If Dirk is correct, a future release of Open MPI will remove this behaviour, which is why I did not try to work out a solution, at least on the slave side. In real computation, all slaves are supposed to use up all of their assigned CPU cycles anyway.

The same issue applies to the master as well whenever any of the parallel apply functions are used. In Rmpi 0.4-7 several nonblocking parallel apply functions were added, so the master does not consume 100% CPU while waiting. So far, LAM-MPI is still the best environment for programming, debugging, and testing.

Hao
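To make Hao's point concrete, here is a rough sketch of the kind of slave-side rewrite he is describing: instead of blocking in mpi.bcast (where Open MPI spins), the slave polls with mpi.iprobe and sleeps between probes. This is not how mpi.spawn.Rslaves() actually works; the tag values and the task/result message format are invented for the sketch, and it of course only runs inside a working Rmpi/MPI session.

```r
# Hypothetical slave-side loop: poll for work instead of blocking, and
# sleep while idle so the slave yields the CPU rather than spinning.
# Tag conventions (1 = task/result, 2 = shutdown) are made up here.
library(Rmpi)

slave.poll.loop <- function() {
    repeat {
        # mpi.iprobe() returns immediately: TRUE if a message is pending
        if (mpi.iprobe(source = 0, tag = mpi.any.tag(), comm = 1)) {
            tag <- mpi.get.sourcetag()[2]
            if (tag == 2) break                       # shutdown request
            task   <- mpi.recv.Robj(source = 0, tag = tag, comm = 1)
            result <- do.call(task$fun, task$args)
            mpi.send.Robj(result, dest = 0, tag = 1, comm = 1)
        } else {
            Sys.sleep(0.01)  # idle: give the CPU back between probes
        }
    }
}
```

The Sys.sleep interval trades latency for idle CPU: a longer sleep lowers the idle load but delays task pickup, which is part of why, as Hao notes, the whole communication scheme would need rethinking rather than a one-line patch.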
Dirk Eddelbuettel wrote:
| [Dirk's reply of 21 April 2009, quoted in full above, elided]
Department of Statistics & Actuarial Sciences    Fax Phone#: (519)-661-3813
The University of Western Ontario                Office Phone#: (519)-661-3622
London, Ontario N6A 5B7                          http://www.stats.uwo.ca/faculty/yu
On Wed, 2009-04-22 at 12:55 -0400, Sean Davis wrote:
> So, as Dirk suggested, the 100% CPU usage is thought to be a feature and not a bug.
As someone who just ran into this (the link to the thread was in the earlier links), I looked into why Open MPI works that way. I think a fairer characterization is that responsiveness while running Open MPI is the feature; the 100% CPU usage is just a side effect. Fixing it is on their to-do list, but it is not a high priority, because the fix is a bit tricky and the usage scenario the developers consider standard reserves the CPUs for the job anyway. I don't happen to fit that scenario, but I suspect the developers are right in their judgement of typical use.

If you're concerned, a number of work-arounds or brittle fixes are in the openmpi archives; the responses to my query (http://www.open-mpi.org/community/lists/users/2009/04/9016.php) have pointers to a couple of them.

Ross
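One work-around commonly mentioned in the Open MPI archives is the mpi_yield_when_idle MCA parameter, which puts Open MPI's progress loop into "degraded" mode so an idle process calls yield inside its spin loop. A sketch of applying it from R, on the assumption that Open MPI picks up MCA parameters from OMPI_MCA_* environment variables set before MPI_Init runs (i.e. before Rmpi is loaded):

```r
# Assumed work-around sketch: request Open MPI's degraded progression
# mode via an environment variable. This must be set before MPI_Init,
# so it has to happen before library(Rmpi) triggers initialization.
Sys.setenv(OMPI_MCA_mpi_yield_when_idle = "1")

library(Rmpi)        # MPI_Init runs here and reads the MCA setting
mpi.spawn.Rslaves()  # spawned slaves inherit the environment
```

Note that yielding only lowers the spinning process's scheduling priority; an otherwise idle node will still show near-100% CPU in top, but competing jobs on the same cores will no longer be starved.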
Ross Boylan                                      wk:  (415) 514-8146
185 Berry St #5700                               ross at biostat.ucsf.edu
Dept of Epidemiology and Biostatistics           fax: (415) 514-8150
University of California, San Francisco
San Francisco, CA 94107-1739                     hm:  (415) 550-1062