
parallel and openblas

20 messages · Claudia Beleites, Whit Armstrong, Martin Renner +4 more

#
Parallel and OpenBLAS don't seem to mix well on my machine. If I link against OpenBLAS, a job executed through parallel (using either the multicore or the snow (local socket cluster) setup) leaves each of my 8 cores operating at only 1/8 of 100%, taking a little longer than serial execution. Linking against the reference BLAS or single-threaded ATLAS does not cause this handicap when running snow or multicore. 

Is this a known problem (my Google searches were fruitless)? If yes, is there a fix for it? Do MKL or multi-threaded ATLAS have the same issue? 

Thank you for your time. 

Martin



Martin Renner
Post-doctoral Fellow				phone: 907-226 4672
University of Washington			   or: 907-235 0728
School of Aquatic and Fishery Sciences		Seattle, USA





debian squeeze on 8-core Xeon
R version 2.15.0 (2012-03-30)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
[3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
[5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
[7] LC_PAPER=C                 LC_NAME=C                 
[9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
#
Martin, do you actually know that each core works at 1/8th (as opposed to
all 8 threads running on the same core)?

Is implicitly parallel code (e.g. %*%) working OK without the parallel package?

Claudia

Am 23.04.2012 21:53, schrieb Martin Renner:

#
I believe you can set an env variable to determine the number of
threads to use.  Perhaps search the openblas doc.
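For what it's worth, the usual knobs are environment variables set before R starts. A minimal sketch; which variable is honored depends on how the BLAS was built, and these variable names are the commonly documented ones, not taken from this thread:

```shell
# Cap BLAS threading before launching R (or R workers), so that each
# parallel worker does not itself try to spawn a thread per core.
export OPENBLAS_NUM_THREADS=1   # OpenBLAS
export GOTO_NUM_THREADS=1       # GotoBLAS2
export OMP_NUM_THREADS=1        # OpenMP-based BLAS builds
echo "BLAS thread caps: $OPENBLAS_NUM_THREADS $GOTO_NUM_THREADS $OMP_NUM_THREADS"
```

Setting these to 1 per worker, then letting parallel/snow provide the process-level parallelism, avoids oversubscription (N workers times N BLAS threads).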

Alternatively, execute on 8 different machines...

But for that you might need... rzmq.

-Whit
On Mon, Apr 23, 2012 at 3:53 PM, Martin Renner <greatauklet at gmail.com> wrote:
#
Yes, Claudia, you're probably right that all these threads are on one core. And yes, %*% is working as expected, utilizing all available cores, when R is linked against OpenBLAS.

Martin
On 23 Apr 2012, at 12:07 , Claudia Beleites wrote:

#
Martin, 

I possibly have/had the same problem (on a CentOS 5 system). 
R was assigned to one core only, and the solution is:
 system(sprintf('taskset -p 0xffffffff %d', Sys.getpid()))

The whole thread is here: https://stat.ethz.ch/pipermail/r-sig-hpc/2011-November/001171.html

Claudia
#
Hallo Claudia,

Thank you for that hint -- it works! A more permanent solution would be nice, though; I will need to look into that. 

Best,
Martin



On 24 Apr 2012, at 02:29 , beleites,claudia wrote:

#
Martin,

please let me know the permanent solution -- I'm still using taskset all the time (I don't have admin rights, but our admin would set it up if I could tell him exactly what I need).

Claudia
#
There's an interesting discussion entitled "all processes run on
one CPU core" at:

    https://github.com/ipython/ipython/issues/840

Someone was experiencing a very similar problem to the one that
Claudia described using GotoBLAS2 with IPython and NumPy.
Apparently it was fixed by recompiling GotoBLAS2 with the
"NO_AFFINITY" parameter set to "1" in Makefile.rule, and then
rebuilding "NumPy".

It seems pretty strange, but GotoBLAS2/OpenBLAS may be modifying
the affinity of the R process by calling sched_setaffinity() when
it is initialized, and that is causing the problems that Claudia
and Martin have seen.

So perhaps the solution is to recompile GotoBLAS2/OpenBLAS with
NO_AFFINITY=1, and then rebuild R with it.
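A sketch of the rebuild described above; the directory name and the install step are illustrative assumptions, not taken from the thread:

```shell
# Illustrative only: rebuild OpenBLAS/GotoBLAS2 without affinity pinning.
# Either edit Makefile.rule so it contains the line "NO_AFFINITY = 1",
# or pass the flag on the make command line, which overrides the default.
cd OpenBLAS              # assumed location of the source tree
make clean
make NO_AFFINITY=1
sudo make NO_AFFINITY=1 install
```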

- Steve


On Tue, Apr 24, 2012 at 12:00 PM, beleites,claudia
<claudia.beleites at ipht-jena.de> wrote:
#
On Tue, Apr 24, 2012 at 12:45 PM, Stephen Weston
<stephen.b.weston at gmail.com> wrote:
I haven't had a chance to do any testing with multi-core (and
multi-thread on my Intel i5) but I have found a few glitches in R with
Atlas on openSUSE. Once I actually have details on how to patch the
source RPMs I'll be filing some bugs in openSUSE, which should be
similar to what other RPM-based systems (Fedora,
RHEL/CentOS/Scientific Linux) need to do to make all the magic work.

The "permanent solution" on Linux *should* be for some community
packager(s) to package all of the tools within the distros. Dirk
Eddelbuettel has been doing it for Debian / Ubuntu as long as I can
remember, but the other distros haven't been so fortunate. I'm
nibbling around the edges on openSUSE, though. We'll have some of this
in openSUSE 12.2 if I have anything to say about it. ;-)
#
On 24 April 2012 at 15:45, Stephen Weston wrote:
| There's an interesting discussion entitled "all processes run on
| one CPU core" at:
| 
|     https://github.com/ipython/ipython/issues/840
| 
| Someone was experiencing a very similar problem to the one that
| Claudia described using GotoBLAS2 with IPython and NumPy.
| Apparently it was fixed by recompiling GotoBLAS2 with the
| "NO_AFFINITY" parameter set to "1" in Makefile.rule, and then
| rebuilding "NumPy".
| 
| It seems pretty strange, but GotoBLAS2/OpenBLAS may be modifying
| the affinity of the R process by calling sched_setaffinity() when
| it is initialized, and that is causing the problems that Claudia
| and Martin have seen.
| 
| So perhaps the solution is to recompile GotoBLAS2/OpenBLAS with
| NO_AFFINITY=1, and then rebuild R with it.

Good discussion, but one important nit: there is never a need to rebuild R (provided
you have an external / dynamically linked BLAS). 

Just restart R.

Dirk
#
On 24 April 2012 at 16:53, M. Edward (Ed) Borasky wrote:
| On Tue, Apr 24, 2012 at 12:45 PM, Stephen Weston
| <stephen.b.weston at gmail.com> wrote:
| > There's an interesting discussion entitled "all processes run on
| > one CPU core" at:
| >
| >     https://github.com/ipython/ipython/issues/840
| >
| > Someone was experiencing a very similar problem to the one that
| > Claudia described using GotoBLAS2 with IPython and NumPy.
| > Apparently it was fixed by recompiling GotoBLAS2 with the
| > "NO_AFFINITY" parameter set to "1" in Makefile.rule, and then
| > rebuilding "NumPy".
| >
| > It seems pretty strange, but GotoBLAS2/OpenBLAS may be modifying
| > the affinity of the R process by calling sched_setaffinity() when
| > it is initialized, and that is causing the problems that Claudia
| > and Martin have seen.
| >
| > So perhaps the solution is to recompile GotoBLAS2/OpenBLAS with
| > NO_AFFINITY=1, and then rebuild R with it.
| >
| > - Steve
| 
| I haven't had a chance to do any testing with multi-core (and
| multi-thread on my Intel i5) but I have found a few glitches in R with
| Atlas on openSUSE. Once I actually have details on how to patch the
| source RPMs I'll be filing some bugs in openSUSE, which should be
| similar to what other RPM-based systems (Fedora,
| RHEL/CentOS/Scientific Linux) need to do to make all the magic work.
| 
| The "permanent solution" on Linux *should* be for some community
| packager(s) to package all of the tools within the distros. Dirk
| Eddelbuettel has been doing it for Debian / Ubuntu as long as I can

Too much credit. I "merely" take care of R (and Octave earlier on). This is
distributed work, and the Atlas (and other BLAS) maintainers, starting with
Camm and now Sylvestre, are doing an amazing job. I simply know how to reuse
that to R's benefit.

Dirk
#
On Tue, Apr 24, 2012 at 5:39 PM, Dirk Eddelbuettel <edd at debian.org> wrote:

Speaking of packaging for distros, it looks like the OpenSUSE Build
Service has some semi-automated setup for grabbing Perl packages from
CPAN and wrapping them up as RPMs. I haven't dug into it, but I've
been seeing all sorts of useful things show up in the repositories. If
it can be done for CPAN using the OpenSUSE Build Service testing /
building infrastructure, it ought to be possible for CRAN as well. And
the infrastructure is capable of packaging for all the major distros,
not just openSUSE.
#
This is getting off-topic for r-sig-hpc ...
On 24 April 2012 at 19:10, M. Edward (Ed) Borasky wrote:
| On Tue, Apr 24, 2012 at 5:39 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
| 
| > Too much credit. I "merely" take care of R (and Octave earlier on). This is
| > distributed work, and the Atlas (and other BLAS) maintainers, starting with
| > Camm and now Sylvestre, are doing an amazing job. I simply know how to reuse
| > that to R's benefit.
| 
| Speaking of packaging for distros, it looks like the OpenSUSE Build
| Service has some semi-automated setup for grabbing Perl packages from
| CPAN and wrapping them up as RPMs. I haven't dug into it, but I've
| been seeing all sorts of useful things show up in the repositories. If
| it can be done for CPAN using the OpenSUSE Build Service testing /
| building infrastructure, it ought to be possible for CRAN as well. And
| the infrastructure is capable of packaging for all the major distros,
| not just openSUSE.

Go for it. 

After roughly a decade's work, and around four or five different attempts, we
now have an archive for 'apt-get r-cran-$ANYTHING' for Ubuntu (that is,
Michael Rutter's PPA on launchpad covering different Ubuntu builds) as well
as one for Debian (that is, Don Armstrong's debian-r.debian.net covering
Debian testing and now stable as well).

So it can be done, but there is a lot of detailed work underneath it. And
it's nice to have, especially for clusters as it minimizes per-node work and
keeps them in sync.

And it all started with the cran2deb work we did based on Albrecht's script
written initially for ... OpenSUSE and then ported to Debian. What goes
around comes around.

Dirk
#
I was able to confirm that when I built R using OpenBLAS on my
Linux machine, my CPU affinity was modified right at the
beginning of the R session:

  $ grep Cpus_allowed /proc/self/status
  Cpus_allowed: ffffffff,ffffffff
  Cpus_allowed_list:    0-63
  $ bin/R
  > readLines('/proc/self/status')[32]
  [1] "Cpus_allowed:\t00000000,00000001"

I then confirmed that this causes problems for parallel packages
such as "parallel" by trying to use all six cores of my machine
using the "mclapply" function:

  > library(parallel)
  > cores <- detectCores()
  > mclapply(1:cores, function(i) repeat sqrt(3.14159), mc.cores=cores)

When I executed "top" from another window and pressed "1", it
showed that only one core was being used, and there were six R
sessions, each getting 17% of the CPU.

I also confirmed that "Cpus_allowed" was being set to the same
value for each of the workers:

  > mclapply(1:cores, function(i) readLines('/proc/self/status')[32],
mc.cores=cores)
  [[1]]
  [1] "Cpus_allowed:\t00000000,00000001"

  [[2]]
  [1] "Cpus_allowed:\t00000000,00000001"

  [[3]]
  [1] "Cpus_allowed:\t00000000,00000001"

  [[4]]
  [1] "Cpus_allowed:\t00000000,00000001"

  [[5]]
  [1] "Cpus_allowed:\t00000000,00000001"

  [[6]]
  [1] "Cpus_allowed:\t00000000,00000001"

That is definitely not what you want to see, and explains why
"mclapply" is only able to use one core.

When I rebuilt and reinstalled OpenBLAS after editing
Makefile.rule so that it contained the line:

  NO_AFFINITY = 1

and then restarted R, the problem went away:

  $ bin/R
  > readLines('/proc/self/status')[32]
  [1] "Cpus_allowed:\tffffffff,ffffffff"

This time when I ran "mclapply", "top" confirmed that I was
using all six cores at about 100%.

I didn't try this experiment with the older GotoBLAS2, but I
believe the results would be the same.

- Steve
On Tue, Apr 24, 2012 at 8:37 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
#
[ Sylvestre: I am tossing you in the middle of a thread here. We may have a
buglet in OpenBLAS where NO_AFFINITY=1 might be a good config value to add. ]

Steve,

Nice work, and mostly confirming against OpenBLAS, Atlas and a local (old)
GotoBLAS2.
On 25 April 2012 at 10:24, Stephen Weston wrote:
| I was able to confirm that when I built R using OpenBLAS on my
| Linux machine, my CPU affinity was modified right at the
| beginning of the R session:
| 
|   $ grep Cpus_allowed /proc/self/status
|   Cpus_allowed: ffffffff,ffffffff
|   Cpus_allowed_list:    0-63
|   $ bin/R
|   > readLines('/proc/self/status')[32]
|   [1] "Cpus_allowed:\t00000000,00000001"

What kernel is that?  On 3.0.0-17 (Ubuntu 11.10, infrequently rebooted) I get 

  edd at max:~$ grep Cpus_allowed /proc/self/status
  Cpus_allowed:   ff
  Cpus_allowed_list:      0-7
  edd at max:~$

| I then confirmed that this causes problems for parallel packages
| such as "parallel" by trying to use all six cores of my machine
| using the "mclapply" function:
| 
|   > library(parallel)
|   > cores <- detectCores()
|   > mclapply(1:cores, function(i) repeat sqrt(3.14159), mc.cores=cores)
| 
| When I executed "top" from another window and pressed "1", it
| showed that only one core was being used, and there were six R
| sessions, each getting 17% of the CPU.

When I run these three commands as a single line for r (from the littler package)

  edd at max:~$ r -e 'library(parallel);  cores <- detectCores(); print(cores); mclapply(1:cores, function(i) repeat sqrt(3.14159), mc.cores=cores)'
  [1] 8
  ^C
  edd at max:~$ 

I also get just one core covered. That is with 

  edd at max:~$ COLUMNS=94 dpkg -l|grep "blas\|atlas" | cut -c-78
  ii  gotoblas2-helper  0.1-12.local.1    GotoBLAS2 helper
  ii  libblas-dev       1.2.20110419-2ubu Basic Linear Algebra Subroutines 3, st
  ii  libblas-test      1.2.20110419-2ubu Basic Linear Algebra Subroutines 3, te
  ii  libblas3gf        1.2.20110419-2ubu Basic Linear Algebra Reference impleme
  ii  libopenblas-base  0.1alpha2.2-3     Optimized BLAS (linear algebra) librar
  ii  libopenblas-dev   0.1alpha2.2-3     Optimized BLAS (linear algebra) librar
  edd at max:~$ 

where OpenBLAS provides BLAS as default.

That was after I had removed Atlas, which is still my default. So if I
reinstall Atlas (which "ranks higher" in the defaults and hence replaces
OpenBLAS), everything is fine -- eight cores used.

  edd at max:~$ COLUMNS=94 dpkg -l|grep "blas\|atlas" | cut -c-78
  ii  gotoblas2-helper  0.1-12.local.1    GotoBLAS2 helper
  ii  libatlas-base-dev 3.8.4-3build1     Automatically Tuned Linear Algebra Sof
  ii  libatlas-dev      3.8.4-3build1     Automatically Tuned Linear Algebra Sof
  ii  libatlas3gf-base  3.8.4-3build1     Automatically Tuned Linear Algebra Sof
  ii  libblas-dev       1.2.20110419-2ubu Basic Linear Algebra Subroutines 3, st
  ii  libblas-test      1.2.20110419-2ubu Basic Linear Algebra Subroutines 3, te
  ii  libblas3gf        1.2.20110419-2ubu Basic Linear Algebra Reference impleme
  ii  libopenblas-base  0.1alpha2.2-3     Optimized BLAS (linear algebra) librar
  ii  libopenblas-dev   0.1alpha2.2-3     Optimized BLAS (linear algebra) librar
  edd at max:~$ 
 
 
| I also confirmed that "Cpus_allowed" was being set to the same
| value for each of the workers:
| 
|   > mclapply(1:cores, function(i) readLines('/proc/self/status')[32],
| mc.cores=cores)
|   [[1]]
|   [1] "Cpus_allowed:\t00000000,00000001"
| 
|   [[2]]
|   [1] "Cpus_allowed:\t00000000,00000001"
| 
|   [[3]]
|   [1] "Cpus_allowed:\t00000000,00000001"
| 
|   [[4]]
|   [1] "Cpus_allowed:\t00000000,00000001"
| 
|   [[5]]
|   [1] "Cpus_allowed:\t00000000,00000001"
| 
|   [[6]]
|   [1] "Cpus_allowed:\t00000000,00000001"
| 
| That is definitely not what you want to see, and explains why
| "mclapply" is only able to use one core.
| 
| When I rebuilt and reinstalled OpenBLAS after editing
| Makefile.rule so that it contained the line:
| 
|   NO_AFFINITY = 1
| 
| and then restarted R, the problem went away:
| 
|   $ bin/R
|   > readLines('/proc/self/status')[32]
|   [1] "Cpus_allowed:\tffffffff,ffffffff"
| 
| This time when I ran "mclapply", "top" confirmed that I was
| using all six cores at about 100%.
| 
| I didn't try this experiment with the older GotoBLAS2, but I
| believe the results would be the same.

I can confirm this. Using the packages 

edd at max:~$ COLUMNS=94 dpkg -l|grep "blas\|atlas" | cut -c-78
ii  gotoblas2         1.13-1            GotoBLAS2
ii  gotoblas2-helper  0.1-12.local.1    GotoBLAS2 helper
ii  libblas-dev       1.2.20110419-2ubu Basic Linear Algebra Subroutines 3, st
ii  libblas-test      1.2.20110419-2ubu Basic Linear Algebra Subroutines 3, te
ii  libblas3gf        1.2.20110419-2ubu Basic Linear Algebra Reference impleme
edd at max:~$

where GotoBLAS2 (locally built, using the gotoblas2-helper package) now
provides BLAS, everything sticks to one core when running the mclapply.  

I guess I'd need to fix gotoblas2-helper and rebuild gotoblas2. Or stick
with / hope for a corrected OpenBLAS build.

Dirk
 
| - Steve
| 
|
| On Tue, Apr 24, 2012 at 8:37 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
| >
| > On 24 April 2012 at 15:45, Stephen Weston wrote:
| > | There's an interesting discussion entitled "all processes run on
| > | one CPU core" at:
| > |
| > |     https://github.com/ipython/ipython/issues/840
| > |
| > | Someone was experiencing a very similar problem to the one that
| > | Claudia described using GotoBLAS2 with IPython and NumPy.
| > | Apparently it was fixed by recompiling GotoBLAS2 with the
| > | "NO_AFFINITY" parameter set to "1" in Makefile.rule, and then
| > | rebuilding "NumPy".
| > |
| > | It seems pretty strange, but GotoBLAS2/OpenBLAS may be modifying
| > | the affinity of the R process by calling sched_setaffinity() when
| > | it is initialized, and that is causing the problems that Claudia
| > | and Martin have seen.
| > |
| > | So perhaps the solution is to recompile GotoBLAS2/OpenBLAS with
| > | NO_AFFINITY=1, and then rebuild R with it.
| >
| > Good discussion, but one important nit: there is never a need to rebuild R (provided
| > you have an external / dynamically linked BLAS).
| >
| > Just restart R.
| >
| > Dirk
| >
| > --
| > R/Finance 2012 Conference on May 11 and 12, 2012 at UIC in Chicago, IL
| > See agenda, registration details and more at http://www.RinFinance.com
#
Sorry to follow up on my own post from minutes ago, but maybe it is better
the way it is, as it allows one to explicitly spread tasks, each of which would
limit itself to one core. Isn't this the best way to avoid "clogging" when
we use, say, snow to get to eight cores, and each of the eight workers wants to
do BLAS work on eight cores? That was, after all, the initial issue in
Martin's email.

And when I use the affinity trick (now breaking the single line for display)

  edd at max:~$ r -e 'library(parallel); \
                   system(sprintf("taskset -p 0xffffffff %d", Sys.getpid())); \
                   cores <- detectCores(); \
                   print(cores); \
                   mclapply(1:cores, function(i) repeat sqrt(3.14159), mc.cores=cores)'
  pid 24436's current affinity mask: 1
  pid 24436's new affinity mask: ff
  [1] 8

all is well -- eight cores humming along just fine per htop under OpenBLAS.  

Maybe this is actually better as it effectively gives us a run-time toggle?

Dirk
#
On Apr 25, 2012, at 11:10 AM, Dirk Eddelbuettel wrote:

I'd argue that the problem is the affinity of 1 by default -- that is certainly not what you want or expect, and bad in most cases. It would make much more sense to restrict the *children* to one CPU each rather than the parent process, so that forked processes stay on their cores -- that is IMHO the only place where setting affinity makes any sense at all. Also, the system(..) trick is ugly and highly system-specific (you don't even know if you can access taskset). I would consider OpenBLAS' affinity setting a pretty bad bug (certainly from the user's point of view).

Cheers,
Simon
#
I agree with Simon.

- Not restricting the children doesn't lead to problems with either
implicit or explicit parallelization for me:
e.g. if I decide to run 3 explicitly parallel processes and have 12 cores,
I use GOTO_NUM_THREADS=4, and that works nicely.

- if users do not have access to taskset, then restriction to 1 core is
certainly more problematic than no restriction.

Claudia



Am 25.04.2012 17:21, schrieb Simon Urbanek:

#
FWIW: Based on this discussion, I have added the capability to manage the CPU affinity mask in R-devel. In the short term you can use mcaffinity() to control the affinity yourself, so for example mcaffinity(1:ncores) will allow R to run on any core, or you can restrict your parallel jobs to certain CPU ranges. Eventually, R/parallel should handle that for you by partitioning cores to child processes (we are missing mandatory core detection and tracking to do that); for now the closest you get is mc.affinity in mcparallel() to spread the children yourself.
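A minimal sketch of the API described above, assuming an R-devel build (at the time of this thread) where mcaffinity() is available; the exact behavior in released versions may differ:

```r
## Sketch only -- requires a parallel package with mcaffinity() support.
library(parallel)

ncores <- detectCores()
mcaffinity(1:ncores)       # undo a restrictive mask: allow any core
print(mcaffinity())        # query the current affinity

## Spread children yourself, one core per job, via mc.affinity:
jobs <- lapply(seq_len(ncores),
               function(i) mcparallel(sqrt(3.14159), mc.affinity = i))
res <- mccollect(jobs)
```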

Cheers,
Simon
On Apr 25, 2012, at 12:54 PM, Claudia Beleites wrote:

#
On 25 April 2012 at 14:38, Simon Urbanek wrote:
| FWIW: Based on this discussion, I have added the capability to manage CPU affinity mask in R-devel. In the short term you can use mcaffinity() to control the affinity yourself, so for example mcaffinity(1:ncores) will allow R to run on any core or you can restrict your parallel jobs to certain CPU ranges. Eventually, R/parallel should handle that for you by partitioning cores to child processes (we are missing mandatory core detection and tracking to do that), for now the closest you get is mc.affinity in mcparallel() to spread children yourself.

That sounds like a great addition!  It is hard to find 'one size fits all'
settings, particularly for a distribution, and having control / a toggle to
adjust seems like a nice improvement.

Thanks, Dirk