bam (mgcv) not using the specified number of cores

6 messages · Andrew Crane-Droesch, Simon Wood

#
I am getting strange behavior when trying to fit models to large 
datasets using bam.  I am working on a 4-core machine, but I think that 
there may be 2 physical cores that the computer uses as 4 cores in some 
sense that I don't understand.

When I run bam using makeCluster(3), the model runs on one core.  
But when I run it with makeCluster(2), top shows me that three of my cores 
are taken up to full capacity, and my computer slows down or crashes.

How can I get it to truly run on 2 cores?

I'm on a thinkpad X230, running ubuntu.

Thanks,
Andrew
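
A machine that shows 4 cores in top but has 2 physical cores is usually using hyper-threading, so the logical and physical counts differ. The following is a minimal sketch of how one might check both counts from R before sizing a cluster. It uses only `detectCores()` from the parallel package; the `n_workers` rule is a hypothetical choice, and whether `logical = FALSE` is honoured depends on the platform and R version.

```r
library(parallel)

# Logical CPUs: the count that tools such as top display.
n_logical <- detectCores(logical = TRUE)

# Physical cores, where the platform can report them. This may be NA, or may
# simply equal the logical count where the distinction is not honoured.
n_physical <- detectCores(logical = FALSE)

cat("logical:", n_logical, " physical:", n_physical, "\n")

# Hypothetical rule: one worker per physical core, falling back to 1
# when the count is unavailable.
n_workers <- if (is.na(n_physical)) 1L else max(1L, as.integer(n_physical))
n_workers
```

The resulting `n_workers` could then be passed to `makeCluster()`.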
#
Hi Andrew,

Could you provide a bit more information, please. In particular the 
results of sessionInfo() and the code that caused this weird behaviour 
(+ an indication of dataset size).

best,
Simon
On 21/08/14 12:53, Andrew Crane-Droesch wrote:
#
Hi Simon,

Thanks for the reply.  I've tried to reproduce the error, but I don't 
know how to show output from `top` any other way than to attach 
screenshots, so please excuse that.

I'm attaching screenshots of what happens when I run with two and three 
cores.  In the former, it seems to be working on one core, and in the 
latter, it appears to be working on three.  When reproducing the error, 
I'm getting odd behavior that isn't entirely consistent -- sometimes it 
"behaves" and operates on the asked-for number of cores, and other times 
not.

I'm also attaching a screenshot showing terminal output from a remote 
cluster when I run my full model (N=67K) rather than a subset (N=7K) -- 
I get that error "Error in qr.qty(qrx, f) : right-hand side should have 
60650 not 118451 rows ..."  I suppose this is a memory overload 
problem?  Any suggestions on how to get bam to not call for more memory 
than the node has available would be welcome, though I suspect that is a 
supercomputing problem rather than an R problem.

Thanks,
Andrew

sessionInfo() for local machine:
1> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] mgcv_1.7-26  nlme_3.1-111

loaded via a namespace (and not attached):
[1] grid_3.0.2      lattice_0.20-23 Matrix_1.1-4
1>
On 08/21/2014 04:29 PM, Simon Wood wrote:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot from 2014-08-21 18:33:36.png
Type: image/png
Size: 576802 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20140821/41be1e87/attachment-0003.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot from 2014-08-21 18:34:42.png
Type: image/png
Size: 387719 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20140821/41be1e87/attachment-0004.png>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Screenshot from 2014-08-21 18:36:08.png
Type: image/png
Size: 388816 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20140821/41be1e87/attachment-0005.png>
#
Hi Simon,

(resending with all images as imgur so as to not bounce from list)

Thanks for the reply.  I've tried to reproduce the error, but I don't 
know how to show output from `top` any other way than with screenshots, 
so please excuse that.

Here are screenshots of what happens when I run with two
http://imgur.com/i26GKPo

and three
http://imgur.com/8SL7scy

cores.  In the former, it seems to be working on one core, and in the 
latter, it appears to be working on three.  When reproducing the error, 
I'm getting behavior that isn't entirely consistent -- sometimes it 
"behaves" and operates on the asked-for number of cores, and other times 
not.

I'm also attaching a screenshot
http://imgur.com/bJfuS6R
showing terminal output from a remote cluster when I run my full model 
(N=67K) rather than a subset (N=7K) -- I get that error "Error in 
qr.qty(qrx, f) : right-hand side should have 60650 not 118451 rows ..."  
I suppose this is a memory overload problem?  Any suggestions on how to 
get bam to not call for more memory than the node has available would be 
welcome, though I suspect that is a supercomputing problem rather than an 
mgcv problem.  I don't know much about memory management, except that R 
doesn't do it explicitly.

Thanks,
Andrew

On 08/21/2014 04:29 PM, Simon Wood wrote:
3 days later
#
Hi Andrew,

In some of the shots you sent, top was reporting several cores 
working. I think the problem here may be to do with the way bam is 
parallelized - at present only part of the computation is in parallel - 
the model matrix QR decomposition part. The smoothing parameter 
selection is still single-core (although we are working on that), so if 
you watch top, you'll usually see multi-core and single-core phases 
alternating with each other. The strategy works best in n>>p situations 
with few smoothing parameters.
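
The idea of a QR decomposition done in chunks can be sketched in base R. This is an illustration only and not mgcv's actual code: the matrix `X`, the block count of 4, and all variable names are assumptions made for the example. Each block of rows is reduced to its own R factor (the step that could run on separate cores), and the stacked R factors are then decomposed once more (the serial "stitching" step).

```r
set.seed(1)
n <- 1000  # rows (data); illustrative values only
p <- 5     # columns (coefficients)
X <- matrix(rnorm(n * p), n, p)

# Split the row indices into 4 equal blocks, as if one block went to each worker.
blocks <- split(seq_len(n), rep(1:4, each = n / 4))

# Per-block QR: this part is independent across blocks.
R_list <- lapply(blocks, function(i) qr.R(qr(X[i, , drop = FALSE])))

# Stitch: QR of the stacked p x p factors (serial, cost grows with block count).
R_comb <- qr.R(qr(do.call(rbind, R_list)))

# R factors are unique only up to row signs, so compare R'R with X'X.
ok <- isTRUE(all.equal(crossprod(R_comb), crossprod(X)))
ok
```

With many blocks and a modest n to p ratio, the per-block work shrinks while the stitching step grows, which is one way to picture why adding cores does not always help.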

For the case where you used 31 cores, there was a bug in earlier mgcv 
versions in which it was assumed that when the model matrix is split 
into chunks for processing, each chunk would have more rows than 
columns. If you upgrade to the current mgcv version then this is fixed. 
However using 31 cores is liable to actually be less efficient than 
using fewer cores with the n to p (number of data to number of 
coefficients) ratio that you seem to have. This is because the work 
being done by each core is rather little, so that the overhead of 
stitching the cores' work back together becomes too high. Using 
'use.chol=TRUE' would reduce the overheads here (although it uses a 
slightly less stable algorithm than the default).
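
As a rough sketch of the setup discussed here, the following fits a model with bam on a 2-worker cluster with 'use.chol=TRUE'. The data come from mgcv's `gamSim()` and the sample size, formula, and cluster size are assumptions chosen for illustration, not the model from this thread.

```r
library(mgcv)
library(parallel)

set.seed(1)
# Simulated example data; n = 20000 is an arbitrary illustrative size.
dat <- gamSim(1, n = 20000, dist = "normal", scale = 2, verbose = FALSE)

cl <- makeCluster(2)  # two worker processes

# cluster = cl parallelizes the model matrix decomposition only;
# use.chol = TRUE swaps the default QR update for a Cholesky-based one.
fit <- bam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
           data = dat, cluster = cl, use.chol = TRUE)

stopCluster(cl)  # always release the workers

length(coef(fit))
```

Watching top during the fit should show the alternating multi-core and single-core phases described above, although on data this small the parallel phase may be brief.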

best,
Simon
On 22/08/14 06:13, Andrew Crane-Droesch wrote:
#
oops, I just realised that the fix referred to below is in mgcv 1.8-3 - 
not yet on CRAN.
On 25/08/14 13:20, Simon Wood wrote:
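
A small sketch for checking whether the installed mgcv is at least the version named in the message above. The 1.8-3 threshold is taken from that message; the variable names are made up for the example.

```r
# Installed mgcv version, as a package_version object.
installed <- packageVersion("mgcv")

# TRUE if the installed version is 1.8-3 or later.
has_fix <- installed >= "1.8-3"

if (!has_fix) {
  message("mgcv ", format(installed), " predates 1.8-3; a newer version is needed.")
}
has_fix
```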