Skip to content

Peculiar timing result

6 messages · Paul Gilbert, Brian Ripley, Douglas Bates

#
I have been timing a particular model fit using lmer on several
different computers and came up with a peculiar result - the model fit
is considerably slower on a dual-core Athlon 64 using Goto's
multithreaded BLAS than on a single-core processor.

Here is the timing on a single-core Athlon 64 3000+ running under
today's R-devel with version 0.995-5 of the Matrix package.
0      241720.:  1.16440 0.335239  0.00000  1.78732 0.867209 0.382318  0.00000
  1      239722.:  1.94952 5.00000e-10 0.00933767  1.65999 0.858003
0.341520 0.00908757
  2      239580.:  1.95924 0.0884059 0.00933767  1.65308 0.857487
0.339296 0.00954718
  3      239215.:  2.60877 0.0765848 0.0177699  1.45739 0.843562
0.275100 0.0236849
  4      239204.:  2.62582 0.106670 0.0239698  1.41976 0.841086
0.261033 0.0267073
  5      239176.:  2.63149 0.0787924 0.0367185  1.37952 0.838527
0.245076 0.0301134
  6      239141.:  2.64949 0.108534 0.0594586  1.28846 0.832543
0.208404 0.0375456
  7      239049.:  2.64794 0.0789214 0.121782  1.10436 0.819711
0.126101 0.0524965
  8      239004.:  2.66044 0.117957 0.181505 0.932202 0.798982
0.0718116 0.0628958
  9      238944.:  2.66310 0.0819653 0.334477 0.631735 0.740855
0.258359 0.0806590
 10      238893.:  2.72626 0.0975205 0.653432 0.703912 0.666067
0.109922 0.201809
 11      238892.:  2.74381 0.111146 0.666155 0.693544 0.662000 0.104060 0.207591
 12      238888.:  2.75052 0.0990238 0.689199 0.694588 0.655781
0.106516 0.216460
 13      238861.:  2.80325 0.126935  1.05912 0.733914 0.556162 0.159296 0.360938
 14      238832.:  2.82656 0.117617  1.59471 0.607916 0.441371
0.0749944 0.976142
 15      238811.:  2.87350 0.136332  1.59046 0.653141 0.353763 0.226061  1.79285
 16      238810.:  2.87663 0.125135  1.58992 0.656808 0.352605 0.220488  1.79282
 17      238806.:  2.89342 0.141551  1.58607 0.676523 0.344212 0.181833  1.79268
 18      238804.:  2.90080 0.125137  1.56624 0.682921 0.261295 0.180598  1.74325
 19      238802.:  2.91950 0.128569  1.56836 0.680436 0.336051 0.159940  1.80400
 20      238801.:  2.92795 0.134762  1.56597 0.685121 0.331695 0.145547  1.80414
 21      238801.:  2.93741 0.125667  1.56139 0.687827 0.332700 0.138854  1.81495
 22      238800.:  2.94588 0.131757  1.55294 0.687909 0.330414 0.137834  1.82826
 23      238799.:  2.96867 0.129716  1.52943 0.688678 0.323171 0.139912  1.84615
 24      238799.:  2.98994 0.133378  1.52188 0.700038 0.337387 0.131403  1.77623
 25      238799.:  3.00312 0.135308  1.51475 0.697550 0.311750 0.145683  1.78037
 26      238799.:  3.00461 0.129920  1.51083 0.697666 0.306722 0.138745  1.81188
 27      238799.:  3.00504 0.134882  1.50539 0.696745 0.302949 0.135897  1.84405
 28      238799.:  3.00422 0.134013  1.47947 0.698115 0.303243 0.133806  1.86486
 29      238799.:  3.00819 0.134378  1.48185 0.701871 0.307097 0.134637  1.84996
 30      238799.:  3.01313 0.134279  1.49098 0.702883 0.304788 0.133682  1.86254
 31      238799.:  3.01291 0.134253  1.49060 0.701818 0.303155 0.133771  1.84613
 32      238799.:  3.01142 0.134314  1.48921 0.701782 0.303589 0.134439  1.84653
 33      238799.:  3.01174 0.134315  1.48926 0.701641 0.304120 0.134145  1.84635
 34      238799.:  3.01175 0.134304  1.48942 0.701740 0.303762 0.134185  1.84649
 35      238799.:  3.01173 0.134307  1.48937 0.701724 0.303809 0.134206  1.84647
[1] 43.10  3.78 48.41  0.00  0.00


(If you run the timing yourself and don't want to see the iteration
output, take the msV=1 out of the control list.  I keep it in there so
I can monitor the progress.)

If I time the same model fit on a dual-core Athlon 64 X2 3800+ with
the same version of R, BLAS and Matrix package, the timing ends up
with something like

90 140 235 0 0

I do see that the multithreading is working on a calculation that is
essentially BLAS-bound such as
[1] 0.57 0.02 0.60 0.00 0.00

On the X2 processor it still takes about 0.6 seconds user time but
only 0.3 seconds elapsed time when the machine is otherwise idle and
both cores are available for the calculation.

Any suggestions why the dual-core processor is so much slower than the
single core processor?

By the way, I would be interested in accumulating timings of this
model fit on other systems.  If you do time it please send me
(off-list) a summary of the version of R, version of the accelerated
BLAS if you use them, processor speed and configuration (i.e.
multiprocessor, multicore, etc.) and, if you know it, memory speed.

This is an example of a complex multilevel model with crossed grouping
factors fit to a relatively large (30000 observations on 10000
students, 1400 teachers and 80 schools) data set.
#
Doug

This is probably not your reason, but I am finding my dual core Athlon 
64 is much slower running 64 bit Linux and R than it was running 32 bit 
Linux and R. All the programs are bigger.  (Some, like the clock applet, 
are a lot bigger for no obvious reason.)  The difference is enough to 
put my meager 1GB machine into swapping much more, with the result that 
it is a lot slower.

Paul
Douglas Bates wrote:

            
#
On Fri, 3 Mar 2006, Douglas Bates wrote:

            
Is there a Goto BLAS tuned for that chip?  I can only see one tuned for an 
(unspecified) Opteron.  L1 and L2 cache sizes do sometimes matter a lot 
for tuned BLAS, and (according to the AMD site I just looked up) the X2 
3800+ only has a 512Kb per core L2 cache.  Opterons have a 1Mb L2 cache.

Also, the very large system time seen in the dual-core run is typical of 
what I see when pthreads is not working right, and I suggest you try a 
limit of one thread (see the R-admin manual).  On our dual-processor 
Opteron 248 that ran in 44 secs instead of 328.
control = list(nit=0,grad=0,msV=1)))
....
#
I don't think this calculation is memory-bound at all and I would be
surprised if changing to a 32-bit environment would change things.  I
do have a 32-bit chroot environment on these machines (needed for
things like wine and acroread) so I'll try that out but I think I will
need to use Atlas as the accelerated BLAS rather than Goto's BLAS.  I
imagine the Opteron/Athlon 64 version of Goto's BLAS assumes a 64-bit
environment.
On 3/3/06, Paul Gilbert <pgilbert at bank-banque-canada.ca> wrote:
7 days later
#
Here is a summary of some results on a dual Opteron 252 running FC3

64-bit gcc 3.4.5
R's blas		34.83  3.45 38.56
ATLAS			36.70  3.28 40.14
ATLAS multithread	76.85  5.39 82.29
Goto 1 thread		36.17  3.44 39.76
Goto multithread       178.06 345.97 467.99
ACML			49.69  3.36 53.23

64-bit gcc 4.1.0
R's blas		34.98  3.49 38.55
32-bit gcc 3.4.5
R's blas		33.72  3.27 36.99
32-bit gcc 4.1.0
R's blas		34.62  3.25 37.93

The timings are not that repeatable, but the message seems clear that
this problem does not benefit from a tuned BLAS and the overhead from 
multiple threads is harmful.  (The gcc 4.1.0 results took fewer 
iterations, which skews the results in its favour.)

And my 2GHz Pentium M laptop under Windows gave 39.96  3.68 44.06.

Clearly the Goto BLAS has a problem here: the results are slower on a dual 
252 than a dual 248 (see below).
On Fri, 3 Mar 2006, Prof Brian Ripley wrote:

            

  
    
2 days later
#
On 3/11/06, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
Thanks for the information on the timings.  It happens that this
message ended up in my R-help folder and I only got around to reading
that folder today.

Is it ok with you if I forward this message to Simon Urbanek?  I am
having similar difficulties in the timing with R on a dual-core Intel
MacBook.