
Matrix multiplication

15 messages · Patrik Waldmann, Brian G. Peterson, Sean O'Riordain +3 more

#
What does automatically mean? Is X%*%t(X) parallelized?

Patrik

        
On Mar 12, 2012, at 5:40 AM, Patrik Waldmann wrote:

            
The parallel package is for *explicit* parallelization. R already does implicit parallelization (using OpenMP or multi-threaded BLAS or both) automatically - this includes matrix multiplication.

Cheers,
Simon
#

        
On Tue, 2012-03-13 at 10:23 +0100, Patrik Waldmann wrote:
Matrix multiplication %*% is a BLAS function, as Simon and Claudia already told you.

So, if your BLAS does multithreaded matrix multiplication, it will use
multiple threads 'implicitly', as Simon pointed out.

Because the actual matrix multiplication operation is carried out by the
BLAS, R doesn't really care how the BLAS does it... it could be on one
thread (non-parallel), on multiple threads (as with gotoblas or openblas
configured that way) or on a GPU (as with Magma BLAS), and R would not
care.

'Explicit' parallelization is for taking some other code in R and
explicitly telling R to use a certain number of worker nodes to
accomplish the task.  This type of parallelization is often used for
simulation and optimization, where the block of code to be parallelized
may be very large.

Be aware that there can be unintended negative interactions between
implicit and explicit parallelization.  On cluster nodes I tend to
configure the BLAS to use only one thread to avoid resource contention
when all cores are doing explicit parallelization.
#
Brian

Thanks for spelling this out for those of us who are a bit slow.
(Newbie questions below)
On 12-03-13 08:54 AM, Brian G. Peterson wrote:
Is there an easy way to know if the R I am using has been compiled with 
multi-thread BLAS support?
How do you do this? Does it need to be done when you are compiling R, or 
can it be done on the fly while running R processes?

Thanks, Paul
#
On Tue, 2012-03-13 at 12:40 -0400, Paul Gilbert wrote:
<... snip ...>
BLAS should be 'plug and play', as R is usually compiled to use a shared
object BLAS.  As such, installing the BLAS on your machine (and
appropriately configuring it) should 'just work' with the new BLAS when
you restart R.

Dirk et al. wrote a paper, now a bit dated, that benchmarked some of
the BLAS libraries; it should have some additional details.
 
<...snip...>
Some BLAS, like gotoblas, support an environment variable to change the
number of cores to be used.  This can be changed at run-time.  Others,
like the mkl, are always multithreaded.  Others, like ATLAS, can be
compiled in either single threaded or multi-threaded modes.  
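As a sketch of what that looks like in practice (hedged: which variable is honoured, and whether a change made from an already-running session takes effect, depends on the particular BLAS build):

```r
## Sketch only -- variable names depend on the BLAS build.
## Safest is to set the variable before R starts, e.g. from the shell:
##   GOTO_NUM_THREADS=1 R --vanilla
Sys.setenv(GOTO_NUM_THREADS = "1")  # GotoBLAS (and early OpenBLAS)
Sys.setenv(OMP_NUM_THREADS  = "1")  # OpenMP-based BLAS builds
```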

So, for me, on my cluster nodes, I use a single-threaded BLAS, assuming
that *explicit* parallelization will be the primary driver of CPU load,
and not wanting to over-commit the processor when 12 calculations each
try to spawn 12 threads in the BLAS.  On other machines, I might use a
multithreaded BLAS like gotoblas so that I have some flexibility (though
apparently unlike Claudia, I rarely change it in practice).

Regards,

   - Brian
#
On 12-03-13 12:50 PM, Brian G. Peterson wrote:
(I have a long history of getting things that should 'just work' to 
'just not work'.) But I didn't really state my question very well. I'm 
really wondering about two related situations: first, how can I confirm, 
after a change to the underlying system, that R is using the new 
configuration; and second, if I am running benchmarks in R, is there an 
easy way to record the underlying configuration that is being used?

Thanks again,
Paul
#
On Tue, 2012-03-13 at 15:05 -0400, Paul Gilbert wrote:
I usually use 'top' in another 'screen' window.  In the case of
explicit parallelization, you'll see more R processes.  In the case of
implicit parallelization, you'll see (at least) the CPU utilization go
up to or over 100% on the single R process (and up to 100% on each
individual core) while the calculation happens.
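(A small aid for matching the 'top' display to the right process: print the session's PID first.)

```r
Sys.getpid()  # note this PID, then watch its CPU% in 'top' while the job runs
```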

Good luck,

   - Brian
#
On Mar 13, 2012, at 3:05 PM, Paul Gilbert wrote:

            
You can check whether you're leveraging multiple cores simply via system.time:
user  system elapsed 
  6.860   0.020   0.584 

The above is clearly using a threaded BLAS (here I'm using ATLAS), because the elapsed time is much smaller than the CPU time, so it was computed in parallel. In contrast, this is what you get using the single-threaded R BLAS on the same machine:
user  system elapsed 
 10.480   0.020  10.505 

It takes about 18x longer - this is a combination of the number of cores and the less optimized BLAS - and the elapsed time is greater than or equal to the CPU time, i.e. single-threaded.
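(The expressions that produced the timings above aren't shown in the thread; a minimal check along the same lines, with an illustrative matrix size, would be:)

```r
## Illustrative sketch -- matrix size chosen arbitrarily.
n <- 2000
X <- matrix(rnorm(n * n), n, n)
## With a threaded BLAS, 'elapsed' comes out well below 'user';
## with a single-threaded BLAS, elapsed >= user.
system.time(X %*% t(X))
```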

As for recording the underlying configuration - that is not really possible in general; you have to know what you enabled/compiled. In the case of a shared BLAS implementation you may be able to infer it from the library name, but for a static BLAS it is close to impossible to figure out.

Cheers,
Simon
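(A note from hindsight: R versions much newer than the 2.14 discussed in this thread can report the linked BLAS/LAPACK from within the session, which answers much of Paul's question. Availability varies by R version:)

```r
## Not available on the R 2.14 of this thread -- modern R only:
sessionInfo()     # recent versions print the BLAS and LAPACK library paths
extSoftVersion()  # versions of linked external software (recent versions
                  # include a BLAS entry)
```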
#
:-) Yes, I do change it in practice, because I have steps where I use
explicit parallelization via multicore or snow and I switch between the
3 different parallel computation types. Our server has 2 hex-core CPUs
but only 8 GB RAM. The spectroscopic data analysis I do usually isn't
really hard computationally, but the data sets are often uncomfortably
large for the server. With explicit parallelization, RAM often restricts
me to 2 or 3 threads.

Here's what I observe and why I switch back and forth:

If the calculation is implicitly parallel with the optimized BLAS,
that's the way to go: easiest on RAM, fast, no coding effort whatsoever.
Just lean back and enjoy seeing all cores hard at work.
There are functions like %*% and (t)crossprod that use all 12 cores (or
whatever I restrict GOTO_NUM_THREADS to).

Other functions, e.g. loess(), never seem to use more than the 6 cores
of one CPU. For these, I'm better off with explicit parallelization with
2 snow nodes and GOTO_NUM_THREADS = 6 (I have to execute taskset on each
node). However, snow (and multicore) need more RAM, as the data must be
loaded in each node. That would mean e.g. GOTO_NUM_THREADS = 11 (to
leave an "alibi core" for my colleague) in the main R session, and e.g.
2 nodes with GOTO_NUM_THREADS = 6 or 3 nodes with GOTO_NUM_THREADS = 4.

Multicore doesn't make use of the implicit parallelization of the BLAS.
But it is easier to use than snow: no cluster setup required, no hassle
with exporting all the variables, etc.
So, if the function doesn't have any implicit parallelization anyway, I
just change lapply to mclapply, and that's it.
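(Claudia's one-line change, sketched with a toy task; the function and sizes are purely illustrative:)

```r
library(multicore)  # on R >= 2.14 this also lives in 'parallel' as parallel::mclapply

## Some embarrassingly parallel work with no implicit BLAS parallelism:
f <- function(i) mean(replicate(100, sd(rnorm(1000))))

res_serial   <- lapply(1:8, f)    # one core
res_parallel <- mclapply(1:8, f)  # forked workers; same result structure
```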

Best,

Claudia
#
On 12-03-13 09:59 PM, Simon Urbanek wrote:
Perhaps I am misreading something. I don't see elapsed < CPU, so it does 
not seem quite as obvious as you suggest, but I certainly see the 
difference with the single-threaded case below.
you may be able to infer that from the library name, but for static BLAS 
it is close to impossible to figure it out.

I was afraid this would be the case. It is often hard to keep track even 
when I'm compiling R myself, and I guess if you don't compile yourself 
there is not much hope of knowing what you really have.
(Food for thought when considering timing comparisons.)

Thanks,
Paul
#
On 12-03-14 10:47 AM, Claudia Beleites wrote:
How does this work?  I can imagine problems where I could use 
Sys.setenv() within an R function to speed up different parts of a 
calculation in different ways, but if goto is reading an environment 
variable every time it does a calculation, that would slow it down a 
whole lot.

Thanks,
Paul
#
On Mar 14, 2012, at 12:53 PM, Paul Gilbert wrote:

            
0.584 < 6.86
It is separate from R (at least as long as you have shared BLAS enabled, which is the default for most distributions) -- so it's really about what you point your BLAS to.

But, yes, timing comparisons are pretty meaningless unless you specify everything you have (this is how some can post benchmarks against strawman installations and claim to be faster although there is in fact no difference).

Cheers,
Simon
#
On 12-03-14 01:24 PM, Simon Urbanek wrote:
Once again I was reading system.time the wrong way. I should know by 
now. Thanks,
Paul
#
Claudia,
On Mar 14, 2012, at 10:47 AM, Claudia Beleites wrote:

            
Snow does but not multicore - the benefit of multicore is that all data at the point of parallelization is shared and thus it doesn't use extra memory (at least on modern OSes that support COW fork). The only extra RAM will be whatever is allocated later for the computation that is run in parallel.
Actually, it does:
user  system elapsed 
 10.136   0.568   0.664 

However, you really want to control the interplay of the explicit and implicit parallelization. This is where the parallel package comes into play (and why it includes multicore) so that for the explicit + R-implicit parallelization (not BLAS, though) we can control the maximal load (and RNG).
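(A sketch of what Simon describes: the parallel package bundles multicore's mclapply and adds a reproducible parallel RNG and control over the number of workers. The task and counts here are illustrative:)

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # RNG stream designed for parallel use
set.seed(42)

## Cap the explicit parallelism so it doesn't fight a threaded BLAS:
res <- mclapply(1:4, function(i) mean(rnorm(1e5)), mc.cores = 2)
```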

Cheers,
Simon
#
Simon and Paul,

seems I have trouble with some part of the configuration on the server:
I'm no longer able to change the number of threads for the gotoblas; it
always stays at 6 (which is fortunately quite a sensible number).
So, before believing what I wrote yesterday, please try it yourself.
Yes, you are right: unlike snow, multicore does not need copies of the
same data.

However, in practice, the stuff I parallelize explicitly is often
bootstrap or similar calculations, so I do need more RAM because each
thread uses its own resampled data set. Which of course is not

I get
User      System verstrichen
     13.751       2.570       4.527
and see 6 cores working.

with multicore:
[1] 12

First try: mc.cores = 2, as 2 x 6 = 12:
Timing stopped at: 123.457 266.559 195.029


without mc.cores, in case that screwed up something:
Timing stopped at: 2569.413 5758.595 2075.161
I see 4 cores working at 100%


I do have the problem that I always need to execute
system(sprintf('taskset -p 0xffffffff %d', Sys.getpid()))
at the beginning of the R session. With snow, I execute that on the
nodes as well, but with multicore I don't know how to do that.

So probably the configuration is really messed up...
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] multicore_0.1-7

loaded via a namespace (and not attached):
[1] tools_2.14.1


Best,

Claudia