
Matrix multiplication

15 messages · Patrik Waldmann, Brian G. Peterson, Sean O'Riordain +3 more

#
What does automatically mean? Is X%*%t(X) parallelized?

Patrik

        
On Mar 12, 2012, at 5:40 AM, Patrik Waldmann wrote:

            
The parallel package is for *explicit* parallelization. R already does implicit parallelization (using OpenMP or multi-threaded BLAS or both) automatically - this includes matrix multiplication.

Cheers,
Simon
#

        
On Tue, 2012-03-13 at 10:23 +0100, Patrik Waldmann wrote:
Matrix multiplication %*% is a BLAS function, as Simon and Claudia already told you.

So, if your BLAS does multithreaded matrix multiplication, it will use
multiple threads 'implicitly', as Simon pointed out.

Because the actual matrix multiplication operation is carried out by the
BLAS, R doesn't really care how the BLAS does it... it could be on one
thread (non-parallel), on multiple threads (as with gotoblas or openblas
configured that way) or on a GPU (as with Magma BLAS), and R would not
care.

'Explicit' parallelization is for taking some other code in R and
explicitly telling R to use a certain number of worker nodes to
accomplish the task.  This type of parallelization is often used for
simulation and optimization, where the block of code to be parallelized
may be very large.

Be aware that there can be unintended negative interactions between
implicit and explicit parallelization.  On cluster nodes I tend to
configure the BLAS to use only one thread to avoid resource contention
when all cores are doing explicit parallelization.
#
Brian

Thanks for spelling this out for those of us who are a bit slow.
(Newbie questions below)
On 12-03-13 08:54 AM, Brian G. Peterson wrote:
Is there an easy way to know if the R I am using has been compiled with 
multi-thread BLAS support?
How do you do this? Does it need to be done when you are compiling R, or 
can it be done on the fly while running R processes?

Thanks, Paul
#
On Tue, 2012-03-13 at 12:40 -0400, Paul Gilbert wrote:
<... snip ...>
BLAS should be 'plug and play', as R is usually compiled to use a shared
object BLAS.  As such, installing the BLAS on your machine (and
appropriately configuring it) should 'just work' with the new BLAS when
you restart R.

Dirk et al. wrote a paper, now a bit dated, that benchmarked some of
the BLAS libraries; it should have some additional details.
 
<...snip...>
Some BLAS, like gotoblas, support an environment variable to change the
number of cores to be used.  This can be changed at run-time.  Others,
like the mkl, are always multithreaded.  Others, like ATLAS, can be
compiled in either single threaded or multi-threaded modes.  
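As a sketch of what that looks like in practice (hedged: which variable is honoured, and whether a change made from an already-running session takes effect, depends on the particular BLAS build):

```r
## Sketch only -- variable names depend on the BLAS build.
## Safest is to set the variable before R starts, e.g. from the shell:
##   GOTO_NUM_THREADS=1 R --vanilla
Sys.setenv(GOTO_NUM_THREADS = "1")  # GotoBLAS (and early OpenBLAS)
Sys.setenv(OMP_NUM_THREADS  = "1")  # OpenMP-based BLAS builds
```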

So, for me, on my cluster nodes, I use a single-threaded BLAS, assuming
that *explicit* parallelization will be the primary driver of CPU load,
and not wanting to over-commit the processor when 12 calculations each
try to spawn 12 threads in the BLAS.  On other machines, I might use a
multithreaded BLAS like gotoblas so that I have some flexibility (though
apparently unlike Claudia, I rarely change it in practice).

Regards,

   - Brian
#
On 12-03-13 12:50 PM, Brian G. Peterson wrote:
(I have a long history of getting things that should 'just work' to 
'just not work'.) But I didn't really state my question very well. I'm 
really wondering about two related situations: first, how can I confirm, 
after a change to the underlying system, that R is using the new 
configuration; and second, if I am running benchmarks in R, is there an 
easy way to record the underlying configuration that is being used?

Thanks again,
Paul
#
On Tue, 2012-03-13 at 15:05 -0400, Paul Gilbert wrote:
I usually use 'top' in another 'screen' window.  In the case of
explicit parallelization, you'll see more R processes.  In the case of
implicit parallelization, you'll see (at least) the CPU utilization go
up to or over 100% on the single R process (and up to 100% on each
individual core) while the calculation happens.
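(A small aid for matching the 'top' display to the right process: print the session's PID first.)

```r
Sys.getpid()  # note this PID, then watch its CPU% in 'top' while the job runs
```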

Good luck,

   - Brian
#
On Mar 13, 2012, at 3:05 PM, Paul Gilbert wrote:

            
You can check whether you're leveraging multiple cores simply via system.time:
user  system elapsed 
  6.860   0.020   0.584 

The above is clearly using a threaded BLAS (here I'm using ATLAS), because the elapsed time is much smaller than the CPU time, so it was computed in parallel. In contrast, this is what you get using the single-threaded R BLAS on the same machine:
user  system elapsed 
 10.480   0.020  10.505 

It takes about 18x longer - this is a combination of the number of cores and the less optimized BLAS - and the elapsed time is greater than or equal to the CPU time, i.e. single-threaded.
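(The expressions that produced the timings above aren't shown in the thread; a minimal check along the same lines, with an illustrative matrix size, would be:)

```r
## Illustrative sketch -- matrix size chosen arbitrarily.
n <- 2000
X <- matrix(rnorm(n * n), n, n)
## With a threaded BLAS, 'elapsed' comes out well below 'user';
## with a single-threaded BLAS, elapsed >= user.
system.time(X %*% t(X))
```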

As for recording the underlying configuration - that is not really possible in general; you have to know what you enabled/compiled. In the case of a shared BLAS implementation you may be able to infer it from the library name, but for a static BLAS it is close to impossible to figure out.

Cheers,
Simon
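(A note from hindsight: R versions much newer than the 2.14 discussed in this thread can report the linked BLAS/LAPACK from within the session, which answers much of Paul's question. Availability varies by R version:)

```r
## Not available on the R 2.14 of this thread -- modern R only:
sessionInfo()     # recent versions print the BLAS and LAPACK library paths
extSoftVersion()  # versions of linked external software (recent versions
                  # include a BLAS entry)
```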
#
:-) Yes, I do change it in practice, because I have steps where I use
explicit parallelization via multicore or snow and I switch between the
3 different parallel computation types. Our server has 2 hex-core CPUs
but only 8 GB RAM. The spectroscopic data analysis I do usually isn't
really hard computationally, but the data sets are often uncomfortably
large for the server. With explicit parallelization, RAM often restricts
me to 2 or 3 threads.

Here's what I observe and why I switch back and forth:

If the calculation is implicitly parallel with the optimized BLAS,
that's the way to go: easiest on RAM, fast, no coding effort whatsoever.
Just lean back and enjoy seeing all cores hard at work.
There are functions like %*% and (t)crossprod that use all 12 cores (or
whatever I restrict GOTO_NUM_THREADS to).

Other functions, e.g. loess(), never seem to use more than the 6 cores
of one CPU. For these, I'm better off with explicit parallelization with
2 snow nodes and GOTO_NUM_THREADS = 6 (I have to execute taskset on each
node). However, snow (and multicore) need more RAM, as the data must be
loaded in each node. That would mean e.g. GOTO_NUM_THREADS = 11 (to
leave an "alibi core" for my colleague) in the main R session, and e.g.
2 nodes with GOTO_NUM_THREADS = 6 or 3 nodes with GOTO_NUM_THREADS = 4.

Multicore doesn't make use of the implicit parallelization of the BLAS.
But it is easier to use than snow: no cluster setup required, no hassle
with exporting all the variables, etc.
So, if the function doesn't have any implicit parallelization anyway, I
just change lapply to mclapply, and that's it.
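(Claudia's one-line change, sketched with a toy task; the function and sizes are purely illustrative:)

```r
library(multicore)  # on R >= 2.14 this also lives in 'parallel' as parallel::mclapply

## Some embarrassingly parallel work with no implicit BLAS parallelism:
f <- function(i) mean(replicate(100, sd(rnorm(1000))))

res_serial   <- lapply(1:8, f)    # one core
res_parallel <- mclapply(1:8, f)  # forked workers; same result structure
```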

Best,

Claudia
#
On 12-03-13 09:59 PM, Simon Urbanek wrote:
Perhaps I am misreading something. I don't see elapsed < CPU, so it does 
not seem quite as obvious as you suggest, but I certainly see the 
difference with the single-threaded case below.
you may be able to infer that from the library name, but for static BLAS 
it is close to impossible to figure it out.

I was afraid this would be the case. It is often hard to keep track even 
when I'm compiling R myself, and I guess if you don't compile yourself 
there is not much hope of knowing what you really have.
(Food for thought when considering timing comparisons.)

Thanks,
Paul
#
On 12-03-14 10:47 AM, Claudia Beleites wrote:
How does this work?  I can imagine problems where I could use 
Sys.setenv() within an R function to speed up different parts of a 
calculation in different ways, but if goto is reading an environment 
variable every time it does a calculation, that would slow it down a 
whole lot.

Thanks,
Paul
#
On Mar 14, 2012, at 12:53 PM, Paul Gilbert wrote:

            
0.584 < 6.86
It is separate from R (at least as long as you have shared BLAS enabled, which is the default for most distributions) -- so it's really about what you point your BLAS to.

But, yes, timing comparisons are pretty meaningless unless you specify everything you have (this is how some can post benchmarks against strawman installations and claim to be faster although there is in fact no difference).

Cheers,
Simon
#
On 12-03-14 01:24 PM, Simon Urbanek wrote:
Once again I was reading system.time the wrong way. I should know by 
now. Thanks,
Paul
#
Claudia,
On Mar 14, 2012, at 10:47 AM, Claudia Beleites wrote:

            
Snow does but not multicore - the benefit of multicore is that all data at the point of parallelization is shared and thus it doesn't use extra memory (at least on modern OSes that support COW fork). The only extra RAM will be whatever is allocated later for the computation that is run in parallel.
Actually, it does:
user  system elapsed 
 10.136   0.568   0.664 

However, you really want to control the interplay of the explicit and implicit parallelization. This is where the parallel package comes into play (and why it includes multicore) so that for the explicit + R-implicit parallelization (not BLAS, though) we can control the maximal load (and RNG).
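(A sketch of what Simon describes: the parallel package bundles multicore's mclapply and adds a reproducible parallel RNG and control over the number of workers. The task and counts here are illustrative:)

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")  # RNG stream designed for parallel use
set.seed(42)

## Cap the explicit parallelism so it doesn't fight a threaded BLAS:
res <- mclapply(1:4, function(i) mean(rnorm(1e5)), mc.cores = 2)
```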

Cheers,
Simon
#
Simon and Paul,

seems I have trouble with some part of the configuration on the server:
I'm no longer able to change the number of threads for the gotoblas; it
always stays at 6 (which is fortunately quite a sensible number).
So, before believing what I wrote yesterday, please try it yourself.
Yes, you are right: unlike snow, multicore does not need copies of the
same data.

However, in practice, the stuff I parallelize explicitly is often
bootstrap or similar calculations, so I do need more RAM because each
thread uses its own resampled data set. Which of course is not

I get
User      System verstrichen
     13.751       2.570       4.527
and see 6 cores working.

with multicore:
[1] 12

First try: mc.cores = 2, as 2 x 6 = 12:
Timing stopped at: 123.457 266.559 195.029


without mc.cores, in case that screwed up something:
Timing stopped at: 2569.413 5758.595 2075.161
I see 4 cores working at 100%


I do have the problem that I always need to execute
system(sprintf('taskset -p 0xffffffff %d', Sys.getpid()))
at the beginning of the R session. With snow, I execute that on the
nodes as well, but with multicore I don't know how to do that.

So probably the configuration is really messed up...
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=de_DE.UTF-8       LC_NUMERIC=C               LC_TIME=de_DE.UTF-8        LC_COLLATE=de_DE.UTF-8
 [5] LC_MONETARY=de_DE.UTF-8    LC_MESSAGES=de_DE.UTF-8    LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] multicore_0.1-7

loaded via a namespace (and not attached):
[1] tools_2.14.1


Best,

Claudia