Why is pure computation time in parallel longer than the serial version?
On Sat, 22 Feb 2014, beleites,claudia wrote:
Hi Xuening,

2 physical vs. 2 physical * 2 logical threads: see e.g. here: http://unix.stackexchange.com/a/88290

You say you have 2 *physical* cores. That's the number you want to use for the parallel execution. Logical cores are just 2 (or more) threads running on the same physical core. IIRC, this can speed things up mainly if the 2 threads run very different operations.
Yes, this is my experience - I turn off Intel hyperthreads in BIOS to prevent software getting confused. BLAS sees available compute resources, so your BLAS may be installed to see 4 cores, but doesn't know that two are hyperthreads and compete for physical resources. It may be that by limiting BLAS to 2, it gets privileged access to the two real cores, and other OS (or other) tasks running at the same time use the hyperthreads. Roger
I think this does not help the BLAS, because there you have massive amounts of the *same* operations, which can easily be parallelized. Instead, you get overhead and maybe caching/scheduling "conflicts". I think if you want to profit from the 2 phys. * 2 log. thread architecture, you'd need to optimize the compilation to be aware of this. But even then I'd not expect too much here: 2 threads on the physical cores probably don't leave much room for other calculations to be done "meanwhile".

All in all, I think it is just the same behaviour you see when scheduling more threads than cores in general (e.g. on a machine that has 1 logical core per physical core).

HTH,

Claudia

--
Claudia Beleites, Chemist
Spectroscopy/Imaging
Leibniz Institute of Photonic Technology
Albert-Einstein-Str. 9
07745 Jena
Germany
email: claudia.beleites at ipht-jena.de
phone: +49 3641 206-133
fax: +49 2641 206-399
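As a quick check of the physical-vs-logical distinction Claudia describes, base R's parallel package can report core counts. A minimal sketch; note that detectCores(logical = FALSE) is only honoured on some platforms (e.g. Windows and macOS) and may return the logical count or NA elsewhere, so on Linux you may need to consult lscpu instead:

```r
library(parallel)

# Logical CPUs (includes hyperthreads) -- this is what a multi-threaded
# BLAS typically "sees" by default
logical_cores <- detectCores(logical = TRUE)

# Physical cores -- honoured only on some platforms; elsewhere this may
# return the logical count or NA (on Linux, check `lscpu` instead)
physical_cores <- detectCores(logical = FALSE)

cat("logical:", logical_cores, " physical:", physical_cores, "\n")
```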
________________________________________
From: r-sig-hpc-bounces at r-project.org [r-sig-hpc-bounces at r-project.org] on behalf of Xuening Zhu [puddingnnn529 at gmail.com]
Sent: Saturday, 22 February 2014 11:30
To: Roger Bivand
Cc: r-sig-hpc at r-project.org
Subject: Re: [R-sig-hpc] Why is pure computation time in parallel longer than the serial version?
Roger,
Many thanks! I've done some further experiments to exploit something like OpenBLAS in parallel.
My CPU is *Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz*. There are 2 physical cores plus 2 additional logical cores. The memory size is 8 GB, and my operating system is Ubuntu 12.04.
I choose a 10^3 * 10^4 matrix and want to measure the time of its multiplication (t(m) %*% m). I don't consider tcrossprod() because I just want to make the computation longer. Maybe more cases can be compared later.
R (3.0.2) is re-compiled with OpenBLAS. The inline function used to change the number of threads is defined as below:
require(inline)
openblas.set.num.threads <- cfunction(
    signature(ipt = "integer"),
    body = 'openblas_set_num_threads(*ipt);',
    otherdefs = c('extern void openblas_set_num_threads(int);'),
    libargs = c('-L/home/pudding/OpenBLAS/ -lopenblas'),
    language = "C",
    convention = ".C")
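For readers without a compiler set up, a similar effect can be had without inline C. This is a sketch assuming the CRAN package RhpcBLASctl is installed; it wraps the thread-control entry points of several multi-threaded BLAS implementations, including OpenBLAS:

```r
# Sketch -- assumes the RhpcBLASctl package is available (not part of base R)
if (requireNamespace("RhpcBLASctl", quietly = TRUE)) {
  RhpcBLASctl::blas_set_num_threads(2)                    # cap BLAS at 2 threads
  cat("BLAS procs:", RhpcBLASctl::blas_get_num_procs(), "\n")
}
```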
##################################################################################
First I only compare open-blas with default R BLAS in the experiment:
(1) multiplication with the default R BLAS:
mat = matrix(1:1e7, ncol = 1e4)
system.time(t(mat) %*% mat)
   user  system elapsed
 84.517   0.320  85.090
(2) open-blas with 2 threads specified:
openblas.set.num.threads(2)
$ipt
[1] 2
system.time(t(mat)%*%mat)
user system elapsed
10.164 0.512 5.549
(3) open-blas with 4 threads specified:
openblas.set.num.threads(4)
$ipt
[1] 4
system.time(t(mat)%*%mat)
user system elapsed
26.954 1.556 8.147
It is a little strange that 4 threads are even slower than 2 threads!
##################################################################
Then I want to mix multicore with OpenBLAS. I try to change the implicit parallelism of the matrix multiplication into an explicit version, so I just split the data into several partitions. Things become very weird here.
(1) First I specify the number of threads to be 1 in open-blas, and 2 cores
are used in mclapply:
openblas.set.num.threads(1)
$ipt
[1] 1
system.time({
    group = sample(rep(1:8, length.out = ncol(mat)))
    mm = lapply(split(seq(ncol(mat)), group), function(i) mat[, i])
    #mcaffinity(1:8)
    #system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
    #cores = detectCores()
    a = mclapply(mm, function(m) {
        cat('Running!!\n')
        t(m) %*% m
        #tcrossprod(m)
    }, mc.cores = 2)
    b = Reduce("+", a)
})
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
user system elapsed
0.352 0.168 1.363
(2) Then I change the number of cores in mclapply (in the parallel package) from 2 to 4. The time is even longer, as below. (Sometimes it even gives me a segfault, but unfortunately I have no way to reproduce it.)
system.time({
    group = sample(rep(1:8, length.out = ncol(mat)))
    mm = lapply(split(seq(ncol(mat)), group), function(i) mat[, i])
    #mcaffinity(1:8)
    #system(sprintf("taskset -p 0xffffffff %d", Sys.getpid()))
    #cores = detectCores()
    a = mclapply(mm, function(m) {
        cat('Running!!\n')
        t(m) %*% m
        #tcrossprod(m)
    }, mc.cores = 4)
    b = Reduce("+", a)
})
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
Running!!
user system elapsed
0.400 0.148 1.597
(3) When I change the number of OpenBLAS threads from 1 to some number > 1, I can't get any results back from mclapply at all. I guess there are some conflicts between mclapply and OpenBLAS's multi-threading.
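For reference, here is a minimal self-contained sketch of the explicit-split approach, with toy sizes, BLAS left implicit, and the matrix split by *rows* rather than columns. With a row split the partial cross-products really do sum to t(mat) %*% mat; the column split above only sums correctly for the commented-out tcrossprod() case. Note mclapply forks, so mc.cores > 1 is not supported on Windows:

```r
library(parallel)

set.seed(1)
mat <- matrix(rnorm(200 * 50), nrow = 200)  # small stand-in for the 10^3 x 10^4 matrix

# Split by ROWS: t(mat) %*% mat equals the sum over row blocks of t(m_b) %*% m_b
group <- rep(1:4, length.out = nrow(mat))
mm <- lapply(split(seq_len(nrow(mat)), group),
             function(i) mat[i, , drop = FALSE])

# crossprod(m) computes t(m) %*% m; forked workers each handle one block
a <- mclapply(mm, crossprod, mc.cores = 2)
b <- Reduce(`+`, a)

stopifnot(isTRUE(all.equal(b, crossprod(mat))))
```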
2014-02-18 18:38 GMT+08:00 Roger Bivand <Roger.Bivand at nhh.no>:
wesley goi <wesley at ...> writes:
Hi xuening,
I use multicore's mclapply() function extensively and have recently
changed the BLAS lib to openblas to help with running a PCA on a big
matrix, everything ran fine. However, I was wondering if the openblas
lib will interfere with multicore.
So I guess so far there's no way to assign the threads which OpenBLAS uses, hence it shouldn't be used in a multicore script to be submitted to a cluster, or else it'll consume all the cores?
Please do use the list archives; the thread:
https://stat.ethz.ch/pipermail/r-sig-hpc/2012-April/001339.html
provides much insight into the AFFINITY issue - see also mcaffinity()
in the parallel package. If your BLAS is trying to use all available
cores anyway, and you then try to run in parallel on top of that, your
high-level processes will compete across the available cores for
resources with BLAS, as each BLAS call on each core will try to
spread work across the same set of cores. Please also see:
http://www.jstatsoft.org/v31/i01/
and perhaps also:
http://ideas.repec.org/p/hhs/nhheco/2010_025.html
Neither are new, but are based on trying things out rather than
speculating. As pointed out before, Brian's comment tells you what you
need to know:
https://stat.ethz.ch/pipermail/r-sig-hpc/attachments/20140213/662780c9/attachment.pl
Hope this clarifies,
Roger
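The mcaffinity() function Roger points to can be tried directly. A sketch; affinity control is Linux-only, and on unsupported platforms mcaffinity() simply returns NULL:

```r
library(parallel)

# Report which CPUs the current process is allowed to run on
# (Linux only; returns NULL where affinity control is unsupported)
aff <- mcaffinity()
print(aff)

# On Linux, one could restrict the master process (and hence its forked
# children) to the first two CPUs, e.g. the two physical cores:
# mcaffinity(1:2)
```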
On 18 Feb, 2014, at 11:25 am, Xuening Zhu <puddingnnn529 at ...> wrote:
Hi Wesley:
I installed OpenBLAS before. It went well when I ran serial operations; 2 threads could be seen in 'top'. But I can't change the thread number through the methods it provides.
_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
--
Xuening Zhu
--------------------------------------------------------
Master of Business Statistics
Guanghua School of Management, Peking University
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no