Why pure computation time in parallel is longer than the serial version?
12 messages: Roman Luštrik, Brian G. Peterson, George Ostrouchov, and 2 others
Consider using pbdR. It puts PBLAS and ScaLAPACK at your disposal for Fortran-speed matrix parallelism without the need to learn their APIs. While built for truly big machines, you will already see a lot of benefit on a machine of your size. Start with pbdDEMO to learn the basics. It is batch computing with Rscript (because that's what's done on big machines), but the speed and simplicity are worth it! Cheers, George
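George's suggestion can be sketched as a minimal pbdDMAT batch script. This is only a sketch: it reuses the pbdDMAT calls that appear in the scripts later in this thread (init.grid(), as.ddmatrix(), comm.rank(), comm.print(), finalize()); exact usage may vary by package version.

```r
# Minimal pbdDMAT sketch: distributed matrix product in batch mode.
# Save as, e.g., mm.R and launch with something like:
#   mpirun -np 4 Rscript mm.R
library(pbdDMAT, quiet = TRUE)
init.grid()                          # set up the process grid

# Build the matrix on rank 0 only, then distribute it as a ddmatrix
if (comm.rank() == 0) {
  x <- matrix(rnorm(1e6), ncol = 100)
} else {
  x <- NULL
}
dx <- as.ddmatrix(x, bldim = c(4, 4))

# Distributed product via PBLAS/ScaLAPACK
dres <- t(dx) %*% dx

res <- as.matrix(dres)               # gather the result back to all ranks
comm.print(dim(res))
finalize()
```

Because the script runs once per MPI rank, every rank executes the same code; only rank 0 holds the input data before distribution.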
On 2/13/14 2:32 AM, romunov wrote:
When doing calculations in parallel, there are also overhead costs. If the computation time per core is short, the overhead can exceed the computation time, making the parallel version more expensive overall. Cheers, Roman

On Thu, Feb 13, 2014 at 5:26 AM, Xuening Zhu <puddingnnn529 at gmail.com> wrote:
I am learning about parallel computing in R, and I noticed the following in my experiments. Briefly, in the example below, why are most of the user times in t smaller than those in mc_t? My machine has 32 GB of memory and 2 CPUs with 4 cores (8 hyper-threads) in total. No performance-enhancing tools such as an optimized BLAS are installed.
system.time({t <- lapply(1:4, function(i) {
  m <- matrix(1:10^6, ncol = 100)
  t <- system.time({
    m %*% t(m)
  })
  return(t)
})})

library(multicore)
system.time({
  mc_t <- mclapply(1:4, function(i) {
    m <- matrix(1:10^6, ncol = 100)
    t <- system.time({
      m %*% t(m)
    })
    return(t)
  }, mc.cores = 4)
})
t
[[1]]
   user  system elapsed
 11.136   0.548  11.703

[[2]]
   user  system elapsed
 11.533   0.548  12.098

[[3]]
   user  system elapsed
 11.665   0.432  12.115

[[4]]
   user  system elapsed
 11.580   0.512  12.115
mc_t
[[1]]
   user  system elapsed
 16.677   0.496  17.199

[[2]]
   user  system elapsed
 16.741   0.428  17.198

[[3]]
   user  system elapsed
 16.653   0.520  17.198

[[4]]
   user  system elapsed
 11.056   0.444  11.520
According to my understanding, t and mc_t measure pure computation time. The same thing happens with parLapply from the parallel package. My machine has enough memory for this computation (it uses only a few percent of it). I also tried running four similar Rscript jobs like the one below by hand, at the same time on the same machine, and saved the results. The elapsed time for each of them was about 12 s as well, so I don't think it is contention for cores.
system.time({t <- lapply(1, function(i) {
  m <- matrix(1:10^6, ncol = 100)
  t <- system.time({
    m %*% t(m)
  })
  return(t)
})})
So what happens during the parallel run? Does mc_t really measure pure computation time? Can someone explain the whole process step by step in detail? Thanks.
--
Xuening Zhu
_______________________________________________ R-sig-hpc mailing list R-sig-hpc at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
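Roman's point above about overhead is easy to reproduce with a deliberately tiny per-task workload. A rough sketch (mc.cores and the exact timings are machine-dependent):

```r
library(parallel)  # mclapply, the successor to multicore; not on Windows

tiny <- function(i) i + 1  # trivial task: the computation is essentially free

# With 10,000 very short tasks, the cost of forking workers and shipping
# results back usually outweighs the computation itself, so the "parallel"
# version is often the slower one.
system.time(lapply(1:1e4, tiny))
system.time(mclapply(1:1e4, tiny, mc.cores = 4))
```

The comparison flips in favor of mclapply only once each task is expensive enough to amortize the fork and communication costs.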
Someone politely pointed out to me in private that I meant to say 762 MiB, not 762 GiB. A 10000x10000 matrix is obviously not that big! However, the point still stands. -- Drew Schmidt National Institute for Computational Sciences University of Tennessee, USA http://r-pbd.org/
There are a few things going on here. Most notably, the script you provided compares two completely different operations: t(dx) %*% dx produces a matrix of dimension 10000x10000 (762 GiB), while mat %*% t(mat) produces a matrix of dimension 100x100 (78 KiB). Of course the second one will be faster. You also include the data generation within the parallel timing but not the serial one, which doesn't fairly compare the computation time (especially for such small data, where hundredths of a second count). Making only these changes, on my machine the timings with 2 ranks are 0.033 s for the serial operation and 0.153 s for the parallel one.

Also, as noted above, this matrix is actually quite small, only about 8 MiB for the double-precision storage it will be converted to for LAPACK/ScaLAPACK operations. For small matrices, the communication overhead in ScaLAPACK can eat you alive. You can see this by making bldim large enough to encompass the entire matrix; even then, when the "parallel" product is done on one rank, there is some communication overhead. On my machine, again with 2 ranks, the timings in this case are 0.033 s (serial) and 0.043 s (parallel).

You can use pbdDMAT effectively on a small shared-memory machine, but it really begins to shine on larger, distributed platforms (servers, clusters, supercomputers). As a final side note, you can improve the performance of both the serial and parallel operations by using crossprod()/tcrossprod().

-- Drew Schmidt National Institute for Computational Sciences University of Tennessee, USA http://r-pbd.org/

On 02/16/2014 08:16 AM, Xuening Zhu wrote:
Hi George,

I wonder whether pbdR is better suited to multiple machines (a computer cluster) for speeding up matrix computation? Since I have only a single machine, I didn't see much better performance here. I compared the 'parallel' and 'pbdDMAT' packages in the parallel matrix-multiplication experiment below:
############################################
library(parallel)
mat <- matrix(1:1e6, ncol = 1e4)
group <- sample(rep(1:4, length.out = ncol(mat)))
mm <- lapply(split(seq(ncol(mat)), group), function(i) mat[, i])
system.time({
  a <- mclapply(mm, function(m) {
    m %*% t(m)
  }, mc.cores = 2)
  b <- Reduce("+", a)
})
##############################################
library(pbdDMAT, quiet = TRUE)
init.grid()
tt <- system.time({
  # ScaLAPACK blocking dimension
  bldim <- c(4, 4)
  # Generate data on process 0, then distribute to the others
  if (comm.rank() == 0) {
    mat <- matrix(1:1e6, ncol = 1e4)
  } else {
    mat <- NULL
  }
  dx <- as.ddmatrix(x = mat, bldim = bldim)
  # Computation in parallel
  ddx <- t(dx) %*% dx
})
mm <- as.matrix(ddx)
if (comm.rank() == 0) {
  print(system.time({
    MM <- mat %*% t(mat)
  }))
  print(all.equal(MM, mm))
}
comm.print(tt)
finalize()
###############################################
The second one takes about 4.561 seconds, while the first one takes only 0.104 seconds.
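Drew's closing suggestion above, crossprod()/tcrossprod(), applies directly to the examples in this thread. A small sketch:

```r
# tcrossprod(m) computes m %*% t(m) without materializing the explicit
# transpose; crossprod(m) likewise computes t(m) %*% m. Both dispatch to
# a single BLAS routine instead of a transpose followed by a product.
m <- matrix(1:10^6, ncol = 100)

all.equal(m %*% t(m), tcrossprod(m))  # same result
system.time(m %*% t(m))               # explicit transpose + product
system.time(tcrossprod(m))            # typically faster, less memory
```

pbdDMAT provides the same functions for distributed ddmatrix objects, so the substitution works in both the serial and the parallel scripts.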