About Multicore: mclapply


Hello friends,
I was trying to parallelize a function using mclapply(), but I find that lapply() executes in less time than mclapply(). Below are the system.time() results for both functions.
library(ShortRead)
library(multicore)
fqFiles <- list.files("./test")
system.time(lapply(fqFiles, function(fqFiles){
  readsFq <- readFastq(dirPath="./test",pattern=fqFiles)
  }))
   user  system elapsed 
  0.399   0.021   0.419 
system.time(mclapply(fqFiles, function(fqFiles){
   readsFq <- readFastq(dirPath="./test",pattern=fqFiles)},mc.cores=3))
   user  system elapsed 
  0.830   0.151   0.261 

Since the ./test directory contains three fastq files, I have used mc.cores = 3.

Here is my mpstat output for mclapply():

04:47:55 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
04:47:56 PM  all   13.86    0.00    1.37    0.00    0.00    0.00    0.00   84.77   1023.23
04:47:56 PM    0   21.21    0.00    2.02    0.00    0.00    0.00    0.00   76.77   1011.11
04:47:56 PM    1   33.00    0.00    2.00    0.00    0.00    0.00    0.00   65.00      9.09
04:47:56 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
04:47:56 PM    3    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      3.03
04:47:56 PM    4    3.03    0.00    2.02    0.00    0.00    0.00    0.00   94.95      0.00
04:47:56 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
04:47:56 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00      0.00
04:47:56 PM    7   53.00    0.00    4.00    0.00    0.00    0.00    0.00   43.00      0.00

Hence, can you please suggest why mclapply() has taken more time than lapply()?

multicore is designed for parallel *computing*, which is not what you are doing here. For serial tasks (like yours) it will always be slower, because it needs to a) spawn processes, b) read the data (serially, since all files are in the same location), c) serialize all the data and send it to the master process, and d) unserialize and concatenate all the data in the master process into a list. If you run lapply, it does only b), which in your case is not the slowest part. Using multicore only makes sense if you actually perform computations (or any parallel task).
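
To illustrate the kind of workload where the overhead in steps a), c) and d) can pay off, here is a minimal sketch in R. It assumes the parallel package (which ships with R and provides mclapply(), formerly in multicore); cpu_task is a hypothetical stand-in for real per-item computation and is not from the original post. Timings will vary by machine.

```r
library(parallel)  # provides mclapply(), formerly in the multicore package

# Hypothetical CPU-bound task: a stand-in for real per-item computation.
cpu_task <- function(i) {
  x <- 0
  for (k in seq_len(2e6)) x <- x + sqrt(k)
  x
}

# Forking is not available on Windows, so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 3L

serial_time   <- system.time(res_serial   <- lapply(1:3, cpu_task))
parallel_time <- system.time(res_parallel <- mclapply(1:3, cpu_task, mc.cores = n_cores))

# Both approaches return the same results; only the timing differs.
stopifnot(identical(res_serial, res_parallel))
cat("serial elapsed:  ", serial_time[["elapsed"]], "\n")
cat("parallel elapsed:", parallel_time[["elapsed"]], "\n")
```

Each worker here returns a single number, so the cost of serializing results back to the master process is negligible compared with the computation itself.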

Cheers,
Simon
Thanking you in anticipation.
Regards,
Prashantha
Prashantha Hebbar Kiradi,

E-mail: prashantha.hebbar at dasmaninstitute.org


_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
On Jan 16, 2012, at 9:02 AM, Prashantha Hebbar wrote:

[snip]
In case it's not clear, the system.time 'elapsed' time shows that 
mclapply *is* faster overall ('wall clock') -- I would have only .261 
seconds to go for coffee, compared to .419 with lapply.
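
A small sketch of how the system.time() fields differ, assuming only base R: 'elapsed' is wall-clock time, while 'user' and 'system' are CPU time. As far as I know, the printed 'user' and 'system' values also include CPU time used by child processes, which would explain why 'user' can exceed 'elapsed' under mclapply.

```r
# Sleeping consumes wall-clock time but almost no CPU time.
timing <- system.time(Sys.sleep(0.3))
print(timing)

elapsed_s <- timing[["elapsed"]]                           # wall-clock seconds
cpu_s     <- timing[["user.self"]] + timing[["sys.self"]]  # CPU seconds in this process

cat("elapsed:", elapsed_s, " cpu:", cpu_s, "\n")
```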

As Simon suggests, a much more common paradigm is to put more work into 
the function evaluated by lapply (e.g., calculating qa()) and then 
return the result of the computation, typically a much smaller 
summary of the bigger data. Even in this case, your computer will need 
enough memory to hold all the fastq data at once; for some 
purposes it will make more sense to use FastqSampler and FastqStreamer 
to iterate over your file.
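
The "compute in the worker, return a small summary" pattern might look like the sketch below. To stay self-contained it uses only base R and the parallel package on toy files written to a temporary directory; summarise_fastq and the file names are hypothetical. In the original setting, the worker body would instead call readFastq() and something like qa().

```r
library(parallel)  # provides mclapply(), formerly in the multicore package

# Write three tiny FASTQ-like files to a temporary directory (toy data only).
test_dir <- file.path(tempdir(), "fastq_sketch")
dir.create(test_dir, showWarnings = FALSE)
for (i in 1:3) {
  reads <- c("ACGTACGT", "ACGT", "ACGTAC")
  records <- unlist(lapply(seq_along(reads), function(j) {
    c(paste0("@read", j), reads[j], "+", strrep("I", nchar(reads[j])))
  }))
  writeLines(records, file.path(test_dir, paste0("sample", i, ".fastq")))
}

# Hypothetical worker: read one file, compute, and return only a small summary.
summarise_fastq <- function(path) {
  lines <- readLines(path)
  seqs <- lines[seq(2, length(lines), by = 4)]  # line 2 of each 4-line record
  c(n_reads = length(seqs), mean_width = mean(nchar(seqs)))
}

fq_paths <- list.files(test_dir, pattern = "\\.fastq$", full.names = TRUE)
n_cores <- if (.Platform$OS.type == "windows") 1L else 3L  # no forking on Windows
summaries <- mclapply(fq_paths, summarise_fastq, mc.cores = n_cores)
names(summaries) <- basename(fq_paths)
print(do.call(rbind, summaries))
```

Only two numbers per file travel back to the master process, so the serialization cost stays small however large the input files are.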

Martin

Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793