
Snow, parApply computational times

Hi Elena --
On 09/28/2012 04:02 AM, Elena Grassi wrote:
the problem is that the amount of 'work' for each calculation is small, 
so that any gain from parallel calculation is offset by the cost of 
sending data back and forth, etc. Try to group the work into a list of 
tasks, and process each element of the task list on a separate processor. 
Here is some data:

     opt <- list(parallel = 8)
     df <- data.frame(a=3, b=1, c=1, d=3, id=seq_len(1000))

We use the 'parallel' package, which comes with base R. On a non-Windows 
machine we'd rather use mclapply() without explicitly making a cluster.

     library(parallel)
     cl <- makeCluster(opt$parallel, type = "PSOCK")

Here's the function we want to apply to each row; I changed the return 
value so that it's a numeric vector rather than a list:

     get_fisher <- function(counts){
       ## 'counts' arrives as a named numeric vector (one row of df)
       mat <- matrix(as.numeric(counts[c("a", "b", "c", "d")]), ncol=2)
       colnames(mat) <- c('1', '2')
       rownames(mat) <- c('f', 'g')
       f <- fisher.test(as.table(mat), alternative="two.sided")
       c(counts[["id"]], f$p.value)
     }
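A quick sanity check of get_fisher on a single row, run outside the cluster; unlist() mimics the named numeric vector that apply() will pass in (the data and function are repeated here so the snippet runs on its own):

```r
## Repeat the example data and per-row function so this snippet runs alone
df <- data.frame(a = 3, b = 1, c = 1, d = 3, id = seq_len(1000))
get_fisher <- function(counts) {
  mat <- matrix(as.numeric(counts[c("a", "b", "c", "d")]), ncol = 2)
  f <- fisher.test(mat, alternative = "two.sided")
  c(counts[["id"]], f$p.value)
}

## unlist() turns row 1 into the named numeric vector apply() would pass
row1 <- unlist(df[1, ])   # a=3, b=1, c=1, d=3, id=1
res <- get_fisher(row1)
res  # id 1, then the two-sided p-value for the 2x2 table (3,1 / 1,3): 34/70
```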

Here we divide the work into a list of tasks:

     idx <- splitIndices(nrow(df), opt$parallel)
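It's easiest to see what splitIndices() produces with small numbers; here a toy illustration (10 rows over 3 workers, using a separate name so the real idx above is untouched):

```r
library(parallel)

## 10 'rows' split across 3 workers; each element is a chunk of row indices
toy <- splitIndices(10, 3)
str(toy)
## the chunks are contiguous, roughly equal-sized, and together cover 1:10
```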

It's worth taking a look at idx -- str(idx) -- a list of length 
opt$parallel, with each element containing a vector of indices 
representing the rows that a cluster node will operate on. Now send the 
elements of idx to the processor nodes, one vector of indices to each 
processor. On the processors, do the work using apply().

     result0 <- parLapply(cl, idx, function(i, df, FUN) {
         apply(df[i,,drop=FALSE], 1, FUN)
     }, df, get_fisher)
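For comparison, the mclapply() version mentioned above needs no cluster: forked workers share the parent's memory, so df and get_fisher don't have to be shipped anywhere, and the result0 it produces has the same shape. A standalone sketch (data and function repeated so it runs alone; the Windows fallback line is my addition):

```r
library(parallel)

## Same data, per-row function, and task list as above
df <- data.frame(a = 3, b = 1, c = 1, d = 3, id = seq_len(1000))
get_fisher <- function(counts) {
  mat <- matrix(as.numeric(counts[c("a", "b", "c", "d")]), ncol = 2)
  c(counts[["id"]], fisher.test(mat)$p.value)
}
idx <- splitIndices(nrow(df), 2)

## mclapply forks, so nothing needs to be exported; Windows has no fork,
## hence the serial (mc.cores = 1) fallback
cores <- if (.Platform$OS.type == "windows") 1L else 2L
result0 <- mclapply(idx, function(i)
    apply(df[i, , drop = FALSE], 1, get_fisher),
    mc.cores = cores)
```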

We need to knit our tasks back together, which in this case can be done with:

     result <- do.call(rbind, lapply(result0, t))
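Two loose ends worth mentioning: stop the cluster when you're done with it, and it's cheap insurance to compare the parallel result against a plain serial apply(). A standalone sketch of the whole pipeline plus those two steps (same names as above, 2 workers for brevity):

```r
library(parallel)

opt <- list(parallel = 2)
df <- data.frame(a = 3, b = 1, c = 1, d = 3, id = seq_len(1000))
get_fisher <- function(counts) {
  mat <- matrix(as.numeric(counts[c("a", "b", "c", "d")]), ncol = 2)
  c(counts[["id"]], fisher.test(mat)$p.value)
}

cl <- makeCluster(opt$parallel, type = "PSOCK")
idx <- splitIndices(nrow(df), opt$parallel)
result0 <- parLapply(cl, idx, function(i, df, FUN)
    apply(df[i, , drop = FALSE], 1, FUN), df, get_fisher)
result <- do.call(rbind, lapply(result0, t))
stopCluster(cl)  # release the worker processes

## cross-check against the serial computation
serial <- t(apply(df, 1, get_fisher))
ok <- isTRUE(all.equal(unname(result), unname(serial)))
```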

Hope this helps,

Martin