
Snow, parApply computational times

Hi Elena --
On 09/28/2012 04:02 AM, Elena Grassi wrote:
the problem is that the amount of 'work' for each calculation is small, 
so that any gain from parallel calculation is offset by the cost of 
sending data back and forth, etc. Try to group the work into a list of 
tasks, and process each element of the task list on a separate processor. 
Here is some data:

     opt <- list(parallel = 8)
     df <- data.frame(a=3, b=1, c=1, d=3, id=seq_len(1000))

We use the 'parallel' package, which comes with base R. On a non-Windows 
machine we'd rather use mclapply() without explicitly making a cluster.

     library(parallel)
     cl <- makeCluster(opt$parallel, type = "PSOCK")

Here's the function we want to apply to each row; I changed the return 
value so that it's a numeric vector rather than a list:

     get_fisher <- function(counts){
       ## 'counts' arrives as a named numeric vector (one row of df)
       mat <- matrix(as.numeric(counts[c("a", "b", "c", "d")]), ncol=2)
       colnames(mat) <- c('1', '2')
       rownames(mat) <- c('f', 'g')
       f <- fisher.test(as.table(mat), alternative="two.sided")
       c(counts[["id"]], f$p.value)
     }
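A quick sanity check of get_fisher on a single row, run outside the cluster; unlist() mimics the named numeric vector that apply() will pass in (the data and function are repeated here so the snippet runs on its own):

```r
## Repeat the example data and per-row function so this snippet runs alone
df <- data.frame(a = 3, b = 1, c = 1, d = 3, id = seq_len(1000))
get_fisher <- function(counts) {
  mat <- matrix(as.numeric(counts[c("a", "b", "c", "d")]), ncol = 2)
  f <- fisher.test(mat, alternative = "two.sided")
  c(counts[["id"]], f$p.value)
}

## unlist() turns row 1 into the named numeric vector apply() would pass
row1 <- unlist(df[1, ])   # a=3, b=1, c=1, d=3, id=1
res <- get_fisher(row1)
res  # id 1, then the two-sided p-value for the 2x2 table (3,1 / 1,3): 34/70
```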

Here we divide the work into a list of tasks:

     idx <- splitIndices(nrow(df), opt$parallel)
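It's easiest to see what splitIndices() produces with small numbers; here a toy illustration (10 rows over 3 workers, using a separate name so the real idx above is untouched):

```r
library(parallel)

## 10 'rows' split across 3 workers; each element is a chunk of row indices
toy <- splitIndices(10, 3)
str(toy)
## the chunks are contiguous, roughly equal-sized, and together cover 1:10
```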

It's worth taking a look at idx -- str(idx) -- a list of length 
opt$parallel, with each element containing a vector of indices 
representing the rows that a cluster node will operate on. Now send the 
elements of idx to the processor nodes, one vector of indices to each 
processor. On the processors, do the work using apply().

     result0 <- parLapply(cl, idx, function(i, df, FUN) {
         apply(df[i,,drop=FALSE], 1, FUN)
     }, df, get_fisher)
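For comparison, the mclapply() version mentioned above needs no cluster: forked workers share the parent's memory, so df and get_fisher don't have to be shipped anywhere, and the result0 it produces has the same shape. A standalone sketch (data and function repeated so it runs alone; the Windows fallback line is my addition):

```r
library(parallel)

## Same data, per-row function, and task list as above
df <- data.frame(a = 3, b = 1, c = 1, d = 3, id = seq_len(1000))
get_fisher <- function(counts) {
  mat <- matrix(as.numeric(counts[c("a", "b", "c", "d")]), ncol = 2)
  c(counts[["id"]], fisher.test(mat)$p.value)
}
idx <- splitIndices(nrow(df), 2)

## mclapply forks, so nothing needs to be exported; Windows has no fork,
## hence the serial (mc.cores = 1) fallback
cores <- if (.Platform$OS.type == "windows") 1L else 2L
result0 <- mclapply(idx, function(i)
    apply(df[i, , drop = FALSE], 1, get_fisher),
    mc.cores = cores)
```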

We need to knit our tasks back together, which in this case can be done with:

     result <- do.call(rbind, lapply(result0, t))
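Two loose ends worth mentioning: stop the cluster when you're done with it, and it's cheap insurance to compare the parallel result against a plain serial apply(). A standalone sketch of the whole pipeline plus those two steps (same names as above, 2 workers for brevity):

```r
library(parallel)

opt <- list(parallel = 2)
df <- data.frame(a = 3, b = 1, c = 1, d = 3, id = seq_len(1000))
get_fisher <- function(counts) {
  mat <- matrix(as.numeric(counts[c("a", "b", "c", "d")]), ncol = 2)
  c(counts[["id"]], fisher.test(mat)$p.value)
}

cl <- makeCluster(opt$parallel, type = "PSOCK")
idx <- splitIndices(nrow(df), opt$parallel)
result0 <- parLapply(cl, idx, function(i, df, FUN)
    apply(df[i, , drop = FALSE], 1, FUN), df, get_fisher)
result <- do.call(rbind, lapply(result0, t))
stopCluster(cl)  # release the worker processes

## cross-check against the serial computation
serial <- t(apply(df, 1, get_fisher))
ok <- isTRUE(all.equal(unname(result), unname(serial)))
```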

Hope this helps,

Martin