Parallel linear model
Hi Paul --
On 08/22/2012 04:03 PM, Paul Johnson wrote:
Martin: This is a great example and I would like to use it in class, but I think I don't understand the implications of the system.time output you get. I have a question about this below. Would you share your thoughts?

On Wed, Aug 22, 2012 at 4:21 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
On 08/22/2012 12:47 AM, Patrik Waldmann wrote:
Hello,
I wonder if someone has experience with efficient ways of implicit
parallel execution of (repeated) linear models (as in the non-parallel
example below)? Any suggestions on which way to go?
Patrik Waldmann
pval <- numeric(n)
for (i in 1:n) {
    mod <- lm(y ~ x[, i])
    pval[i] <- summary(mod)$coefficients[2, 4]
}
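To make the snippet above runnable, here is a minimal setup with simulated data; the dimensions and the variables n, x, y are my own assumptions, since the original post did not show them.

```r
## Hypothetical setup (not in the original post): simulate data so the
## loop above can be run end to end.
set.seed(1)
n    <- 10L                      # number of candidate predictors
nobs <- 100L                     # observations
x <- matrix(rnorm(nobs * n), nobs, n)
y <- rnorm(nobs)

pval <- numeric(n)
for (i in 1:n) {
    mod <- lm(y ~ x[, i])                       # one simple regression per column
    pval[i] <- summary(mod)$coefficients[2, 4]  # p-value of the slope
}
```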
As a different tack: the design matrix has the same structure across all regressions, so if your data are consistently structured it may pay to re-calculate only the fit. Here's a loosely-tested version that uses a template from a full fit, augmented by the fit of individual columns to the same model; it's pretty fragile, relying on a design matrix where only a single column differs between runs.
looselm <- function(y, xi, tmpl)
{
    x <- cbind(`(Intercept)` = 1, xi = xi)
    z <- lm.fit(x, y)
    tmpl[names(z)] <- z    # overwrite only the fitted components of the template
    tmpl
}
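As a quick sanity check (mine, with simulated data, not from the thread), the template trick should reproduce lm()'s p-value exactly for a given column:

```r
## looselm as above: refit via lm.fit() and splice the result into a
## template lm object so summary() still works on it.
looselm <- function(y, xi, tmpl)
{
    x <- cbind(`(Intercept)` = 1, xi = xi)
    z <- lm.fit(x, y)
    tmpl[names(z)] <- z
    tmpl
}

set.seed(1)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- rnorm(100)

tmpl <- lm(y ~ x[, 1])                # full fit once, used as template
p_loose <- summary(looselm(y, x[, 2], tmpl))$coefficients[2, 4]
p_full  <- summary(lm(y ~ x[, 2]))$coefficients[2, 4]
all.equal(p_loose, p_full)            # the two routes should agree
```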
This is used in f2 and f3 below:
f0 <- function(x, y)
    lapply(seq_len(ncol(x)),
           function(i, x, y) summary(lm(y ~ x[,i]))$coefficients[2, 4],
           x, y)

f1 <- function(x, y, mc.cores=8L)    # mclapply() is in the parallel package
    mclapply(seq_len(ncol(x)),
             function(i, x, y) summary(lm(y ~ x[,i]))$coefficients[2, 4],
             x, y, mc.cores=mc.cores)

f2 <- function(x, y) {
    tmpl <- lm(y ~ x[,1])
    lapply(seq_len(ncol(x)),
           function(i, x, y, tmpl) {
               summary(looselm(y, x[,i], tmpl))$coefficients[2, 4]
           }, x, y, tmpl)
}

f3 <- function(x, y, mc.cores=8L) {
    tmpl <- lm(y ~ x[,1])
    mclapply(seq_len(ncol(x)),
             function(i, x, y, tmpl) {
                 summary(looselm(y, x[,i], tmpl))$coefficients[2, 4]
             }, x, y, tmpl, mc.cores=mc.cores)
}
with timings (for 1000 x 1000):

> system.time(ans0 <- f0(x, y))
   user  system elapsed
 23.865   1.160  25.120
> system.time(ans1 <- f1(x, y, 8L))
   user  system elapsed
 31.902   6.705   6.708
> system.time(ans2 <- f2(x, y))
   user  system elapsed
  5.285   0.296   5.596
> system.time(ans3 <- f3(x, y, 8L))
   user  system elapsed
 10.256   4.092   2.322
The ans2 version has user 5.285 and system 0.296, both much better than ans3. Ordinarily I'd focus on the "user" part and conclude that f2 (ordinary lapply) is much faster. However, the "elapsed" value for ans3 is less than half that of ans2. How can elapsed be smaller for ans3? I'm guessing that a larger total amount of work is divided among 8 cores?
Norm's answer made me look first at system.time then at print.proc_time,
discovering that the column labelled 'user' is the sum of the parent and
child process user times, and likewise for 'system' (the times for
parent versus children are available in the system.time() return value).
'elapsed' is wall time.
> t1 <- system.time(ans1 <- f1(x, y, 8L))
> unclass(t1)
user.self  sys.self   elapsed user.child sys.child
    0.040     0.032     4.791     24.777     2.892
> t3 <- system.time(ans3 <- f3(x, y, 8L))
> unclass(t3)
user.self  sys.self   elapsed user.child sys.child
    0.060     0.016     1.190     14.009     1.852
(times are clearly varying between replicates, so perhaps rbenchmark is
in order).
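In the same spirit as the rbenchmark suggestion, here is a minimal base-R stand-in (my sketch, not from the thread): run the computation several times and report the spread of elapsed times, since a single system.time() call can be misleading.

```r
## Time a computation several times and summarize the elapsed times.
## Takes a zero-argument function rather than an expression, to keep
## the evaluation semantics simple.
time_reps <- function(f, reps = 5L) {
    elapsed <- vapply(seq_len(reps), function(i)
        system.time(f())[["elapsed"]],
        numeric(1))
    c(min = min(elapsed), median = median(elapsed), max = max(elapsed))
}

tt <- time_reps(function() sum(rnorm(1e5)))
tt    # named vector: min, median, max elapsed seconds
```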
There is additional work associated with forked processes, e.g.,
serializing results to return them to the parent; I'm also not sure who
(if anyone) gets dinged when processes are waiting to talk with one another.
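A rough illustration of that serialization cost (mine, with simulated data): each forked worker must serialize its result back to the parent, so returning a single p-value is far cheaper than returning a whole fitted model.

```r
## Compare the serialized size of a full lm object against a single
## extracted p-value -- the payload each mclapply() worker would return.
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
fit <- lm(y ~ x)

bytes_fit  <- length(serialize(fit, NULL))   # whole fitted model
bytes_pval <- length(serialize(summary(fit)$coefficients[2, 4], NULL))  # one number

c(model = bytes_fit, pvalue = bytes_pval)
```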
When the multicore functionality is called, the "user" and "system" numbers double because _____?
Probably the 2x is a property of this example, reflecting the particular way computation versus communication and other overhead costs accrue. If the computation tasks were bigger, the communication and other overhead costs would become less important.
But the total elapsed time is smaller because a larger amount of compute time is divided among more cores? In a system with many users logged in at the same time, it appears the reasonable thing is to tell them to use lapply, as in ans2, because the aggregate amount of computational power used is about half of the multicore amount in ans3. I mean, if we have a finite amount of computation that can take place, multicore requires twice as much aggregate work. While that one user benefits from a smaller elapsed time, the aggregate amount of work done by the system is doubled. Or am I just thinking of this like a Soviet Communist of the 1950s...
For processing large data the tasks are probably big enough that it makes economic sense to consume as many resources as the system will allow. For exploratory, interactive jobs time is precious and it makes emotional sense to consume as many resources as the system will allow. Even in the best case, though, elapsed time only drops in proportion to 1 / N cores.

Martin
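As a back-of-envelope check (my arithmetic, using the elapsed times quoted earlier in the thread), the observed speed-ups fall well short of the 8x that 8 cores might suggest:

```r
## Elapsed times from the timings above (1000 x 1000 case).
elapsed <- c(f0 = 25.120, f1 = 6.708, f2 = 5.596, f3 = 2.322)

## Speed-up of each variant relative to the naive serial loop f0.
speedup_vs_f0 <- elapsed[["f0"]] / elapsed
round(speedup_vs_f0, 1)

## mclapply over 8 cores vs its serial counterpart:
elapsed[["f0"]] / elapsed[["f1"]]   # f1 vs f0: ~3.7x, not 8x
elapsed[["f2"]] / elapsed[["f3"]]   # f3 vs f2: ~2.4x, not 8x
```

Note that the algorithmic improvement (f0 to f2) and the parallelism (f2 to f3) compound: together they give roughly a 10x reduction in elapsed time.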
pj
Martin
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793