
Parallel linear model

17 messages · Patrik Waldmann, Paul Johnson, Dirk Eddelbuettel +5 more

#
On 08/22/2012 12:47 AM, Patrik Waldmann wrote:
As a different tack: only a single column of the design matrix differs 
across the regressions, so if your data are consistently structured it 
may pay to recalculate only the part of the fit that changes. Here's a 
loosely tested version that takes a template from one full fit and 
overwrites it with the fit of each individual column to the same model:

looselm <- function(y, xi, tmpl)
{
     ## refit with just an intercept and the single predictor xi, then
     ## overwrite the matching components of the template lm object so
     ## that summary() and friends still work on the result
     x <- cbind(`(Intercept)`= 1, xi=xi)
     z <- lm.fit(x, y)
     tmpl[names(z)] <- z
     tmpl
}

This is used in f2 and f3 below:

f0 <- function(x, y)
     lapply(seq_len(ncol(x)),
            function(i, x, y) summary(lm(y~x[,i]))$coefficients[2, 4],
            x, y)

f1 <- function(x, y, mc.cores=8L)
     mclapply(seq_len(ncol(x)),
              function(i, x, y) summary(lm(y~x[,i]))$coefficients[2, 4],
              x, y, mc.cores=mc.cores)

f2 <- function(x, y) {
     tmpl <- lm(y~x[,1])
     lapply(seq_len(ncol(x)),
            function(i, x, y, tmpl)  {
                summary(looselm(y, x[,i], tmpl))$coefficients[2, 4]
            }, x, y, tmpl)
}

f3 <- function(x, y, mc.cores=8) {
     tmpl <- lm(y~x[,1])
     mclapply(seq_len(ncol(x)),
              function(i, x, y, tmpl)  {
                  summary(looselm(y, x[,i], tmpl))$coefficients[2, 4]
              }, x, y, tmpl, mc.cores=mc.cores)
}
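The data setup for the timings is not shown in the thread; a hypothetical setup (the seed is arbitrary, and the dimensions are inferred from the "1000 x 1000" note below) together with a quick sanity check of the template trick might look like:

```r
## looselm, repeated from above so this check is self-contained
looselm <- function(y, xi, tmpl) {
    x <- cbind(`(Intercept)` = 1, xi = xi)
    z <- lm.fit(x, y)
    tmpl[names(z)] <- z
    tmpl
}

## hypothetical data; dimensions inferred from the "1000 x 1000" note
set.seed(123)
x <- matrix(rnorm(1000 * 1000), 1000, 1000)
y <- rnorm(1000)

## the template trick reproduces the p-value from a full lm() fit
tmpl   <- lm(y ~ x[, 1])
p_full <- summary(lm(y ~ x[, 2]))$coefficients[2, 4]
p_fast <- summary(looselm(y, x[, 2], tmpl))$coefficients[2, 4]
stopifnot(all.equal(p_full, p_fast))
```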

with timings (for 1000 x 1000)

 > system.time(ans0 <- f0(x, y))
    user  system elapsed
  23.865   1.160  25.120
 > system.time(ans1 <- f1(x, y, 8L))
    user  system elapsed
  31.902   6.705   6.708
 > system.time(ans2 <- f2(x, y))
    user  system elapsed
   5.285   0.296   5.596
 > system.time(ans3 <- f3(x, y, 8L))
    user  system elapsed
  10.256   4.092   2.322

and

 > identical(ans0, ans1)
[1] TRUE
 > identical(ans0, ans2)
[1] TRUE
 > identical(ans0, ans3)
[1] TRUE

Presumably the full summary() machinery is also not required. Likely 
there are significant additional short-cuts.
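For example, the slope p-value can be computed directly from the lm.fit() output, skipping the summary() machinery entirely; a rough sketch (not benchmarked in this thread):

```r
## a possible shortcut: slope p-value straight from lm.fit()
pval_direct <- function(y, xi) {
    X   <- cbind(1, xi)
    fit <- lm.fit(X, y)
    rdf <- fit$df.residual
    s2  <- sum(fit$residuals^2) / rdf     # residual variance
    XtXinv <- chol2inv(qr.R(fit$qr))      # (X'X)^{-1} via the QR
    tstat  <- fit$coefficients[[2]] / sqrt(s2 * XtXinv[2, 2])
    2 * pt(abs(tstat), rdf, lower.tail = FALSE)
}

## agrees with summary(lm(...))$coefficients[2, 4]
set.seed(42)
xi <- rnorm(100); y <- rnorm(100)
stopifnot(all.equal(pval_direct(y, xi),
                    summary(lm(y ~ xi))$coefficients[2, 4]))
```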

Martin

#
Martin:

This is a great example and I would like to use it in class, but I
don't think I understand the implications of the system.time output
you get. I have a question about this below; would you share your
thoughts?
On Wed, Aug 22, 2012 at 4:21 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
The ans2 version has user 5.2 and system 0.29, which are much better than ans3.

Ordinarily, I'd focus on the "user" part, and I'd think f2 (ordinary
lapply) is much faster.

However, the "elapsed" value for ans3 is half of ans2. How can elapsed
be smaller for ans3? I'm guessing that a larger amount of work is
divided among 8 cores?

When the multicore functionality is called, the "user" and "system"
numbers double because _____?

But the total time elapsed is smaller because a larger amount of
compute time is divided among more cores?

In a system with many users logged in at the same time, it appears the
reasonable thing is to tell them to use lapply, as in ans2, because
the aggregate amount of computational power used is one-half of the
multicore amount in ans3.  I mean, if we have a finite amount of
computation that can take place, the multicore requires twice as much
aggregate work.  While that one user benefits from smaller elapsed
time, the aggregate of the system's amount of work is doubled.

Or am I just thinking of this like a Soviet Communist of the 1950s....

pj

#
On Wed, Aug 22, 2012 at 06:03:36PM -0500, Paul Johnson wrote:

Paul is bringing up a very important point here.

There are various OS dependencies that can really change things.  A
notable example is that if one calls something like mclapply(), the time
actually spent by the child R processes probably will NOT be counted in
the User time.  The latter will likely just measure how much time the
parent process spent in parceling out the work to the children, and in
collecting together the results.

You have the same problem on a cluster, where the worker processes set
up by clusterApply() or whatever aren't counted.

You could, on the other hand, have the opposite problem in some OSes,
where one gets the SUM of the times of the children.

Using Elapsed time might be a little crude, but generally good enough.

Norm
#
Hi Paul --
On 08/22/2012 04:03 PM, Paul Johnson wrote:
it's pretty fragile, relying on a design matrix where only a single 
column differs between runs.
Norm's answer made me look first at system.time then at print.proc_time, 
discovering that the column labelled 'user' is the sum of the parent and 
child process user times, and likewise for 'system' (the times for 
parent versus children are available in the system.time() return value). 
'elapsed' is wall time.

 > t1 = system.time(ans1 <- f1(x, y, 8L))
 > unclass(t1)
  user.self   sys.self    elapsed user.child  sys.child
      0.040      0.032      4.791     24.777      2.892
 > t3 <- system.time(ans3 <- f3(x, y, 8L))
 > unclass(t3)
  user.self   sys.self    elapsed user.child  sys.child
      0.060      0.016      1.190     14.009      1.852

(times are clearly varying between replicates, so perhaps rbenchmark is 
in order).

There is additional work associated with forked processes, e.g., 
serializing results to return them to the parent; I'm also not sure who 
(if anyone) gets dinged when processes are waiting to talk with one another.
Probably the 2x is a property of this example, reflecting the particular 
way computation versus communication and other overhead costs accrue. If 
the computation tasks were bigger, the communication and other overhead 
costs would become less important.
For processing large data the tasks are probably big enough that it 
makes economic sense to consume as many resources as the system will 
allow. For exploratory, interactive jobs time is precious and it makes 
emotional sense to consume as many resources as the system will allow. 
Even in the best case, though, elapsed time only scales as 1/N (an N-fold speed-up at most).

Martin

#
The difference between user and elapsed is old hat. Here is a great
example (and IIRC first shown here by Simon) with no compute time:

   R> system.time(mclapply(1:8, function(x) Sys.sleep(1)))   ## 2 cores by default
      user  system elapsed 
     0.000   0.012   4.014 
   R> system.time(mclapply(1:8, function(x) Sys.sleep(1), mc.cores=8))
      user  system elapsed 
     0.012   0.020   1.039 
   R> 

so elapsed time is effectively the one second a Sys.sleep(1) takes, plus
overhead, if we allow for all eight (hyperthreaded) cores here.  By Brian
Ripley's choice a default of two is baked-in, so clueless users only get a
small gain.  "user time" is roughly the actual system load _summed over all
processes / threads_.

With that, could I ask any of the participants in the thread to re-try with a
proper benchmarking package such as rbenchmark or microbenchmark?  Either one
beats the socks off system.time:

   R> library(rbenchmark)
   R> benchmark( mclapply(1:8, function(x) Sys.sleep(1)), mclapply(1:8, function(x) Sys.sleep(1), mc.cores=8), replications=1)
                                                      test replications elapsed relative user.self sys.self user.child sys.child
   1               mclapply(1:8, function(x) Sys.sleep(1))            1   4.013  3.89612     0.000    0.008      0.000     0.004
   2 mclapply(1:8, function(x) Sys.sleep(1), mc.cores = 8)            1   1.030  1.00000     0.004    0.008      0.004     0.000
   R> 

and

   R> library(microbenchmark)
   R> microbenchmark( mclapply(1:8, function(x) Sys.sleep(1)), mclapply(1:8, function(x) Sys.sleep(1), mc.cores=8), times=1)
   Unit: seconds
                                                      expr     min      lq  median      uq     max
   1               mclapply(1:8, function(x) Sys.sleep(1)) 4.01377 4.01377 4.01377 4.01377 4.01377
   2 mclapply(1:8, function(x) Sys.sleep(1), mc.cores = 8) 1.03457 1.03457 1.03457 1.03457 1.03457
   R> 

(and you normally want to run either with 10 or 100 or ... replications /
times).

Dirk
#
On Aug 22, 2012, at 7:18 PM, Norm Matloff wrote:

That is actually wrong. It is true for snow where the processes are separate, but most systems do account for child user time in mclapply:

# Linux
user  system elapsed 
 27.330   1.468   0.944
user  system elapsed 
  0.736   0.000   0.734 

# OS X
user  system elapsed 
  9.386   0.357   0.876
user  system elapsed 
  0.425   0.004   0.428 

Cheers,
Simon
#
Thanks for the correction.  I must say that I still have my nagging
doubts, though.  Is it true for older systems, say some Linuxes of 4-5
years ago, or BSD?

Norm
On Wed, Aug 22, 2012 at 09:06:45PM -0400, Simon Urbanek wrote:
#
In rereading your posting now, Dirk, I suddenly realized that there is
one aspect of this that I'd forgotten about:  An ordinary call to
system.time() does not display all the information returned by that
function!

That odd statement is of course due to the fact that the print
method for objects of class proc_time displays only 3 of the 5 numbers.
If one actually looks at the 5 numbers individually, you can separate
the time of the parent process from the sum of the child times.  That
separation is apparently what rbenchmark gives you, right?

As I said earlier, the quick-and-dirty way to handle this is to use the
Elapsed time, typically good enough (say on a dedicated machine).  After
all, if we are trying to develop a fast parallel algorithm, what the
potential users of the algorithm care about is essentially the Elapsed
time.

But at the other extreme, a very fine timing goal might be to try to
compute what is called the makespan, which in this case would be the
maximum of all the child times, rather than the sum of the child times.
I say "try," because I don't see any system-level way to accomplish this,
short of inserting calls to something like clock_gettime() inside each
thread.

Norm
On Wed, Aug 22, 2012 at 07:53:02PM -0500, Dirk Eddelbuettel wrote:
#
On 22 August 2012 at 23:22, Norm Matloff wrote:
| 
| In rereading your posting now, Dirk, I suddenly realized that there is
| one aspect of this that I'd forgotten about:  An ordinary call to
| system.time() does not display all the information returned by that
| function!
| 
| That odd statement is of course due to the fact that the print
| method for objects of class proc_time displays only 3 of the 5 numbers.
| If one actually looks at the 5 numbers individually, you can separate
| the time of the parent process from the sum of the child times.  That
| separation is apparently what rbenchmark gives you, right?
| 
| As I said earlier, the quick-and-dirty way to handle this is to use the
| Elapsed time, typically good enough (say on a dedicated machine).  After
| all, if we are trying to develop a fast parallel algorithm, what the
| potential users of the algorithm care about is essentially the Elapsed
| time.

That seems fair in most cases.

| But at the other extreme, a very fine timing goal might be to try to
| compute what is called the makespan, which in this case would be the
| maximum of all the child times, rather than the sum of the child times.
| I say "try," because I don't see any systems way to accomplish this,
| short of inserting calls to something like clock_gettime() inside each
| thread.

Maybe you could look at what microbenchmark does [ as it covers all the
OS-level dirty work ] and see if it generalizes to multiple machines?

Dirk
 
| Norm
|
| On Wed, Aug 22, 2012 at 07:53:02PM -0500, Dirk Eddelbuettel wrote:
#
On 23 August 2012 at 12:12, Patrik Waldmann wrote:
| Here's a comparison on Windows based on 8 cores (excluding foreach):
| > y<-rnorm(1000)
| > x<-matrix(rnorm(1000*10000),ncol=10000)
| > dimx<-dim(x)
| > library(rbenchmark)
| > benchmark(pval<-apply(x,2, function(x)summary(lm(y~x))$coeff[2,4]),
| replications=1)
|                                                              test replications elapsed relative user.self sys.self user.child sys.child
| 1 pval <- apply(x, 2, function(x) summary(lm(y ~ x))$coeff[2, 4])            1   25.16        1     20.46     2.17         NA        NA
|  
| > library(parallel)
| > cores<-detectCores()
| > cl <- makeCluster(cores, methods=FALSE)
| > clusterExport(cl,"y")
| > benchmark(pval<-parApply(cl, x,2, function(x)summary(lm(y~x))$coeff[2,4]),
| replications=1)
|                                                                      test replications elapsed relative user.self sys.self user.child sys.child
| 1 pval <- parApply(cl, x, 2, function(x) summary(lm(y ~ x))$coeff[2, 4])            1    5.52        1      0.74     0.28         NA        NA
|  
| > stopCluster(cl)
|  
| # More fair
|  
| > benchmark({cores<-detectCores()
| + cl <- makeCluster(cores, methods=FALSE)
| + clusterExport(cl,"y")
| + pval<-parApply(cl, x,2, function(x)summary(lm(y~x))$coeff[2,4])},
| replications=1)
|   test replications elapsed relative user.self sys.self user.child sys.child
| 1    {            1    7.11        1      0.65     0.37         NA        NA
| Warning messages:
| 1: closing unused connection 10 (<-patwa-PC:10187)
| 2: closing unused connection 9 (<-patwa-PC:10187)
| 3: closing unused connection 8 (<-patwa-PC:10187)
| 4: closing unused connection 7 (<-patwa-PC:10187)
| 5: closing unused connection 6 (<-patwa-PC:10187)
| 6: closing unused connection 5 (<-patwa-PC:10187)
| 7: closing unused connection 4 (<-patwa-PC:10187)
| 8: closing unused connection 3 (<-patwa-PC:10187)
| > stopCluster(cl)
|  
| What do these warnings refer to?

I prefer doing HPC on Linux --- but on Windows, as I recall, this stems from
the socket connections needed to make things go.

Dirk
  
| Patrik
| 
| >>> Dirk Eddelbuettel <edd at debian.org> 23/08/2012 02:53 >>>
#
Thanks for pointing me to microbenchmark, Dirk.  I had not been aware of
it, and it definitely looks useful.  Among other things, it uses the
clock_gettime() function I mentioned yesterday, so this now would give
me a convenient wrapper for it when I'm working at the R level.

However, I should elaborate on something I said yesterday.  Recall that
on the one hand I felt that Elapsed time from system.time() is usually
good enough, but on the other hand I said that if one needs really fine
timing, one needs to see the times of the individual threads.  I
mentioned makespan but really should have also mentioned the issue of
load balancing.

Even though the end user arguably cares mainly about Elapsed time,
during development of a parallel algorithm one needs to identify sources
of slowdown.  That of course leads to investigating load imbalance
(among other things).  

So in finely-detailed timing experiments, one needs to know the
individual thread times.  Unfortunately, microbenchmark doesn't seem to
provide this.

Of course, the user on his own could insert code to call, say,
proc.time() twice within each thread, and return the difference to the
parent, which could then separately report the times.
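A minimal sketch of that manual approach (the timed() wrapper is a hypothetical helper, not part of parallel; mclapply needs a fork-capable OS, so the sketch falls back to one core on Windows):

```r
library(parallel)

## wrap the worker function so each task times itself and returns the
## elapsed seconds along with its result; the makespan is then the
## maximum of the per-task times
timed <- function(f) {
    function(...) {
        t0  <- proc.time()[["elapsed"]]
        val <- f(...)
        list(value = val, secs = proc.time()[["elapsed"]] - t0)
    }
}

cores    <- if (.Platform$OS.type == "windows") 1L else 2L
res      <- mclapply(1:4, timed(function(i) { Sys.sleep(0.2); i^2 }),
                     mc.cores = cores)
values   <- vapply(res, `[[`, numeric(1), "value")
makespan <- max(vapply(res, `[[`, numeric(1), "secs"))
```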

But it would be nice to automate this.  It would seem fairly easy to
incorporate such timing (optional to the caller) within the various
snow functions, and probably for something like mclapply() as well.

Norm
On Thu, Aug 23, 2012 at 08:14:50AM -0500, Dirk Eddelbuettel wrote:
#
If you're on Linux on bare metal, you can do all sorts of nifty
micro-benchmarking via Sysstat, Oprofile and Blktrace. It's been a few
years since I did any of this, but I probably still have some
hacked-up scripts on Github for Oprofile (CPU bottlenecks), Blktrace
(disk bottlenecks) and Sysstat (telling which one you have).
On Thu, Aug 23, 2012 at 11:31 PM, Norm Matloff <matloff at cs.ucdavis.edu> wrote:

#
On 08/24/2012 07:28 AM, Patrik Waldmann wrote:
the easiest approach is to write the function passed to *apply so that 
it accepts all necessary arguments, e.g.,

   fun = function(x, fix) summary(lm(y~x+fix))$coeff[2,4]
   parApply(cl, x, 2, fun, fix)

Martin

#
Try
clusterExport(cl,c("y","fix"))

Hao
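Combining Martin's suggestion (pass `fix` as an argument) with this export of the free variable `y`, a self-contained sketch (hypothetical data; 2 PSOCK workers):

```r
library(parallel)

## hypothetical data for illustration
set.seed(1)
y   <- rnorm(100)
x   <- matrix(rnorm(100 * 20), 100, 20)
fix <- rnorm(100)          # covariate held fixed across all regressions

## `fix` is passed as an argument; `y` is a free variable in `fun`,
## so it must be exported to the workers
fun <- function(x, fix) summary(lm(y ~ x + fix))$coeff[2, 4]

cl <- makeCluster(2)
clusterExport(cl, "y")
pval <- parApply(cl, x, 2, fun, fix)
stopCluster(cl)
```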
Patrik Waldmann wrote: