Prev 166029 / 398502 Next

the first and last observation for each subject

Kingsford Jones

Mon, Jan 5, 2009 2:20 PM

whoops -- I left the group size unchanged so k became greather than
the length of the group vector.  When I increase the size to 1e7,
sapply is faster until it gets to k = 1e6.

warning:  this takes awhile (particularly on my machine which seems to
be using just 1 of it's 2 cpus)

+   group<-sample(1:k, size=1e7, replace=TRUE)
+   x<-rnorm(length(group))*10 + group
+   cat('number of groups:', k, '\n')
+   cat('sapply\n')
+   print(s <- unix.time(sapply(split(x,group), median)))
+   cat('gm\n')
+   print(g <- unix.time(-gm(x,group)))
+   cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
+  }
number of groups: 10
sapply
   user  system elapsed
   1.18    0.38    1.56
gm
   user  system elapsed
  53.43    0.70   55.05
 sapply/gm user ratio: 0.02208497


number of groups: 100
sapply
   user  system elapsed
   1.17    0.23    1.42
gm
   user  system elapsed
  49.89    0.61   51.60
 sapply/gm user ratio: 0.02345159


number of groups: 1000
sapply
   user  system elapsed
   1.29    0.25    1.72
gm
   user  system elapsed
  45.23    0.34   46.55
 sapply/gm user ratio: 0.02852089


number of groups: 10000
sapply
   user  system elapsed
   2.80    0.09    2.87
gm
   user  system elapsed
  41.04    0.55   42.85
 sapply/gm user ratio: 0.06822612


number of groups: 1e+05
sapply
   user  system elapsed
  14.65    0.16   15.18
gm
   user  system elapsed
  38.28    0.65   39.55
 sapply/gm user ratio: 0.3827064


number of groups: 1e+06
sapply
   user  system elapsed
 126.63    0.42  129.21
gm
   user  system elapsed
  37.56    0.84   38.78
 sapply/gm user ratio: 3.371406

On Mon, Jan 5, 2009 at 2:37 PM, Kingsford Jones

<kingsfordjones at gmail.com> wrote:

Here's some more timing's of Bill's function.  Although in this
example sapply has a clear performance advantage for smaller numbers
of groups (k) , gm is substantially faster for k >> 1000:

 gm <- function(x, group){ # medians by group:
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
 }

 ##

 for(k in 10^(1:6)){
 group<-sample(1:k, size=100000, replace=TRUE)
 x<-rnorm(length(group))*10 + group
 cat('number of groups:', k, '\n')
 cat('sapply\n')
 print(s <- unix.time(sapply(split(x,group), median)))
 cat('gm\n')
 print(g <- unix.time(-gm(x,group)))
 cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
 }

number of groups: 10
sapply
  user  system elapsed
  0.01    0.00    0.01
gm
  user  system elapsed
  0.14    0.00    0.16
 sapply/gm user ratio: 0.07142857


number of groups: 100
sapply
  user  system elapsed
  0.02    0.00    0.02
gm
  user  system elapsed
  0.11    0.00    0.09
 sapply/gm user ratio: 0.1818182


number of groups: 1000
sapply
  user  system elapsed
  0.11    0.00    0.11
gm
  user  system elapsed
  0.11    0.00    0.11
 sapply/gm user ratio: 1


number of groups: 10000
sapply
  user  system elapsed
  1.00    0.00    1.04
gm
  user  system elapsed
  0.10    0.00    0.09
 sapply/gm user ratio: 10


number of groups: 1e+05
sapply
  user  system elapsed
  6.24    0.01    6.31
gm
  user  system elapsed
  0.16    0.00    0.17
 sapply/gm user ratio: 39


number of groups: 1e+06
sapply
  user  system elapsed
  9.00    0.03    8.92
gm
  user  system elapsed
  0.15    0.00    0.20
 sapply/gm user ratio: 60

R.Version()

$platform
[1] "i386-pc-mingw32"

$arch
[1] "i386"

$os
[1] "mingw32"

$system
[1] "i386, mingw32"

$status
[1] ""

$major
[1] "2"

$minor
[1] "8.0"

$year
[1] "2008"

$month
[1] "10"

$day
[1] "20"

$`svn rev`
[1] "46754"

$language
[1] "R"

$version.string
[1] "R version 2.8.0 (2008-10-20)"


Kingsford Jones



On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap <wdunlap at tibco.com> wrote:

Arg, the 'sapply(...)' in the function was in the initial
comment,
  gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
but someone's mailer put a newline before the sapply
  gm <- function(x, group){ # medians by group:
  sapply(split(x,group),median)
so it got executed, wasting all that time.  (Of course the
same mailer will mess up this message in the same way -
just delete the sapply() call from gm().)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-----Original Message-----
From: hadley wickham [mailto:h.wickham at gmail.com]
Sent: Monday, January 05, 2009 9:10 AM
To: William Dunlap
Cc: gallon.li at gmail.com; R help
Subject: Re: [R] the first and last observation for each subject

Another application of that technique can be used to quickly compute
medians by groups:

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
}

For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:

group<-sample(1:30000, size=100000, replace=TRUE)
x<-rnorm(length(group))*10 + group
unix.time(z0<-sapply(split(x,group), median))

  user  system elapsed
  2.72    0.00    3.20

unix.time(z1<-gm(x,group))

  user  system elapsed
  0.12    0.00    0.16

identical(z1,z0)

[1] TRUE

I get:

unix.time(z0<-sapply(split(x,group), median))

   user  system elapsed
  2.733   0.017   2.766

unix.time(z1<-gm(x,group))

   user  system elapsed
  2.897   0.032   2.946


Hadley


--
http://had.co.nz/

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Thread (9 messages)

William Dunlap the first and last observation for each subject Jan 4 Hadley Wickham the first and last observation for each subject Jan 4 Hadley Wickham the first and last observation for each subject Jan 4 William Dunlap the first and last observation for each subject Jan 5 Hadley Wickham the first and last observation for each subject Jan 5 William Dunlap the first and last observation for each subject Jan 5 Kingsford Jones the first and last observation for each subject Jan 5 Kingsford Jones the first and last observation for each subject Jan 5 William Dunlap the first and last observation for each subject Jan 6