Skip to content

the first and last observation for each subject

9 messages · William Dunlap, Hadley Wickham, Kingsford Jones

#
wrote:
time.
observed y
subject
The above is much quicker than the versions based on aggregate and
easy to understand.  Another approach is more specialized but useful
when you have lots of ID's (e.g., millions) and speed is very important.
It computes where the first and last entry for each ID in a vectorized
computation, akin to the computation that rle() uses:

f0 <- 
function(DF){
   changes <- DF$ID[-1] != DF$ID[-length(DF$ID)]
   first <- c(TRUE, changes)
   last <- c(changes, TRUE)
   ydiff <- DF$y[last] - DF$y[first]
   DF <- DF[first,]
   DF$y <- ydiff
   DF
}


Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
#
plyr does make some optimisations to increase speed and decrease
memory usage (mainly by passing around lists of indices, rather than
lists of the original objects) but it's unlikely ever to approach the
speed of a pure vector approach (although I hope to put some time into
rewriting the slow parts in C to do better with performance).
I particularly this solution to the problem - it's a very handy
technique, and while it takes a while to get your head around how it
works, it's worthwhile spending the time to do so because it crops up
as a useful solution to many similar types of problems. (It can be
particularly useful in excel too, as a quick way of locating
boundaries between groups)

Hadley
#
Oops, that should be particularly like, of course!
Hadley
#
Another application of that technique can be used to quickly compute
medians by groups:

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
   o<-order(group, x)
   group <- group[o]
   x <- x[o]
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))
   last <- which(c(changes, TRUE))
   lowerMedian <- x[floor((first+last)/2)]
   upperMedian <- x[ceiling((first+last)/2)]
   median <- (lowerMedian+upperMedian)/2
   names(median) <- group[first]
   median
} 

For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:
user  system elapsed 
   2.72    0.00    3.20
user  system elapsed 
   0.12    0.00    0.16
[1] TRUE

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
#
I get:
user  system elapsed
  2.733   0.017   2.766
user  system elapsed
  2.897   0.032   2.946


Hadley
#
Arg, the 'sapply(...)' in the function was in the initial
comment,
   gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
but someone's mailer put a newline before the sapply 
   gm <- function(x, group){ # medians by group:
   sapply(split(x,group),median)
so it got executed, wasting all that time.  (Of course the
same mailer will mess up this message in the same way -
just delete the sapply() call from gm().)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
#
Here's some more timing's of Bill's function.  Although in this
example sapply has a clear performance advantage for smaller numbers
of groups (k) , gm is substantially faster for k >> 1000:

 gm <- function(x, group){ # medians by group:
   o<-order(group, x)
   group <- group[o]
   x <- x[o]
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))
   last <- which(c(changes, TRUE))
   lowerMedian <- x[floor((first+last)/2)]
   upperMedian <- x[ceiling((first+last)/2)]
   median <- (lowerMedian+upperMedian)/2
   names(median) <- group[first]
   median
 }

 ##

 for(k in 10^(1:6)){
  group<-sample(1:k, size=100000, replace=TRUE)
  x<-rnorm(length(group))*10 + group
  cat('number of groups:', k, '\n')
  cat('sapply\n')
  print(s <- unix.time(sapply(split(x,group), median)))
  cat('gm\n')
  print(g <- unix.time(-gm(x,group)))
  cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
 }

number of groups: 10
sapply
   user  system elapsed
   0.01    0.00    0.01
gm
   user  system elapsed
   0.14    0.00    0.16
 sapply/gm user ratio: 0.07142857


number of groups: 100
sapply
   user  system elapsed
   0.02    0.00    0.02
gm
   user  system elapsed
   0.11    0.00    0.09
 sapply/gm user ratio: 0.1818182


number of groups: 1000
sapply
   user  system elapsed
   0.11    0.00    0.11
gm
   user  system elapsed
   0.11    0.00    0.11
 sapply/gm user ratio: 1


number of groups: 10000
sapply
   user  system elapsed
   1.00    0.00    1.04
gm
   user  system elapsed
   0.10    0.00    0.09
 sapply/gm user ratio: 10


number of groups: 1e+05
sapply
   user  system elapsed
   6.24    0.01    6.31
gm
   user  system elapsed
   0.16    0.00    0.17
 sapply/gm user ratio: 39


number of groups: 1e+06
sapply
   user  system elapsed
   9.00    0.03    8.92
gm
   user  system elapsed
   0.15    0.00    0.20
 sapply/gm user ratio: 60
$platform
[1] "i386-pc-mingw32"

$arch
[1] "i386"

$os
[1] "mingw32"

$system
[1] "i386, mingw32"

$status
[1] ""

$major
[1] "2"

$minor
[1] "8.0"

$year
[1] "2008"

$month
[1] "10"

$day
[1] "20"

$`svn rev`
[1] "46754"

$language
[1] "R"

$version.string
[1] "R version 2.8.0 (2008-10-20)"


Kingsford Jones
On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap <wdunlap at tibco.com> wrote:
#
whoops -- I left the group size unchanged so k became greather than
the length of the group vector.  When I increase the size to 1e7,
sapply is faster until it gets to k = 1e6.

warning:  this takes awhile (particularly on my machine which seems to
be using just 1 of it's 2 cpus)
+   group<-sample(1:k, size=1e7, replace=TRUE)
+   x<-rnorm(length(group))*10 + group
+   cat('number of groups:', k, '\n')
+   cat('sapply\n')
+   print(s <- unix.time(sapply(split(x,group), median)))
+   cat('gm\n')
+   print(g <- unix.time(-gm(x,group)))
+   cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
+  }
number of groups: 10
sapply
   user  system elapsed
   1.18    0.38    1.56
gm
   user  system elapsed
  53.43    0.70   55.05
 sapply/gm user ratio: 0.02208497


number of groups: 100
sapply
   user  system elapsed
   1.17    0.23    1.42
gm
   user  system elapsed
  49.89    0.61   51.60
 sapply/gm user ratio: 0.02345159


number of groups: 1000
sapply
   user  system elapsed
   1.29    0.25    1.72
gm
   user  system elapsed
  45.23    0.34   46.55
 sapply/gm user ratio: 0.02852089


number of groups: 10000
sapply
   user  system elapsed
   2.80    0.09    2.87
gm
   user  system elapsed
  41.04    0.55   42.85
 sapply/gm user ratio: 0.06822612


number of groups: 1e+05
sapply
   user  system elapsed
  14.65    0.16   15.18
gm
   user  system elapsed
  38.28    0.65   39.55
 sapply/gm user ratio: 0.3827064


number of groups: 1e+06
sapply
   user  system elapsed
 126.63    0.42  129.21
gm
   user  system elapsed
  37.56    0.84   38.78
 sapply/gm user ratio: 3.371406

On Mon, Jan 5, 2009 at 2:37 PM, Kingsford Jones
<kingsfordjones at gmail.com> wrote:
#
Just in case anyone is still interested, here are some
comparisons of the time it says to compute grouped medians
via sapply(split(x,group),median) and gm(x,group), which
uses the trick used by rle() to find the first and last
entries in each group.

Which method is fastest depends on the nature of your data:
the number of groups, k, the length of the data vector, n,
and probably the distribution of group size (which I didn't
look into).

If the overhead involved in calling a function were to go down
we would expect sapply() to look better in the cases where k/n
is close to 1.
   Bill

From: William Dunlap
Sent: Monday, January 05, 2009 4:01 PM
To: 'Kingsford Jones'
Subject: RE: [R] the first and last observation for each subject
Trying things out helps sort out the constants in those 'order of'
assertions.  Here are times for various n and k (both ranging over
4^(3:10), with k<=n) for gm and sapply:
k
n           64  256 1024 4096 16384 65536 262144 1048576
  64      0.00   NA   NA   NA    NA    NA     NA      NA
  256     0.00 0.00   NA   NA    NA    NA     NA      NA
  1024    0.00 0.00 0.00   NA    NA    NA     NA      NA
  4096    0.00 0.00 0.00 0.00    NA    NA     NA      NA
  16384   0.00 0.01 0.02 0.02  0.04    NA     NA      NA
  65536   0.08 0.06 0.07 0.06  0.05  0.09     NA      NA
  262144  0.36 0.36 0.33 0.34  0.28  0.26   0.42      NA
  1048576 3.46 2.83 1.88 1.91  1.96  1.39   1.17     2.2
k
n           64  256 1024 4096 16384 65536 262144 1048576
  64      0.01   NA   NA   NA    NA    NA     NA      NA
  256     0.01 0.03   NA   NA    NA    NA     NA      NA
  1024    0.02 0.03 0.09   NA    NA    NA     NA      NA
  4096    0.00 0.03 0.14 0.41    NA    NA     NA      NA
  16384   0.02 0.05 0.14 0.55  1.61    NA     NA      NA
  65536   0.02 0.03 0.14 0.55  2.22  6.54     NA      NA
  262144  0.03 0.05 0.17 0.60  2.28  8.93  27.44      NA
  1048576 0.27 0.16 0.27 0.71  2.41  9.22  37.56  121.55
k
n            64   256  1024  4096 16384 65536 262144 1048576
  64       TRUE    NA    NA    NA    NA    NA     NA      NA
  256      TRUE  TRUE    NA    NA    NA    NA     NA      NA
  1024     TRUE  TRUE  TRUE    NA    NA    NA     NA      NA
  4096    FALSE  TRUE  TRUE  TRUE    NA    NA     NA      NA
  16384    TRUE  TRUE  TRUE  TRUE  TRUE    NA     NA      NA
  65536   FALSE FALSE  TRUE  TRUE  TRUE  TRUE     NA      NA
  262144  FALSE FALSE FALSE  TRUE  TRUE  TRUE   TRUE      NA
  1048576 FALSE FALSE FALSE FALSE  TRUE  TRUE   TRUE    TRUE
k
n             64    256  1024  4096 16384 65536 262144 1048576
  64       0.000     NA    NA    NA    NA    NA     NA      NA
  256      0.000  0.000    NA    NA    NA    NA     NA      NA
  1024     0.000  0.000 0.000    NA    NA    NA     NA      NA
  4096       NaN  0.000 0.000 0.000    NA    NA     NA      NA
  16384    0.000  0.200 0.143 0.036 0.025    NA     NA      NA
  65536    4.000  2.000 0.500 0.109 0.023 0.014     NA      NA
  262144  12.000  7.200 1.941 0.567 0.123 0.029  0.015      NA
  1048576 12.815 17.688 6.963 2.690 0.813 0.151  0.031   0.018