the first and last observation for each subject

[R] the first and last observation for each subject
hadley wickham h.wickham at gmail.com
Fri Jan 2 14:52:42 CET 2009

On Fri, Jan 2, 2009 at 3:20 AM, gallon li <gallon.li at gmail.com>
wrote:
I have the following data

ID x y time
1  10 20 0
1  10 30 1
1 10 40 2
2 12 23 0
2 12 25 1
2 12 28 2
2 12 38 3
3 5 10 0
3 5 15 2
.....

x is time invariant, ID is the subject id number, y is changing over
time.
I want to find out the difference between the first and last
observed y
value for each subject and get a table like

ID x y
1 10 20
2 12 15
3 5 5
......

Is there any easy way to generate the data set?
One approach is to use the plyr package, as documented at
http://had.co.nz/plyr.  The basic idea is that your problem is easy to
solve if you have a subset for a single subject value:

one <- subset(DF, ID == 1)
with(one, y[length(y)] - y[1])

The difficulty is splitting up the original dataset in to subjects,
applying the solution to each piece and then joining all the results
back together.  This is what the plyr package does for you:

library(plyr)

# ddply is for splitting up data frames and combining the results
# into a data frame.  .(ID) says to split up the data frame by the
subject
# variable
ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
...
The above is much quicker than the versions based on aggregate and
easy to understand.  Another approach is more specialized but useful
when you have lots of ID's (e.g., millions) and speed is very important.
It computes where the first and last entry for each ID in a vectorized
computation, akin to the computation that rle() uses:

f0 <- 
function(DF){
   changes <- DF$ID[-1] != DF$ID[-length(DF$ID)]
   first <- c(TRUE, changes)
   last <- c(changes, TRUE)
   ydiff <- DF$y[last] - DF$y[first]
   DF <- DF[first,]
   DF$y <- ydiff
   DF
}

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
library(plyr)

# ddply is for splitting up data frames and combining the results
# into a data frame.  .(ID) says to split up the data frame by the
subject
# variable
ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
...
The above is much quicker than the versions based on aggregate and
plyr does make some optimisations to increase speed and decrease
memory usage (mainly by passing around lists of indices, rather than
lists of the original objects) but it's unlikely ever to approach the
speed of a pure vector approach (although I hope to put some time into
rewriting the slow parts in C to do better with performance).
easy to understand.  Another approach is more specialized but useful
when you have lots of ID's (e.g., millions) and speed is very important.
It computes where the first and last entry for each ID in a vectorized
computation, akin to the computation that rle() uses:
I particularly this solution to the problem - it's a very handy
technique, and while it takes a while to get your head around how it
works, it's worthwhile spending the time to do so because it crops up
as a useful solution to many similar types of problems. (It can be
particularly useful in excel too, as a quick way of locating
boundaries between groups)

Hadley
http://had.co.nz/
easy to understand.  Another approach is more specialized but useful
when you have lots of ID's (e.g., millions) and speed is very important.
It computes where the first and last entry for each ID in a vectorized
computation, akin to the computation that rle() uses:
I particularly this solution to the problem - it's a very handy
Oops, that should be particularly like, of course!
technique, and while it takes a while to get your head around how it
works, it's worthwhile spending the time to do so because it crops up
as a useful solution to many similar types of problems. (It can be
particularly useful in excel too, as a quick way of locating
boundaries between groups)
Hadley
http://had.co.nz/
-----Original Message-----
From: hadley wickham [mailto:h.wickham at gmail.com] 
Sent: Sunday, January 04, 2009 8:56 PM
To: William Dunlap
Cc: gallon.li at gmail.com; R help
Subject: Re: [R] the first and last observation for each subject

library(plyr)

# ddply is for splitting up data frames and combining the results
# into a data frame.  .(ID) says to split up the data frame by the
subject
# variable
ddply(DF, .(ID), function(one) with(one, y[length(y)] - y[1]))
...
The above is much quicker than the versions based on aggregate and
plyr does make some optimisations to increase speed and decrease
memory usage (mainly by passing around lists of indices, rather than
lists of the original objects) but it's unlikely ever to approach the
speed of a pure vector approach (although I hope to put some time into
rewriting the slow parts in C to do better with performance).

easy to understand.  Another approach is more specialized but useful
when you have lots of ID's (e.g., millions) and speed is 
very important.
It computes where the first and last entry for each ID in a 
vectorized
computation, akin to the computation that rle() uses:
I particularly this solution to the problem - it's a very handy
technique, and while it takes a while to get your head around how it
works, it's worthwhile spending the time to do so because it crops up
as a useful solution to many similar types of problems. (It can be
particularly useful in excel too, as a quick way of locating
boundaries between groups)

Hadley

-- 
http://had.co.nz/
Another application of that technique can be used to quickly compute
medians by groups:

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
   o<-order(group, x)
   group <- group[o]
   x <- x[o]
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))
   last <- which(c(changes, TRUE))
   lowerMedian <- x[floor((first+last)/2)]
   upperMedian <- x[ceiling((first+last)/2)]
   median <- (lowerMedian+upperMedian)/2
   names(median) <- group[first]
   median
} 

For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:
group<-sample(1:30000, size=100000, replace=TRUE)
x<-rnorm(length(group))*10 + group
unix.time(z0<-sapply(split(x,group), median))
user  system elapsed 
   2.72    0.00    3.20
unix.time(z1<-gm(x,group))
user  system elapsed 
   0.12    0.00    0.16
identical(z1,z0)
[1] TRUE

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
Another application of that technique can be used to quickly compute
medians by groups:

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
}

For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:

group<-sample(1:30000, size=100000, replace=TRUE)
x<-rnorm(length(group))*10 + group
unix.time(z0<-sapply(split(x,group), median))
  user  system elapsed
  2.72    0.00    3.20
unix.time(z1<-gm(x,group))
  user  system elapsed
  0.12    0.00    0.16
identical(z1,z0)
[1] TRUE
I get:
unix.time(z0<-sapply(split(x,group), median))
user  system elapsed
  2.733   0.017   2.766
unix.time(z1<-gm(x,group))
user  system elapsed
  2.897   0.032   2.946

Hadley
http://had.co.nz/
Arg, the 'sapply(...)' in the function was in the initial
comment,
   gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
but someone's mailer put a newline before the sapply 
   gm <- function(x, group){ # medians by group:
   sapply(split(x,group),median)
so it got executed, wasting all that time.  (Of course the
same mailer will mess up this message in the same way -
just delete the sapply() call from gm().)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
-----Original Message-----
From: hadley wickham [mailto:h.wickham at gmail.com] 
Sent: Monday, January 05, 2009 9:10 AM
To: William Dunlap
Cc: gallon.li at gmail.com; R help
Subject: Re: [R] the first and last observation for each subject

Another application of that technique can be used to quickly compute
medians by groups:

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
}

For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:

group<-sample(1:30000, size=100000, replace=TRUE)
x<-rnorm(length(group))*10 + group
unix.time(z0<-sapply(split(x,group), median))
  user  system elapsed
  2.72    0.00    3.20
unix.time(z1<-gm(x,group))
  user  system elapsed
  0.12    0.00    0.16
identical(z1,z0)
[1] TRUE
I get:

unix.time(z0<-sapply(split(x,group), median))
   user  system elapsed
  2.733   0.017   2.766
unix.time(z1<-gm(x,group))
   user  system elapsed
  2.897   0.032   2.946

Hadley

-- 
http://had.co.nz/

Here's some more timing's of Bill's function.  Although in this
example sapply has a clear performance advantage for smaller numbers
of groups (k) , gm is substantially faster for k >> 1000:

 gm <- function(x, group){ # medians by group:
   o<-order(group, x)
   group <- group[o]
   x <- x[o]
   changes <- group[-1] != group[-length(group)]
   first <- which(c(TRUE, changes))
   last <- which(c(changes, TRUE))
   lowerMedian <- x[floor((first+last)/2)]
   upperMedian <- x[ceiling((first+last)/2)]
   median <- (lowerMedian+upperMedian)/2
   names(median) <- group[first]
   median
 }

 ##

 for(k in 10^(1:6)){
  group<-sample(1:k, size=100000, replace=TRUE)
  x<-rnorm(length(group))*10 + group
  cat('number of groups:', k, '\n')
  cat('sapply\n')
  print(s <- unix.time(sapply(split(x,group), median)))
  cat('gm\n')
  print(g <- unix.time(-gm(x,group)))
  cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
 }

number of groups: 10
sapply
   user  system elapsed
   0.01    0.00    0.01
gm
   user  system elapsed
   0.14    0.00    0.16
 sapply/gm user ratio: 0.07142857

number of groups: 100
sapply
   user  system elapsed
   0.02    0.00    0.02
gm
   user  system elapsed
   0.11    0.00    0.09
 sapply/gm user ratio: 0.1818182

number of groups: 1000
sapply
   user  system elapsed
   0.11    0.00    0.11
gm
   user  system elapsed
   0.11    0.00    0.11
 sapply/gm user ratio: 1

number of groups: 10000
sapply
   user  system elapsed
   1.00    0.00    1.04
gm
   user  system elapsed
   0.10    0.00    0.09
 sapply/gm user ratio: 10

number of groups: 1e+05
sapply
   user  system elapsed
   6.24    0.01    6.31
gm
   user  system elapsed
   0.16    0.00    0.17
 sapply/gm user ratio: 39

number of groups: 1e+06
sapply
   user  system elapsed
   9.00    0.03    8.92
gm
   user  system elapsed
   0.15    0.00    0.20
 sapply/gm user ratio: 60
R.Version()
$platform
[1] "i386-pc-mingw32"

$arch
[1] "i386"

$os
[1] "mingw32"

$system
[1] "i386, mingw32"

$status
[1] ""

$major
[1] "2"

$minor
[1] "8.0"

$year
[1] "2008"

$month
[1] "10"

$day
[1] "20"

$`svn rev`
[1] "46754"

$language
[1] "R"

$version.string
[1] "R version 2.8.0 (2008-10-20)"

Kingsford Jones
Arg, the 'sapply(...)' in the function was in the initial
comment,
  gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
but someone's mailer put a newline before the sapply
  gm <- function(x, group){ # medians by group:
  sapply(split(x,group),median)
so it got executed, wasting all that time.  (Of course the
same mailer will mess up this message in the same way -
just delete the sapply() call from gm().)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-----Original Message-----
From: hadley wickham [mailto:h.wickham at gmail.com]
Sent: Monday, January 05, 2009 9:10 AM
To: William Dunlap
Cc: gallon.li at gmail.com; R help
Subject: Re: [R] the first and last observation for each subject

Another application of that technique can be used to quickly compute
medians by groups:

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
}

For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:

group<-sample(1:30000, size=100000, replace=TRUE)
x<-rnorm(length(group))*10 + group
unix.time(z0<-sapply(split(x,group), median))
  user  system elapsed
  2.72    0.00    3.20
unix.time(z1<-gm(x,group))
  user  system elapsed
  0.12    0.00    0.16
identical(z1,z0)
[1] TRUE
I get:

unix.time(z0<-sapply(split(x,group), median))
   user  system elapsed
  2.733   0.017   2.766
unix.time(z1<-gm(x,group))
   user  system elapsed
  2.897   0.032   2.946

Hadley

--
http://had.co.nz/

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

whoops -- I left the group size unchanged so k became greather than
the length of the group vector.  When I increase the size to 1e7,
sapply is faster until it gets to k = 1e6.

warning:  this takes awhile (particularly on my machine which seems to
be using just 1 of it's 2 cpus)
 for(k in 10^(1:6)){
+   group<-sample(1:k, size=1e7, replace=TRUE)
+   x<-rnorm(length(group))*10 + group
+   cat('number of groups:', k, '\n')
+   cat('sapply\n')
+   print(s <- unix.time(sapply(split(x,group), median)))
+   cat('gm\n')
+   print(g <- unix.time(-gm(x,group)))
+   cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
+  }
number of groups: 10
sapply
   user  system elapsed
   1.18    0.38    1.56
gm
   user  system elapsed
  53.43    0.70   55.05
 sapply/gm user ratio: 0.02208497

number of groups: 100
sapply
   user  system elapsed
   1.17    0.23    1.42
gm
   user  system elapsed
  49.89    0.61   51.60
 sapply/gm user ratio: 0.02345159

number of groups: 1000
sapply
   user  system elapsed
   1.29    0.25    1.72
gm
   user  system elapsed
  45.23    0.34   46.55
 sapply/gm user ratio: 0.02852089

number of groups: 10000
sapply
   user  system elapsed
   2.80    0.09    2.87
gm
   user  system elapsed
  41.04    0.55   42.85
 sapply/gm user ratio: 0.06822612

number of groups: 1e+05
sapply
   user  system elapsed
  14.65    0.16   15.18
gm
   user  system elapsed
  38.28    0.65   39.55
 sapply/gm user ratio: 0.3827064

number of groups: 1e+06
sapply
   user  system elapsed
 126.63    0.42  129.21
gm
   user  system elapsed
  37.56    0.84   38.78
 sapply/gm user ratio: 3.371406

On Mon, Jan 5, 2009 at 2:37 PM, Kingsford Jones
Here's some more timing's of Bill's function.  Although in this
example sapply has a clear performance advantage for smaller numbers
of groups (k) , gm is substantially faster for k >> 1000:

 gm <- function(x, group){ # medians by group:
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
 }

 ##

 for(k in 10^(1:6)){
 group<-sample(1:k, size=100000, replace=TRUE)
 x<-rnorm(length(group))*10 + group
 cat('number of groups:', k, '\n')
 cat('sapply\n')
 print(s <- unix.time(sapply(split(x,group), median)))
 cat('gm\n')
 print(g <- unix.time(-gm(x,group)))
 cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
 }

number of groups: 10
sapply
  user  system elapsed
  0.01    0.00    0.01
gm
  user  system elapsed
  0.14    0.00    0.16
 sapply/gm user ratio: 0.07142857

number of groups: 100
sapply
  user  system elapsed
  0.02    0.00    0.02
gm
  user  system elapsed
  0.11    0.00    0.09
 sapply/gm user ratio: 0.1818182

number of groups: 1000
sapply
  user  system elapsed
  0.11    0.00    0.11
gm
  user  system elapsed
  0.11    0.00    0.11
 sapply/gm user ratio: 1

number of groups: 10000
sapply
  user  system elapsed
  1.00    0.00    1.04
gm
  user  system elapsed
  0.10    0.00    0.09
 sapply/gm user ratio: 10

number of groups: 1e+05
sapply
  user  system elapsed
  6.24    0.01    6.31
gm
  user  system elapsed
  0.16    0.00    0.17
 sapply/gm user ratio: 39

number of groups: 1e+06
sapply
  user  system elapsed
  9.00    0.03    8.92
gm
  user  system elapsed
  0.15    0.00    0.20
 sapply/gm user ratio: 60

R.Version()
$platform
[1] "i386-pc-mingw32"

$arch
[1] "i386"

$os
[1] "mingw32"

$system
[1] "i386, mingw32"

$status
[1] ""

$major
[1] "2"

$minor
[1] "8.0"

$year
[1] "2008"

$month
[1] "10"

$day
[1] "20"

$`svn rev`
[1] "46754"

$language
[1] "R"

$version.string
[1] "R version 2.8.0 (2008-10-20)"

Kingsford Jones

On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap <wdunlap at tibco.com> wrote:
Arg, the 'sapply(...)' in the function was in the initial
comment,
  gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
but someone's mailer put a newline before the sapply
  gm <- function(x, group){ # medians by group:
  sapply(split(x,group),median)
so it got executed, wasting all that time.  (Of course the
same mailer will mess up this message in the same way -
just delete the sapply() call from gm().)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-----Original Message-----
From: hadley wickham [mailto:h.wickham at gmail.com]
Sent: Monday, January 05, 2009 9:10 AM
To: William Dunlap
Cc: gallon.li at gmail.com; R help
Subject: Re: [R] the first and last observation for each subject

Another application of that technique can be used to quickly compute
medians by groups:

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
}

For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
(in random order) the times are:

group<-sample(1:30000, size=100000, replace=TRUE)
x<-rnorm(length(group))*10 + group
unix.time(z0<-sapply(split(x,group), median))
  user  system elapsed
  2.72    0.00    3.20
unix.time(z1<-gm(x,group))
  user  system elapsed
  0.12    0.00    0.16
identical(z1,z0)
[1] TRUE
I get:

unix.time(z0<-sapply(split(x,group), median))
   user  system elapsed
  2.733   0.017   2.766
unix.time(z1<-gm(x,group))
   user  system elapsed
  2.897   0.032   2.946

Hadley

--
http://had.co.nz/

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Just in case anyone is still interested, here are some
comparisons of the time it says to compute grouped medians
via sapply(split(x,group),median) and gm(x,group), which
uses the trick used by rle() to find the first and last
entries in each group.

Which method is fastest depends on the nature of your data:
the number of groups, k, the length of the data vector, n,
and probably the distribution of group size (which I didn't
look into).

If the overhead involved in calling a function were to go down
we would expect sapply() to look better in the cases where k/n
is close to 1.
   Bill

From: William Dunlap
Sent: Monday, January 05, 2009 4:01 PM
To: 'Kingsford Jones'
Subject: RE: [R] the first and last observation for each subject
-----Original Message-----
From: Kingsford Jones [mailto:kingsfordjones at gmail.com] 
Sent: Monday, January 05, 2009 3:19 PM
To: William Dunlap
Subject: Re: [R] the first and last observation for each subject

Thanks for the explanation -- it's much more satisfying to have some
theory rather than 'just trying things out'...

Kingsford
Trying things out helps sort out the constants in those 'order of'
assertions.  Here are times for various n and k (both ranging over
4^(3:10), with k<=n) for gm and sapply:
time.gm
k
n           64  256 1024 4096 16384 65536 262144 1048576
  64      0.00   NA   NA   NA    NA    NA     NA      NA
  256     0.00 0.00   NA   NA    NA    NA     NA      NA
  1024    0.00 0.00 0.00   NA    NA    NA     NA      NA
  4096    0.00 0.00 0.00 0.00    NA    NA     NA      NA
  16384   0.00 0.01 0.02 0.02  0.04    NA     NA      NA
  65536   0.08 0.06 0.07 0.06  0.05  0.09     NA      NA
  262144  0.36 0.36 0.33 0.34  0.28  0.26   0.42      NA
  1048576 3.46 2.83 1.88 1.91  1.96  1.39   1.17     2.2
time.sapply
k
n           64  256 1024 4096 16384 65536 262144 1048576
  64      0.01   NA   NA   NA    NA    NA     NA      NA
  256     0.01 0.03   NA   NA    NA    NA     NA      NA
  1024    0.02 0.03 0.09   NA    NA    NA     NA      NA
  4096    0.00 0.03 0.14 0.41    NA    NA     NA      NA
  16384   0.02 0.05 0.14 0.55  1.61    NA     NA      NA
  65536   0.02 0.03 0.14 0.55  2.22  6.54     NA      NA
  262144  0.03 0.05 0.17 0.60  2.28  8.93  27.44      NA
  1048576 0.27 0.16 0.27 0.71  2.41  9.22  37.56  121.55
time.gm<time.sapply
k
n            64   256  1024  4096 16384 65536 262144 1048576
  64       TRUE    NA    NA    NA    NA    NA     NA      NA
  256      TRUE  TRUE    NA    NA    NA    NA     NA      NA
  1024     TRUE  TRUE  TRUE    NA    NA    NA     NA      NA
  4096    FALSE  TRUE  TRUE  TRUE    NA    NA     NA      NA
  16384    TRUE  TRUE  TRUE  TRUE  TRUE    NA     NA      NA
  65536   FALSE FALSE  TRUE  TRUE  TRUE  TRUE     NA      NA
  262144  FALSE FALSE FALSE  TRUE  TRUE  TRUE   TRUE      NA
  1048576 FALSE FALSE FALSE FALSE  TRUE  TRUE   TRUE    TRUE
round(time.gm/time.sapply,3)
k
n             64    256  1024  4096 16384 65536 262144 1048576
  64       0.000     NA    NA    NA    NA    NA     NA      NA
  256      0.000  0.000    NA    NA    NA    NA     NA      NA
  1024     0.000  0.000 0.000    NA    NA    NA     NA      NA
  4096       NaN  0.000 0.000 0.000    NA    NA     NA      NA
  16384    0.000  0.200 0.143 0.036 0.025    NA     NA      NA
  65536    4.000  2.000 0.500 0.109 0.023 0.014     NA      NA
  262144  12.000  7.200 1.941 0.567 0.123 0.029  0.015      NA
  1048576 12.815 17.688 6.963 2.690 0.813 0.151  0.031   0.018
On Mon, Jan 5, 2009 at 4:00 PM, William Dunlap 
<wdunlap at tibco.com> wrote:
Note that gm fully sorts the whole vector, taking order n*log(n)
time, then does an order n subscripting operation, then
some order k stuff.

The sapply approach does k calls to median, each of which
takes order n/k time to partially sort its short input (assuming
all groups are about the same size), resulting in order n total
sorting time (I think the partial sort is order n, if not it is
pretty close to that for n up to 2^25).
It also involves some other order n and order k time operations.

Thus for a fixed k, as n grows the sapply approach ought to
do better, but for a fixed n, as k grows gm ought to do better.
gm() only does better in some cases (when k/n is close to 1)
because it avoids the function call overhead involved in calling
median() many times.

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-----Original Message-----
From: Kingsford Jones [mailto:kingsfordjones at gmail.com]
Sent: Monday, January 05, 2009 2:20 PM
To: William Dunlap
Cc: hadley wickham; R help
Subject: Re: [R] the first and last observation for each subject

whoops -- I left the group size unchanged so k became greather than
the length of the group vector.  When I increase the size to 1e7,
sapply is faster until it gets to k = 1e6.

warning:  this takes awhile (particularly on my machine 
which seems to
be using just 1 of it's 2 cpus)

 for(k in 10^(1:6)){
+   group<-sample(1:k, size=1e7, replace=TRUE)
+   x<-rnorm(length(group))*10 + group
+   cat('number of groups:', k, '\n')
+   cat('sapply\n')
+   print(s <- unix.time(sapply(split(x,group), median)))
+   cat('gm\n')
+   print(g <- unix.time(-gm(x,group)))
+   cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
+  }
number of groups: 10
sapply
   user  system elapsed
   1.18    0.38    1.56
gm
   user  system elapsed
  53.43    0.70   55.05
 sapply/gm user ratio: 0.02208497

number of groups: 100
sapply
   user  system elapsed
   1.17    0.23    1.42
gm
   user  system elapsed
  49.89    0.61   51.60
 sapply/gm user ratio: 0.02345159

number of groups: 1000
sapply
   user  system elapsed
   1.29    0.25    1.72
gm
   user  system elapsed
  45.23    0.34   46.55
 sapply/gm user ratio: 0.02852089

number of groups: 10000
sapply
   user  system elapsed
   2.80    0.09    2.87
gm
   user  system elapsed
  41.04    0.55   42.85
 sapply/gm user ratio: 0.06822612

number of groups: 1e+05
sapply
   user  system elapsed
  14.65    0.16   15.18
gm
   user  system elapsed
  38.28    0.65   39.55
 sapply/gm user ratio: 0.3827064

number of groups: 1e+06
sapply
   user  system elapsed
 126.63    0.42  129.21
gm
   user  system elapsed
  37.56    0.84   38.78
 sapply/gm user ratio: 3.371406

On Mon, Jan 5, 2009 at 2:37 PM, Kingsford Jones
<kingsfordjones at gmail.com> wrote:
Here's some more timing's of Bill's function.  Although in this
example sapply has a clear performance advantage for 
smaller numbers
of groups (k) , gm is substantially faster for k >> 1000:

 gm <- function(x, group){ # medians by group:
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
 }

 ##

 for(k in 10^(1:6)){
 group<-sample(1:k, size=100000, replace=TRUE)
 x<-rnorm(length(group))*10 + group
 cat('number of groups:', k, '\n')
 cat('sapply\n')
 print(s <- unix.time(sapply(split(x,group), median)))
 cat('gm\n')
 print(g <- unix.time(-gm(x,group)))
 cat(' sapply/gm user ratio:', s[1]/g[1], '\n\n\n')
 }

number of groups: 10
sapply
  user  system elapsed
  0.01    0.00    0.01
gm
  user  system elapsed
  0.14    0.00    0.16
 sapply/gm user ratio: 0.07142857

number of groups: 100
sapply
  user  system elapsed
  0.02    0.00    0.02
gm
  user  system elapsed
  0.11    0.00    0.09
 sapply/gm user ratio: 0.1818182

number of groups: 1000
sapply
  user  system elapsed
  0.11    0.00    0.11
gm
  user  system elapsed
  0.11    0.00    0.11
 sapply/gm user ratio: 1

number of groups: 10000
sapply
  user  system elapsed
  1.00    0.00    1.04
gm
  user  system elapsed
  0.10    0.00    0.09
 sapply/gm user ratio: 10

number of groups: 1e+05
sapply
  user  system elapsed
  6.24    0.01    6.31
gm
  user  system elapsed
  0.16    0.00    0.17
 sapply/gm user ratio: 39

number of groups: 1e+06
sapply
  user  system elapsed
  9.00    0.03    8.92
gm
  user  system elapsed
  0.15    0.00    0.20
 sapply/gm user ratio: 60

R.Version()
$platform
[1] "i386-pc-mingw32"

$arch
[1] "i386"

$os
[1] "mingw32"

$system
[1] "i386, mingw32"

$status
[1] ""

$major
[1] "2"

$minor
[1] "8.0"

$year
[1] "2008"

$month
[1] "10"

$day
[1] "20"

$`svn rev`
[1] "46754"

$language
[1] "R"

$version.string
[1] "R version 2.8.0 (2008-10-20)"

Kingsford Jones

On Mon, Jan 5, 2009 at 10:18 AM, William Dunlap
<wdunlap at tibco.com> wrote:
Arg, the 'sapply(...)' in the function was in the initial
comment,
  gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
but someone's mailer put a newline before the sapply
  gm <- function(x, group){ # medians by group:
  sapply(split(x,group),median)
so it got executed, wasting all that time.  (Of course the
same mailer will mess up this message in the same way -
just delete the sapply() call from gm().)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

-----Original Message-----
From: hadley wickham [mailto:h.wickham at gmail.com]
Sent: Monday, January 05, 2009 9:10 AM
To: William Dunlap
Cc: gallon.li at gmail.com; R help
Subject: Re: [R] the first and last observation for 
each subject

Another application of that technique can be used to
quickly compute
medians by groups:

gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
  o<-order(group, x)
  group <- group[o]
  x <- x[o]
  changes <- group[-1] != group[-length(group)]
  first <- which(c(TRUE, changes))
  last <- which(c(changes, TRUE))
  lowerMedian <- x[floor((first+last)/2)]
  upperMedian <- x[ceiling((first+last)/2)]
  median <- (lowerMedian+upperMedian)/2
  names(median) <- group[first]
  median
}

For a 10^5 long x and a somewhat fewer than 3*10^4
distinct groups
(in random order) the times are:

group<-sample(1:30000, size=100000, replace=TRUE)
x<-rnorm(length(group))*10 + group
unix.time(z0<-sapply(split(x,group), median))
  user  system elapsed
  2.72    0.00    3.20
unix.time(z1<-gm(x,group))
  user  system elapsed
  0.12    0.00    0.16
identical(z1,z0)
[1] TRUE
I get:

unix.time(z0<-sapply(split(x,group), median))
   user  system elapsed
  2.733   0.017   2.766
unix.time(z1<-gm(x,group))
   user  system elapsed
  2.897   0.032   2.946

Hadley

--
http://had.co.nz/

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, 
reproducible code.