Back to formatted view
Raw Message

Message-ID: <77EB52C6DD32BA4D87471DCD70C8D7005FE3BA@NA-PA-VBE03.na.tibco.com>
Date: 2009-01-05T17:18:31Z
From: William Dunlap
Subject: the first and last observation for each subject
In-Reply-To: <f8e6ff050901050910w7dee1260i3dcaaf59f4bb6f75@mail.gmail.com>

Arg, the 'sapply(...)' in the function was in the initial
comment,
   gm <- function(x, group){ # medians by group:
sapply(split(x,group),median)
but someone's mailer put a newline before the sapply 
   gm <- function(x, group){ # medians by group:
   sapply(split(x,group),median)
so it got executed, wasting all that time.  (Of course the
same mailer will mess up this message in the same way -
just delete the sapply() call from gm().)

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com  

> -----Original Message-----
> From: hadley wickham [mailto:h.wickham at gmail.com] 
> Sent: Monday, January 05, 2009 9:10 AM
> To: William Dunlap
> Cc: gallon.li at gmail.com; R help
> Subject: Re: [R] the first and last observation for each subject
> 
> > Another application of that technique can be used to quickly compute
> > medians by groups:
> >
> > gm <- function(x, group){ # medians by group:
> > sapply(split(x,group),median)
> >   o<-order(group, x)
> >   group <- group[o]
> >   x <- x[o]
> >   changes <- group[-1] != group[-length(group)]
> >   first <- which(c(TRUE, changes))
> >   last <- which(c(changes, TRUE))
> >   lowerMedian <- x[floor((first+last)/2)]
> >   upperMedian <- x[ceiling((first+last)/2)]
> >   median <- (lowerMedian+upperMedian)/2
> >   names(median) <- group[first]
> >   median
> > }
> >
> > For a 10^5 long x and a somewhat fewer than 3*10^4 distinct groups
> > (in random order) the times are:
> >
> >> group<-sample(1:30000, size=100000, replace=TRUE)
> >> x<-rnorm(length(group))*10 + group
> >> unix.time(z0<-sapply(split(x,group), median))
> >   user  system elapsed
> >   2.72    0.00    3.20
> >> unix.time(z1<-gm(x,group))
> >   user  system elapsed
> >   0.12    0.00    0.16
> >> identical(z1,z0)
> > [1] TRUE
> 
> I get:
> 
> > unix.time(z0<-sapply(split(x,group), median))
>    user  system elapsed
>   2.733   0.017   2.766
> > unix.time(z1<-gm(x,group))
>    user  system elapsed
>   2.897   0.032   2.946
> 
> 
> Hadley
> 
> 
> -- 
> http://had.co.nz/
>