long format - find age when another variable is first 'high'
On May 25, 2009, at 7:45 AM, David Freedman wrote:
Dear R,

I've got a data frame with children examined multiple times and at various ages. I'm trying to find the first age at which another variable (LDL-Cholesterol) is >= 130 mg/dL; for some children, this may never happen. I can do this with transformBy and ddply, but with 10,000 different children, these functions take some time on my PCs - is there a faster way to do this in R? My code on a small dataset follows.

Thanks very much,
David Freedman

d<-data.frame(id=c(rep(1,3),rep(2,2), 3),age=c(5,10,15,4,7,12),ldlc=c(132,120,125,105,142,160))
d$high.ldlc<-ifelse(d$ldlc>=130,1,0)
d
library(plyr)
d2<-ddply(d,~id,transform,plyr.minage=min(age[high.ldlc==1]));
library(doBy)
d2<-transformBy(~id,da=d2,doby.minage=min(age[high.ldlc==1]));
d2
The first thing that I would do is to get rid of records that are not relevant to your question:

> d
  id age ldlc high.ldlc
1  1   5  132         1
2  1  10  120         0
3  1  15  125         0
4  2   4  105         0
5  2   7  142         1
6  3  12  160         1

# Get records with high ldl
d.new <- subset(d, ldlc >= 130)

> d.new
  id age ldlc high.ldlc
1  1   5  132         1
5  2   7  142         1
6  3  12  160         1

That will help to reduce the total size of the dataset, perhaps substantially. It will also remove entire subjects that are not relevant (eg. never have LDL >= 130).

Then get the minimum age for each of the remaining subjects:

> aggregate(d.new$age, list(id = d.new$id), min)
  id  x
1  1  5
2  2  7
3  3 12

Try that to see what sort of time reduction you observe.

HTH,

Marc Schwartz
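
If the minimum age is needed as a column on every row, as in the ddply/transformBy versions, one possible sketch is to merge the aggregated result back onto the full data frame. This uses only base R. The extra child with id 4 (who never reaches 130) and the column name minage are assumptions added for illustration; they are not in the original data.

```r
# Toy data from the original post, plus a hypothetical child (id 4)
# whose LDL never reaches 130 (an assumption, added for illustration)
d <- data.frame(id   = c(1, 1, 1, 2, 2, 3, 4),
                age  = c(5, 10, 15, 4, 7, 12, 6),
                ldlc = c(132, 120, 125, 105, 142, 160, 110))

# Keep only the high-LDL records, then take the minimum age per child
d.new  <- subset(d, ldlc >= 130)
minage <- aggregate(age ~ id, data = d.new, FUN = min)
names(minage)[2] <- "minage"   # hypothetical column name

# Merge back onto the full data; all.x = TRUE keeps children with
# no high value, and gives them NA for minage
d2 <- merge(d, minage, by = "id", all.x = TRUE)
d2
```

Children who never have a high value end up with NA rather than being dropped, which may or may not be what is wanted. Whether this is faster than ddply on 10,000 children would need to be timed, e.g. with system.time().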