
long format - find age when another variable is first 'high'

5 messages · David Freedman, ONKELINX, Thierry, Marc Schwartz +2 more

#
Dear R, 

I've got a data frame with children examined multiple times and at various
ages.  I'm trying to find the first age at which another variable
(LDL-Cholesterol) is >= 130 mg/dL; for some children, this may never happen. 
I can do this with transformBy and ddply, but with 10,000 different
children, these functions take some time on my PCs - is there a faster way
to do this in R?  My code on a small dataset follows.  

Thanks very much, David Freedman

d<-data.frame(id=c(rep(1,3),rep(2,2),3),age=c(5,10,15,4,7,12),ldlc=c(132,120,125,105,142,160))
d$high.ldlc<-ifelse(d$ldlc>=130,1,0)
d
library(plyr)
d2<-ddply(d,~id,transform,plyr.minage=min(age[high.ldlc==1]));
library(doBy)
d2<-transformBy(~id,da=d2,doby.minage=min(age[high.ldlc==1]));
d2
#
Dear David,

You would speed things up if you first create a subset where all values
of ldlc are >= 130. Then you only have to find the lowest age for each
child in this subset.

HTH,

Thierry


------------------------------------------------------------------------
----
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature
and Forest
Cel biometrie, methodologie en kwaliteitszorg / Section biometrics,
methodology and quality assurance
Gaverstraat 4
9500 Geraardsbergen
Belgium
tel. + 32 54/436 185
Thierry.Onkelinx at inbo.be
www.inbo.be


#
On May 25, 2009, at 7:45 AM, David Freedman wrote:

            
The first thing that I would do is to get rid of records that are not  
relevant to your question:

> d
  id age ldlc high.ldlc
1  1   5  132         1
2  1  10  120         0
3  1  15  125         0
4  2   4  105         0
5  2   7  142         1
6  3  12  160         1


# Get records with high ldl
d.new <- subset(d, ldlc >= 130)


> d.new
  id age ldlc high.ldlc
1  1   5  132         1
5  2   7  142         1
6  3  12  160         1


That will help to reduce the total size of the dataset, perhaps
substantially. It will also remove entire subjects that are not
relevant (e.g., those who never have LDL >= 130).

Then get the minimum age for each of the remaining subjects:

> aggregate(d.new$age, list(id = d.new$id), min)
  id  x
1  1  5
2  2  7
3  3 12


Try that to see what sort of time reduction you observe.
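
One rough way to check the time reduction on data closer to the size David describes is sketched below. The simulated data frame `big`, its size, and the LDL distribution are assumptions made up for illustration. Only base R (`subset`, `aggregate`, `system.time`) is used.

```r
# Simulated data (hypothetical): 10,000 children with 5 visits each
set.seed(1)
n.id <- 10000
big <- data.frame(id   = rep(seq_len(n.id), each = 5),
                  age  = rep(c(5, 8, 11, 14, 17), times = n.id),
                  ldlc = round(rnorm(5 * n.id, mean = 110, sd = 20)))

# Time the subset-then-aggregate approach
timing <- system.time({
  big.high   <- subset(big, ldlc >= 130)
  first.high <- aggregate(big.high$age, list(id = big.high$id), min)
})
names(first.high)[2] <- "min.age"   # aggregate() names the result column "x"

timing
head(first.high)
```

Wrapping the ddply or transformBy call in `system.time()` on the same `big` data frame gives a comparable figure for the original approach.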

HTH,

Marc Schwartz
#
Depending on what you want (I haven't checked the speed), you could try
this one, where we have changed the ldlc in the first row so that id=1
has no value >= 130, just to illustrate that case as well:
+  ldlc=c(122, 120, 125, 105, 142, 160))
  id age ldlc min_age
1  1   5  122    <NA>
2  1  10  120    <NA>
3  1  15  125    <NA>
4  2   4  105     7.0
5  2   7  142     7.0
6  3  12  160    12.0

  id min_age
1  1    <NA>
2  2     7.0
3  3    12.0

  id min(age)
1  2        7
2  3       12
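
Results of the same shape can also be obtained in base R. The following is only a sketch and not the sqldf code from this message. The names `first.age`, `by.id` and `high.only` are hypothetical, introduced for this example.

```r
# Same data as in the thread, with the first ldlc changed so that
# id 1 never reaches 130
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(122, 120, 125, 105, 142, 160))

# Minimum age among the high-LDL records, one value per child that has any
high      <- d[d$ldlc >= 130, ]
first.age <- tapply(high$age, high$id, min)   # names are the ids, as strings

# One row per child; children absent from 'first.age' get NA
by.id <- data.frame(id = unique(d$id))
by.id$min_age <- as.vector(first.age)[match(as.character(by.id$id),
                                            names(first.age))]

# Per-visit version: attach each child's min_age to every row
d$min_age <- by.id$min_age[match(d$id, by.id$id)]

# Only the children with at least one high value
high.only <- by.id[!is.na(by.id$min_age), ]

d
by.id
high.only
```

Using `match()` against the per-child result gives `NA` for children who never reach the threshold, whereas `min()` on an empty vector returns `Inf` with a warning.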

See sqldf home page at:
http://sqldf.googlecode.com
On Mon, May 25, 2009 at 8:45 AM, David Freedman <3.14david at gmail.com> wrote:
1 day later
#
If the dataset has a lot of rows you can save more time
by replacing the call to aggregate(age,id,min) with code that sorts
the filtered data by 'id', breaking ties with 'age', and
then picks out the elements just after a change in the
value of 'id':
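    f <- function(d) {
         # sort by id, breaking ties by age
         dSorted <- d[order(d$id, d$age), ]
         n <- length(d$id) # or nrow(d)
         # keep the first row, plus every row whose id differs from the row above
         dSorted[c(TRUE, dSorted$id[-1] != dSorted$id[-n]), ]
    }
    f(d.new) # or f(d[d$ldlc>=130,]) to avoid leaving around the temp variable.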
If you know your dataset is already sorted in this way, you
need only the last line of that function.
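
As a quick check of this approach, the sketch below compares the sort-based function with aggregate on the small example from the thread and shows the shortcut for pre-sorted data. The name `first.rows.sorted` is hypothetical, and `f` is repeated (using `nrow`) so that the block runs on its own.

```r
d <- data.frame(id   = c(rep(1, 3), rep(2, 2), 3),
                age  = c(5, 10, 15, 4, 7, 12),
                ldlc = c(132, 120, 125, 105, 142, 160))
d.new <- d[d$ldlc >= 130, ]

# Sort by id, then age, and keep the first row of each id.
# Assumes the input has at least one row.
f <- function(d) {
  dSorted <- d[order(d$id, d$age), ]
  n <- nrow(dSorted)
  dSorted[c(TRUE, dSorted$id[-1] != dSorted$id[-n]), ]
}

# Shortcut (hypothetical name) for data already sorted by id and age:
# the sort is skipped and only the row selection remains.
first.rows.sorted <- function(dSorted) {
  n <- nrow(dSorted)
  dSorted[c(TRUE, dSorted$id[-1] != dSorted$id[-n]), ]
}

by.sort <- f(d.new)
by.agg  <- aggregate(d.new$age, list(id = d.new$id), min)

# Both approaches should give the same first age per child
all(by.sort$age == by.agg$x)
```

Unlike aggregate, the sort-based version returns whole rows, so the ldlc value at the first high visit comes along with the age.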

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com