
extract data features from subsets

2 messages · Williams Scott, Dennis Murphy

#
I have a large dataset similar to this:

ID	time	result
A	1	5
A	2	2
A	3	1
A	4	1
A	5	1
A	6	2
A	7	3
A	8	4
B	1	3
B	2	2
B	3	4
B	4	6
B	5	8

I need to extract a number of features for each individual in it (identified by "ID"). These are:
* The lowest result (the nadir)
* The time of the nadir - but if the nadir level is present at >1 time point, I need the minimum and maximum time of nadir
* For the time period from maximum time of nadir to the last result, I need the coefficient from a lm(result~time) 

The result would be a table looking like:

ID  NadirLevel  NadirFirstTime  NadirLastTime  Slope
A   1           3               5              1
B   2           2               2              2

I can manage to extract all the required elements in a very cumbersome loop. I am sure an elegant method using apply() or the like could be devised, but I can't presently understand the necessary syntax. Any suggestions appreciated.

Thanks 
Scott
_____________________________
Dr. Scott Williams
Peter MacCallum Cancer Centre
Melbourne, Australia
ph +61 3 9656 1111
fax +61 3 9656 1424
scott.williams at petermac.org


#
Hi:

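Here's one way using the plyr package and its ddply() function. ddply()
splits a data frame by the grouping variable(s), applies a function to
each piece, and combines the results into one data frame. The function
you supply should return either a scalar or a data frame. In this case,
we want the latter.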

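library(plyr)

# Summarise one individual's records; df is the sub-data frame for one ID.
f <- function(df) {
    mn <- min(df$result)               # nadir level
    tms <- df$time[df$result == mn]    # every time at which the nadir occurs
    # Keep rows from the last nadir time onward. Selecting on the time value
    # (not the row position) also works when time does not run 1, 2, ..., n.
    subdf <- df[df$time >= max(tms), ]
    # Slope of result on time; [["time"]] drops the coefficient's name.
    b1 <- coef(lm(result ~ time, data = subdf))[["time"]]
    data.frame(NadirLevel = mn, NadirFirstTime = min(tms),
               NadirLastTime = max(tms), Slope = b1)
}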

This function takes a data frame df as input; in practice, it will be
the sub-data frame associated with one level of ID. We find the minimum
of result and assign it to mn, and then find the times that match the
minimum. Next, we build the subset running from the last nadir time to
the final observation and fit the simple linear regression to it.
Finally, an output data frame is created; ddply() will add in the ID
variable. Calling your example data frame d,

ddply(d, .(ID), f)
  ID NadirLevel NadirFirstTime NadirLastTime Slope
1  A          1              3             5     1
2  B          2              2             2     2
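
For anyone who wants to reproduce this without plyr, here is a minimal base R sketch. It rebuilds the example data from the original post as d and uses split(), lapply() and rbind() for the same split-apply-combine step. The name nadir_summary is made up for illustration and is not part of the reply above. Choosing rows by time value rather than row position is an assumption about what is wanted when times do not start at 1.

```r
# Example data from the original post, rebuilt as a data frame called d.
d <- data.frame(
  ID     = rep(c("A", "B"), times = c(8, 5)),
  time   = c(1:8, 1:5),
  result = c(5, 2, 1, 1, 1, 2, 3, 4, 3, 2, 4, 6, 8)
)

# Hypothetical helper mirroring f() in the reply: summarise one ID's records.
nadir_summary <- function(df) {
  mn  <- min(df$result)                 # nadir level
  tms <- df$time[df$result == mn]       # all times at which the nadir occurs
  subdf <- df[df$time >= max(tms), ]    # last nadir time through final result
  slope <- coef(lm(result ~ time, data = subdf))[["time"]]
  data.frame(ID = df$ID[1], NadirLevel = mn,
             NadirFirstTime = min(tms), NadirLastTime = max(tms),
             Slope = slope)
}

# Base R split-apply-combine: one sub-data frame per ID, then stack the rows.
res <- do.call(rbind, lapply(split(d, d$ID), nadir_summary))
rownames(res) <- NULL
print(res)
```

If an individual's nadir falls on their last time point, the regression has only one observation and the slope comes back as NA, so such cases may need separate handling.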

HTH,
Dennis


On Mon, Jun 6, 2011 at 10:04 PM, Williams Scott
<Scott.Williams at petermac.org> wrote: