I have a large dataset similar to this:

ID  time  result
A   1     5
A   2     2
A   3     1
A   4     1
A   5     1
A   6     2
A   7     3
A   8     4
B   1     3
B   2     2
B   3     4
B   4     6
B   5     8

I need to extract a number of features for each individual in it (identified by "ID"). These are:

* The lowest result (the nadir)
* The time of the nadir - but if the nadir level is present at >1 time point, I need the minimum and maximum time of nadir
* For the time period from maximum time of nadir to the last result, I need the coefficient from a lm(result~time)

The result would be a table looking like:

ID  NadirLevel  NadirFirstTime  NadirLastTime  Slope
A   1           3               5              1
B   2           2               2              2

I can manage to extract all the required elements in a very cumbersome loop, and I am sure an elegant method using apply() or the like could be devised, but I can't presently understand the necessary syntax. Any suggestions appreciated.

Thanks
Scott
_____________________________
Dr. Scott Williams
Peter MacCallum Cancer Centre
Melbourne, Australia
extract data features from subsets
2 messages · Williams Scott, Dennis Murphy
Hi:
Here's one way using package plyr and its ddply() function. ddply()
takes a data frame as input, applies a function to each subset, and
expects that function to return either a scalar or a data frame. In
this case, we want the latter.
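library(plyr)

f <- function(df) {
  mn <- min(df$result)                  # nadir level
  tms <- df$time[df$result == mn]       # all times at which the nadir occurs
  # Subset by time value rather than row position, so this still works
  # when time does not coincide with the row number within an ID
  subdf <- df[df$time >= max(tms), ]
  b1 <- coef(lm(result ~ time, data = subdf))[["time"]]
  data.frame(NadirLevel = mn, NadirFirstTime = min(tms),
             NadirLastTime = max(tms), Slope = b1)
}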
This function takes a data frame df as input - in practice, it will be
a sub-data frame associated with a level of ID. We find the minimum of
result and assign it to mn, and then find the times that match the
minimum.
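To make these first two steps concrete, here is a small sketch that rebuilds the example data from the original post and applies them by hand to the subset for ID A. The object name a is only illustrative, and the data frame is called d to match the call used in this reply.

```r
# Example data from the original post
d <- data.frame(
  ID     = rep(c("A", "B"), times = c(8, 5)),
  time   = c(1:8, 1:5),
  result = c(5, 2, 1, 1, 1, 2, 3, 4, 3, 2, 4, 6, 8)
)

# The sub-data frame that f() would receive for ID == "A"
a <- d[d$ID == "A", ]

mn  <- min(a$result)              # nadir level: 1
tms <- a$time[a$result == mn]     # times at the nadir: 3, 4, 5
range(tms)                        # first and last nadir time: 3 and 5
```

Since the nadir can occur at several time points, tms is a vector, and min(tms) and max(tms) supply NadirFirstTime and NadirLastTime.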
Next, we construct the sub-data frame, running from the last nadir
time to the final observation, on which to fit the simple linear
regression. Finally, an output data frame is created; ddply() adds
the ID variable automatically. Calling your example data frame d,
ddply(d, 'ID', f)
  ID NadirLevel NadirFirstTime NadirLastTime Slope
1  A          1              3             5     1
2  B          2              2             2     2

HTH,
Dennis

On Mon, Jun 6, 2011 at 10:04 PM, Williams Scott
<Scott.Williams at petermac.org> wrote: