An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090329/29a5ac93/attachment-0002.pl>
select observations from longitudinal data
5 messages · gallon li, Bill Venables, Peter Dalgaard +1 more
Let's tackle the bigger problem of doing this not just for time = 3 but for all times. First we start with your data frame:
dat
id time x 1 1 1 10 2 1 2 11 3 1 3 23 4 1 4 23 5 2 2 12 6 2 3 13 7 2 4 14 8 3 1 11 9 3 3 15 10 3 4 18 11 3 5 21 12 4 2 22 13 4 3 27 14 4 6 29
### Now put the data into an id x time matrix, with gaps:
mat <- with(dat, {
+ lev_id <- sort(unique(id)) + lev_tm <- sort(unique(time)) + M <- matrix(NA, length(lev_id), length(lev_tm)) + dimnames(M) <- list(id = lev_id, time = lev_tm) + M[cbind(match(id, lev_id), match(time, lev_tm))] <- x + M + })
mat
time id 1 2 3 4 5 6 1 10 11 23 23 NA NA 2 NA 12 13 14 NA NA 3 11 NA 15 18 21 NA 4 NA 22 27 NA NA 29
### Now do the replacements ### (this is a very questionable dodge, by the way!)
for(i in 2:nrow(mat))
+ if(any(k <- is.na(mat[i-1, ]))) + mat[i-1, k] <- mat[i, k]
mat
time id 1 2 3 4 5 6 1 10 11 23 23 NA NA 2 11 12 13 14 21 NA 3 11 22 15 18 21 29 4 NA 22 27 NA NA 29
### some gaps cannot be filled. ### now turn it back into a data frame; ### this is a splendid trick that no one knows about:
dat <- as.data.frame(as.table(mat), responseName = "x") dat <- with(dat, dat[order(id, time), ])
### this will look OK, but the first two columns are factors ### as we started with numeric variables (if we did) then ### it might be useful to turn them back to numerical variables ### again:
dat <- within(dat, {
+ id <- as.numeric(as.character(id)) + time <- as.numeric(as.character(time)) + })
dat
id time x 1 1 1 10 5 1 2 11 9 1 3 23 13 1 4 23 17 1 5 NA 21 1 6 NA 2 2 1 11 6 2 2 12 10 2 3 13 14 2 4 14 18 2 5 21 22 2 6 NA 3 3 1 11 7 3 2 22 11 3 3 15 15 3 4 18 19 3 5 21 23 3 6 29 4 4 1 NA 8 4 2 22 12 4 3 27 16 4 4 NA 20 4 5 NA 24 4 6 29
### As many gaps have been filled as can be filled (with fake data!). ### If you want to remove those still missing, you can use
dat <- na.omit(dat)
Bill Venables http://www.cmis.csiro.au/bill.venables/ -----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of gallon li Sent: Sunday, 29 March 2009 6:51 PM To: r-help Subject: [R] select observations from longitudinal data Suppose I have a long format for a longitudinal data id time x 1 1 10 1 2 11 1 3 23 1 4 23 2 2 12 2 3 13 2 4 14 3 1 11 3 3 15 3 4 18 3 5 21 4 2 22 4 3 27 4 6 29 I want to select the x values for each ID when time is equal to 3. When that observation is not observed, then I want to replace it with the obervation at time equal to 4. otherwise just use NA. How can I implement this with a quick command? ______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
gallon li wrote:
Suppose I have a long format for a longitudinal data id time x 1 1 10 1 2 11 1 3 23 1 4 23 2 2 12 2 3 13 2 4 14 3 1 11 3 3 15 3 4 18 3 5 21 4 2 22 4 3 27 4 6 29 I want to select the x values for each ID when time is equal to 3. When that observation is not observed, then I want to replace it with the obervation at time equal to 4. otherwise just use NA.
with this dummy data:
data = read.table(header=TRUE, textConnection(open='r', '
id time x
2 2 2
2 3 3
2 4 4
2 5 5
3 3 3
3 4 4
3 5 5
4 4 4
4 5 5
5 5 5'))
you seem to expect the result to be like
# id time x
# 2 3 3
# 3 3 3
# 4 4 4
# 5 NA NA
one way to hack this is:
# the time points you'd like to use, in order of preference
times = 3:4
# split the data by id,
# for each subset, find values of x for the first time found, or use NA
# combine the subsets back into a single data frame
do.call(rbind, by(data, data$id, function(data)
with(data, {
rows = (time == times[which(times %in% time)[1]])
if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))
# id time x
# 2 2 3 3
# 3 3 3 3
# 4 4 4 4
# 5 5 NA NA
with your original data:
data = read.table(header=TRUE, textConnection(open='r', '
id time x
1 1 10
1 2 11
1 3 23
1 4 23
2 2 12
2 3 13
2 4 14
3 1 11
3 3 15
3 4 18
3 5 21
4 2 22
4 3 27
4 6 29'))
times = 3:4
do.call(rbind, by(data, data$id, function(data)
with(data, {
rows = (time == times[which(times %in% time)[1]])
if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))
# id time x
# 1 1 3 23
# 2 2 3 13
# 3 3 3 15
# 4 4 3 27
is this what you wanted?
vQ
Wacek Kusnierczyk wrote:
gallon li wrote:
Suppose I have a long format for a longitudinal data id time x 1 1 10 1 2 11 1 3 23 1 4 23 2 2 12 2 3 13 2 4 14 3 1 11 3 3 15 3 4 18 3 5 21 4 2 22 4 3 27 4 6 29 I want to select the x values for each ID when time is equal to 3. When that observation is not observed, then I want to replace it with the obervation at time equal to 4. otherwise just use NA.
with this dummy data:
data = read.table(header=TRUE, textConnection(open='r', '
id time x
2 2 2
2 3 3
2 4 4
2 5 5
3 3 3
3 4 4
3 5 5
4 4 4
4 5 5
5 5 5'))
you seem to expect the result to be like
# id time x
# 2 3 3
# 3 3 3
# 4 4 4
# 5 NA NA
one way to hack this is:
# the time points you'd like to use, in order of preference
times = 3:4
# split the data by id,
# for each subset, find values of x for the first time found, or use NA
# combine the subsets back into a single data frame
do.call(rbind, by(data, data$id, function(data)
with(data, {
rows = (time == times[which(times %in% time)[1]])
if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))
# id time x
# 2 2 3 3
# 3 3 3 3
# 4 4 4 4
# 5 5 NA NA
with your original data:
data = read.table(header=TRUE, textConnection(open='r', '
id time x
1 1 10
1 2 11
1 3 23
1 4 23
2 2 12
2 3 13
2 4 14
3 1 11
3 3 15
3 4 18
3 5 21
4 2 22
4 3 27
4 6 29'))
times = 3:4
do.call(rbind, by(data, data$id, function(data)
with(data, {
rows = (time == times[which(times %in% time)[1]])
if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))
# id time x
# 1 1 3 23
# 2 2 3 13
# 3 3 3 15
# 4 4 3 27
is this what you wanted?
There's also the straightforward answer:
> sapply(split(data,data$id), function(d) { r <- d$x[d$time==3]
+ if(!length(r)) r <- d$x[d$time==4]
+ if(!length(r)) NA
+ r})
1 2 3 4
23 13 15 27
or, just to checkout the case where time==3 is actually missing:
> sapply(split(data[-c(6,13),],data$id[-c(6,13)]), function(d) {
+ r <- d$x[d$time==3]
+ if(!length(r)) r <- d$x[d$time==4]
+ if(!length(r)) r <- NA
+ r})
1 2 3 4
23 14 15 NA
O__ ---- Peter Dalgaard ?ster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
Peter Dalgaard wrote:
times = 3:4
do.call(rbind, by(data, data$id, function(data)
with(data, {
rows = (time == times[which(times %in% time)[1]])
if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))
# id time x
# 1 1 3 23
# 2 2 3 13
# 3 3 3 15
# 4 4 3 27
is this what you wanted?
There's also the straightforward answer:
sapply(split(data,data$id), function(d) { r <- d$x[d$time==3]
+ if(!length(r)) r <- d$x[d$time==4] + if(!length(r)) NA + r}) 1 2 3 4 23 13 15 27 or, just to checkout the case where time==3 is actually missing:
sapply(split(data[-c(6,13),],data$id[-c(6,13)]), function(d) {
+ r <- d$x[d$time==3] + if(!length(r)) r <- d$x[d$time==4] + if(!length(r)) r <- NA + r}) 1 2 3 4 23 14 15 NA
indeed, and although the output is not a data frame and does not report the time actually used, it should be easy to add this if needed. your solution is more efficient, and if the output is sufficient, it might be preferable. vQ