select observations from longitudinal data

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090329/29a5ac93/attachment-0002.pl>
Let's tackle the bigger problem of doing this not just for time = 3 but for all times.

First we start with your data frame:
dat
id time  x
1   1    1 10
2   1    2 11
3   1    3 23
4   1    4 23
5   2    2 12
6   2    3 13
7   2    4 14
8   3    1 11
9   3    3 15
10  3    4 18
11  3    5 21
12  4    2 22
13  4    3 27
14  4    6 29

### Now put the data into an id x time matrix, with gaps:
mat <- with(dat, {
+ 	lev_id <- sort(unique(id))
+ 	lev_tm <- sort(unique(time))
+ 	M <- matrix(NA, length(lev_id), length(lev_tm))
+ 	dimnames(M) <- list(id = lev_id, time = lev_tm)
+ 	M[cbind(match(id, lev_id), match(time, lev_tm))] <- x
+ 	M
+   })
mat
time
id   1  2  3  4  5  6
  1 10 11 23 23 NA NA
  2 NA 12 13 14 NA NA
  3 11 NA 15 18 21 NA
  4 NA 22 27 NA NA 29

### Now do the replacements 
### (this is a very questionable dodge, by the way!)
for(i in 2:nrow(mat)) 
+ 	if(any(k <- is.na(mat[i-1, ]))) 
+		mat[i-1, k] <- mat[i, k]
mat
time
id   1  2  3  4  5  6
  1 10 11 23 23 NA NA
  2 11 12 13 14 21 NA
  3 11 22 15 18 21 29
  4 NA 22 27 NA NA 29

### some gaps cannot be filled.
### now turn it back into a data frame;
### this is a splendid trick that no one knows about:
dat <- as.data.frame(as.table(mat), responseName = "x")
dat <- with(dat, dat[order(id, time), ])

### this will look OK, but the first two columns are factors
### as we started with numeric variables (if we did) then
### it might be useful to turn them back to numerical variables
### again:
dat <- within(dat, {
+ 	id <- as.numeric(as.character(id))
+ 	time <- as.numeric(as.character(time))
+   })
dat
id time  x
1   1    1 10
5   1    2 11
9   1    3 23
13  1    4 23
17  1    5 NA
21  1    6 NA
2   2    1 11
6   2    2 12
10  2    3 13
14  2    4 14
18  2    5 21
22  2    6 NA
3   3    1 11
7   3    2 22
11  3    3 15
15  3    4 18
19  3    5 21
23  3    6 29
4   4    1 NA
8   4    2 22
12  4    3 27
16  4    4 NA
20  4    5 NA
24  4    6 29

### As many gaps have been filled as can be filled (with fake data!).  

### If you want to remove those still missing, you can use
dat <- na.omit(dat)
Bill Venables
http://www.cmis.csiro.au/bill.venables/ 

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of gallon li
Sent: Sunday, 29 March 2009 6:51 PM
To: r-help
Subject: [R] select observations from longitudinal data

Suppose I have a long format for a longitudinal data

id time x
1 1 10
1 2 11
1 3 23
1 4 23
2 2 12
2 3 13
2 4 14
3 1 11
3 3 15
3 4 18
3 5 21
4 2 22
4 3 27
4 6 29

I want to select the x values for each ID when time is equal to 3. When that
observation is not observed, then I want to replace it with the obervation
at time equal to 4. otherwise just use NA.

How can I implement this with a quick command?

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Suppose I have a long format for a longitudinal data

id time x
1 1 10
1 2 11
1 3 23
1 4 23
2 2 12
2 3 13
2 4 14
3 1 11
3 3 15
3 4 18
3 5 21
4 2 22
4 3 27
4 6 29

I want to select the x values for each ID when time is equal to 3. When that
observation is not observed, then I want to replace it with the obervation
at time equal to 4. otherwise just use NA.

with this dummy data:

    data = read.table(header=TRUE, textConnection(open='r', '
        id time x      
        2 2 2
        2 3 3
        2 4 4
        2 5 5
        3 3 3
        3 4 4
        3 5 5
        4 4 4
        4 5 5
        5 5 5'))

you seem to expect the result to be like

    # id time x
    # 2 3 3
    # 3 3 3
    # 4 4 4
    # 5 NA NA

one way to hack this is:

    # the time points you'd like to use, in order of preference
    times = 3:4

    # split the data by id,
    # for each subset, find values of x for the first time found, or use NA
    # combine the subsets back into a single data frame
    do.call(rbind, by(data, data$id, function(data)
        with(data, {
            rows = (time == times[which(times %in% time)[1]])
            if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))
    #   id time  x
    # 2  2    3  3
    # 3  3    3  3
    # 4  4    4  4
    # 5  5   NA NA

with your original data:

    data = read.table(header=TRUE, textConnection(open='r', '
       id time x
       1 1 10
       1 2 11
       1 3 23
       1 4 23
       2 2 12
       2 3 13
       2 4 14
       3 1 11
       3 3 15
       3 4 18
       3 5 21
       4 2 22
       4 3 27
       4 6 29'))
    times = 3:4
    do.call(rbind, by(data, data$id, function(data)
        with(data, {
            rows = (time == times[which(times %in% time)[1]])
            if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))

    #   id time  x
    # 1  1    3 23
    # 2  2    3 13
    # 3  3    3 15
    # 4  4    3 27

is this what you wanted?

vQ
gallon li wrote:
Suppose I have a long format for a longitudinal data

id time x
1 1 10
1 2 11
1 3 23
1 4 23
2 2 12
2 3 13
2 4 14
3 1 11
3 3 15
3 4 18
3 5 21
4 2 22
4 3 27
4 6 29

I want to select the x values for each ID when time is equal to 3. When that
observation is not observed, then I want to replace it with the obervation
at time equal to 4. otherwise just use NA.

with this dummy data:

    data = read.table(header=TRUE, textConnection(open='r', '
        id time x      
        2 2 2
        2 3 3
        2 4 4
        2 5 5
        3 3 3
        3 4 4
        3 5 5
        4 4 4
        4 5 5
        5 5 5'))

you seem to expect the result to be like

    # id time x
    # 2 3 3
    # 3 3 3
    # 4 4 4
    # 5 NA NA

one way to hack this is:

    # the time points you'd like to use, in order of preference
    times = 3:4

    # split the data by id,
    # for each subset, find values of x for the first time found, or use NA
    # combine the subsets back into a single data frame
    do.call(rbind, by(data, data$id, function(data)
        with(data, {
            rows = (time == times[which(times %in% time)[1]])
            if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))
    #   id time  x
    # 2  2    3  3
    # 3  3    3  3
    # 4  4    4  4
    # 5  5   NA NA

with your original data:

    data = read.table(header=TRUE, textConnection(open='r', '
       id time x
       1 1 10
       1 2 11
       1 3 23
       1 4 23
       2 2 12
       2 3 13
       2 4 14
       3 1 11
       3 3 15
       3 4 18
       3 5 21
       4 2 22
       4 3 27
       4 6 29'))
    times = 3:4
    do.call(rbind, by(data, data$id, function(data)
        with(data, {
            rows = (time == times[which(times %in% time)[1]])
            if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))

    #   id time  x
    # 1  1    3 23
    # 2  2    3 13
    # 3  3    3 15
    # 4  4    3 27

is this what you wanted?
There's also the straightforward answer:

 > sapply(split(data,data$id), function(d) { r <- d$x[d$time==3]
+    if(!length(r)) r <- d$x[d$time==4]
+    if(!length(r)) NA
+    r})
  1  2  3  4
23 13 15 27

or, just to checkout the case where time==3 is actually missing:

 > sapply(split(data[-c(6,13),],data$id[-c(6,13)]), function(d) {
+    r <- d$x[d$time==3]
+    if(!length(r)) r <- d$x[d$time==4]
+    if(!length(r)) r <- NA
+    r})
  1  2  3  4
23 14 15 NA
O__  ---- Peter Dalgaard             ?ster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907

    times = 3:4
    do.call(rbind, by(data, data$id, function(data)
        with(data, {
            rows = (time == times[which(times %in% time)[1]])
            if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))

    #   id time  x
    # 1  1    3 23
    # 2  2    3 13
    # 3  3    3 15
    # 4  4    3 27

is this what you wanted?
There's also the straightforward answer:

sapply(split(data,data$id), function(d) { r <- d$x[d$time==3]
+    if(!length(r)) r <- d$x[d$time==4]
+    if(!length(r)) NA
+    r})
 1  2  3  4
23 13 15 27

or, just to checkout the case where time==3 is actually missing:

sapply(split(data[-c(6,13),],data$id[-c(6,13)]), function(d) {
+    r <- d$x[d$time==3]
+    if(!length(r)) r <- d$x[d$time==4]
+    if(!length(r)) r <- NA
+    r})
 1  2  3  4
23 14 15 NA
indeed, and although the output is not a data frame and does not report
the time actually used, it should be easy to add this if needed.  your
solution is more efficient, and if the output is sufficient, it might be
preferable.

vQ