Skip to content

select observations from longitudinal data

5 messages · gallon li, Bill Venables, Peter Dalgaard +1 more

#
Let's tackle the bigger problem of doing this not just for time = 3 but for all times.

First we start with your data frame:
id time  x
1   1    1 10
2   1    2 11
3   1    3 23
4   1    4 23
5   2    2 12
6   2    3 13
7   2    4 14
8   3    1 11
9   3    3 15
10  3    4 18
11  3    5 21
12  4    2 22
13  4    3 27
14  4    6 29
### Now put the data into an id x time matrix, with gaps:
+ 	lev_id <- sort(unique(id))
+ 	lev_tm <- sort(unique(time))
+ 	M <- matrix(NA, length(lev_id), length(lev_tm))
+ 	dimnames(M) <- list(id = lev_id, time = lev_tm)
+ 	M[cbind(match(id, lev_id), match(time, lev_tm))] <- x
+ 	M
+   })
time
id   1  2  3  4  5  6
  1 10 11 23 23 NA NA
  2 NA 12 13 14 NA NA
  3 11 NA 15 18 21 NA
  4 NA 22 27 NA NA 29
### Now do the replacements 
### (this is a very questionable dodge, by the way!)
+ 	if(any(k <- is.na(mat[i-1, ]))) 
+		mat[i-1, k] <- mat[i, k]
time
id   1  2  3  4  5  6
  1 10 11 23 23 NA NA
  2 11 12 13 14 21 NA
  3 11 22 15 18 21 29
  4 NA 22 27 NA NA 29
### some gaps cannot be filled.
### now turn it back into a data frame;
### this is a splendid trick that no one knows about:
### this will look OK, but the first two columns are factors
### as we started with numeric variables (if we did) then
### it might be useful to turn them back to numerical variables
### again:
+ 	id <- as.numeric(as.character(id))
+ 	time <- as.numeric(as.character(time))
+   })
id time  x
1   1    1 10
5   1    2 11
9   1    3 23
13  1    4 23
17  1    5 NA
21  1    6 NA
2   2    1 11
6   2    2 12
10  2    3 13
14  2    4 14
18  2    5 21
22  2    6 NA
3   3    1 11
7   3    2 22
11  3    3 15
15  3    4 18
19  3    5 21
23  3    6 29
4   4    1 NA
8   4    2 22
12  4    3 27
16  4    4 NA
20  4    5 NA
24  4    6 29
### As many gaps have been filled as can be filled (with fake data!).  

### If you want to remove those still missing, you can use
Bill Venables
http://www.cmis.csiro.au/bill.venables/ 


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of gallon li
Sent: Sunday, 29 March 2009 6:51 PM
To: r-help
Subject: [R] select observations from longitudinal data

Suppose I have a long format for a longitudinal data

id time x
1 1 10
1 2 11
1 3 23
1 4 23
2 2 12
2 3 13
2 4 14
3 1 11
3 3 15
3 4 18
3 5 21
4 2 22
4 3 27
4 6 29

I want to select the x values for each ID when time is equal to 3. When that
observation is not observed, then I want to replace it with the obervation
at time equal to 4. otherwise just use NA.

How can I implement this with a quick command?


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
gallon li wrote:
with this dummy data:

    data = read.table(header=TRUE, textConnection(open='r', '
        id time x      
        2 2 2
        2 3 3
        2 4 4
        2 5 5
        3 3 3
        3 4 4
        3 5 5
        4 4 4
        4 5 5
        5 5 5'))

you seem to expect the result to be like

    # id time x
    # 2 3 3
    # 3 3 3
    # 4 4 4
    # 5 NA NA

one way to hack this is:

    # the time points you'd like to use, in order of preference
    times = 3:4

    # split the data by id,
    # for each subset, find values of x for the first time found, or use NA
    # combine the subsets back into a single data frame
    do.call(rbind, by(data, data$id, function(data)
        with(data, {
            rows = (time == times[which(times %in% time)[1]])
            if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))
    #   id time  x
    # 2  2    3  3
    # 3  3    3  3
    # 4  4    4  4
    # 5  5   NA NA

with your original data:

    data = read.table(header=TRUE, textConnection(open='r', '
       id time x
       1 1 10
       1 2 11
       1 3 23
       1 4 23
       2 2 12
       2 3 13
       2 4 14
       3 1 11
       3 3 15
       3 4 18
       3 5 21
       4 2 22
       4 3 27
       4 6 29'))
    times = 3:4
    do.call(rbind, by(data, data$id, function(data)
        with(data, {
            rows = (time == times[which(times %in% time)[1]])
            if (is.na(rows[1])) data.frame(id=id, time=NA, x=NA) else
data[rows,] })))

    #   id time  x
    # 1  1    3 23
    # 2  2    3 13
    # 3  3    3 15
    # 4  4    3 27

is this what you wanted?

vQ
#
Wacek Kusnierczyk wrote:
There's also the straightforward answer:

 > sapply(split(data,data$id), function(d) { r <- d$x[d$time==3]
+    if(!length(r)) r <- d$x[d$time==4]
+    if(!length(r)) NA
+    r})
  1  2  3  4
23 13 15 27

or, just to checkout the case where time==3 is actually missing:

 > sapply(split(data[-c(6,13),],data$id[-c(6,13)]), function(d) {
+    r <- d$x[d$time==3]
+    if(!length(r)) r <- d$x[d$time==4]
+    if(!length(r)) r <- NA
+    r})
  1  2  3  4
23 14 15 NA
#
Peter Dalgaard wrote:
indeed, and although the output is not a data frame and does not report
the time actually used, it should be easy to add this if needed.  your
solution is more efficient, and if the output is sufficient, it might be
preferable.

vQ