Selecting n observation

4 messages · bibek sharma, Peter Ehlers, David Winsemius +1 more

#
Hello R help,
I have a question similar to one that was posted here before. The
difference is that instead of the last assessment, I want to select
the last two.

I have a data set with several time assessments for each participant.
I want to select the last two assessments for each participant. My
dataset looks like this:
ID  week  outcome
1   2   14
1   4   28
1   6   42
4   2   14
4   6   46
4   9   64
4   9   71
4  12   85
9   2   14
9   4   28
9   6   51
9   9   66
9  12   84
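
To run the code in this thread, the table above first has to be in a data frame. A minimal sketch, assuming the name df used in the code in this thread; reading the table with read.table(text = ...) is just one convenient way to do it:

```r
# Build the example data from the question as a data frame named df
df <- read.table(header = TRUE, text = "
ID week outcome
1   2   14
1   4   28
1   6   42
4   2   14
4   6   46
4   9   64
4   9   71
4  12   85
9   2   14
9   4   28
9   6   51
9   9   66
9  12   84
")
str(df)
```

Note that ID 4 has two rows for week 9, so "last" depends on row order as well as on week.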

Here is one solution for choosing the last assessment:
do.call("rbind",
        by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week), ]))
  ID week outcome
1  1    6      42
4  4   12      85
9  9   12      84
#
On 2012-10-11 12:48, bibek sharma wrote:
With the plyr package:

   library(plyr)
   ddply(df, .(ID), function(x) tail(x, 2))

or, slightly simpler:

   ddply(df, .(ID), tail, 2)

Peter Ehlers
#
On Oct 11, 2012, at 12:48 PM, bibek sharma wrote:

            
Why wouldn't the solution be something along the lines of:

do.call("rbind",
       by(df, INDICES=df$ID, FUN=function(DF) tail(DF, 2) ))
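
Note that tail() returns the last rows in whatever order they are stored, so this assumes the rows are already sorted by week within each ID. If that is not guaranteed, one option is to sort first. A minimal sketch, with the example data rebuilt and deliberately shuffled so the block runs on its own (the names shuffled, sorted and last2 and the seed are illustrative assumptions):

```r
# Example data from the question
df <- data.frame(
  ID      = c(1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9),
  week    = c(2, 4, 6, 2, 6, 9, 9, 12, 2, 4, 6, 9, 12),
  outcome = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 66, 84)
)

set.seed(42)                        # arbitrary seed, only for the shuffle
shuffled <- df[sample(nrow(df)), ]  # simulate unsorted input

# Sort by ID and week, then keep the last two rows per ID.
# Rows tied on week (ID 4, week 9) keep their stored relative order.
sorted <- shuffled[order(shuffled$ID, shuffled$week), ]
last2 <- do.call("rbind",
                 by(sorted, INDICES = sorted$ID, FUN = function(DF) tail(DF, 2)))
last2
```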

David Winsemius, MD
Alameda, CA, USA
#
Another way to approach this sort of problem is to use ave() to
assign a within-group sequence number to each row and then
select the rows with the sequence numbers you want.  You can
also use ave() to make a column giving the size of the group that
each item is in.  Hence you can select things like "the last 2 items
in each category that had at least 3 items".

E.g., here is a function to generate data on visits of patients to
a clinic, where the visits are listed in time order.

makeData <- function(nVisits, Doctors=paste("Dr.",LETTERS[1:2]), Patients=101:104, seed = 1)
{
    if (!is.null(seed)) set.seed(seed)
    data.frame(Doctor=sample(Doctors, replace=TRUE, nVisits),
               Patient=sample(Patients, replace=TRUE, nVisits),
               Date=as.Date("2004-01-01")+sort(sample(2000, replace=TRUE, nVisits)))
}
# Make a 12-row dataset
d <- makeData(12)
# Add columns describing the visits between each doctor/patient pair
d1 <- within(d, { N=ave(integer(length(Date)), Doctor, Patient, FUN=length)
                    Seq=ave(integer(length(Date)), Doctor, Patient, FUN=seq_along)})
d1
#    Doctor Patient       Date Seq N
# 1   Dr. A     103 2004-01-28   1 3
# 2   Dr. A     102 2005-01-08   1 1
# 3   Dr. B     104 2005-06-19   1 4
# 4   Dr. B     102 2005-11-12   1 2
# 5   Dr. A     103 2006-02-04   2 3
# 6   Dr. B     104 2006-02-12   2 4
# 7   Dr. B     102 2006-08-23   2 2
# 8   Dr. B     104 2006-09-15   3 4
# 9   Dr. B     104 2007-04-15   4 4
# 10  Dr. A     101 2007-08-30   1 2
# 11  Dr. A     103 2008-07-13   3 3
# 12  Dr. A     101 2008-10-06   2 2

# Show the last visit in each doctor/patient group
d[d1$Seq==d1$N, ]
#    Doctor Patient       Date
# 2   Dr. A     102 2005-01-08
# 7   Dr. B     102 2006-08-23
# 9   Dr. B     104 2007-04-15
# 11  Dr. A     103 2008-07-13
# 12  Dr. A     101 2008-10-06

# Show last 2 visits, but only if there were at least 2 visits
d[d1$Seq>d1$N-2 & d1$N>=2, ]
#    Doctor Patient       Date
# 4   Dr. B     102 2005-11-12
# 5   Dr. A     103 2006-02-04
# 7   Dr. B     102 2006-08-23
# 8   Dr. B     104 2006-09-15
# 9   Dr. B     104 2007-04-15
# 10  Dr. A     101 2007-08-30
# 11  Dr. A     103 2008-07-13
# 12  Dr. A     101 2008-10-06

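# Show the amount of time between the last two visits in a group (if there were at least 2 visits).
# Sort by group first: the subtraction pairs rows by position, so both
# selections must list the groups in the same order.
ds <- d1[order(d1$Doctor, d1$Patient, d1$Seq), ]
ds[ds$Seq==ds$N & ds$N>=2, "Date"] - ds[ds$Seq==ds$N-1 & ds$N>=2, "Date"]
# Time differences in days
# [1] 403 890 284 212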

I find it easier to formulate the queries with this method.  For large
datasets, selecting rows according to a criterion can be a lot
faster than splitting a data.frame into many parts, processing
them with tail, and combining them again.
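
Applied to the data in the original question, the same approach might look like the sketch below. It assumes the rows are already in time order within each ID; the column names Seq and N simply follow the example above, and last2 and atLeast4 are illustrative names.

```r
# Example data from the question
df <- data.frame(
  ID      = c(1, 1, 1, 4, 4, 4, 4, 4, 9, 9, 9, 9, 9),
  week    = c(2, 4, 6, 2, 6, 9, 9, 12, 2, 4, 6, 9, 12),
  outcome = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 66, 84)
)

# Within-ID sequence number and group size for every row
df$Seq <- ave(seq_len(nrow(df)), df$ID, FUN = seq_along)
df$N   <- ave(seq_len(nrow(df)), df$ID, FUN = length)

# Last two assessments for each participant
last2 <- df[df$Seq > df$N - 2, c("ID", "week", "outcome")]
last2

# Same, but only for participants with at least 4 assessments
atLeast4 <- df[df$Seq > df$N - 2 & df$N >= 4, c("ID", "week", "outcome")]
atLeast4
```

No splitting or recombining is involved: each query is a single logical index into the data frame.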

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com