Optimize for loop / find last record for each person - R-help

Fri, Feb 27, 2009 1:10 PM #

Andrew, it makes it easier to help if you supply a typical
input and expected output along with your code.  I tried
your code with the following input:

person_id date
1       Mary    1
2       Mary    2
3        Sue    3
4       Alex    4
5        Joe    5
6       Alex    6
7       Alex    7
8        Sue    8
9        Sue    9
10       Joe   10
made with the function
f<-function(n){
   cached.rs <- .Random.seed
   on.exit(.Random.seed<<-cached.rs)
   set.seed(1)
   data.frame(person_id=sample(c("Joe","Mary","Sue","Alex"),
        size=n,replace=TRUE), date=seq_len(n))
}
and it failed because there was no column called 'order'.

The following function, f2, does what I think you are saying
you want.  It sorts the data by person_id, breaking ties with
date, and then selects the rows where the person_id entry does
not match the person_id entry in the next row.  It then sorts
the result by date.  (I don't know if the last sort it needed
in your application.)  It should be pretty quick for long
datasets with lots of distinct person_id values.

f2 <-function (history)
{
   # assume history has, at least, columns called "person_id" and "date"
   # Return rows containing the last entry (by date) for each person.
   last <- function(x) c(x[-1]!=x[-length(x)], TRUE)
   history <- history[with(history, order(person_id,date)),,drop=FALSE]
   history <- history[last(history[,"person_id"]),,drop=FALSE]
   history[order(history$date),,drop=FALSE]
}

person_id date
2       Mary    2
7       Alex    7
9        Sue    9
10       Joe   10

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com

---------------------------------------------------------------
[R] Optimize for loop / find last record for each person

Andrew Ziem ahz001 at gmail.com 
Fri Feb 27 20:02:31 CET 2009

I want to find the last record for each person_id in a data frame
(from a SQL database) ordered by date.  Is there a better way than
this for loop?

for (i in 2:length(history[,1])) {
    if (history[i, "person_id"] == history[i - 1, "person_id"])
      history[i, "order"] = history[i - 1, "order"] + 1 # same person
    else
      history[i, "order"] = 1 # new person
}

# ignore all records except the last for each con_id
history2 <- subset(history, order == 1)


Andrew

Andrew Ziem

Fri, Feb 27, 2009 2:47 PM #

On Fri, Feb 27, 2009 at 2:10 PM, William Dunlap <wdunlap at tibco.com> wrote:

I'll be careful to avoid these mistakes.  Also, I should not have used
a reserved word for the variable history, and I should have mentioned
the data is sorted with the most recent dates first. Talk about a bad
day! :)

Originally I omitted this code before the for loop:

history["order"] <- NA
history[1,"order"] = 1

Here's a sample data set:
history_ <- data.frame(person_id=list(c(1,2,2)),date_=list(c("2009-01-01","2009-02-03","2009-02-02")),
x=list(c(0.01,0.05,0.06)) )
colnames(history_) <- c("person_id", "date_","x")
history_

Jorge's suggestion[1] works for me, and it seems much faster.  I
simply adapted it by replacing Jorge's variable x with a sequential
identifier already in the database.
[1] https://stat.ethz.ch/pipermail/r-help/2009-February/189981.html

My data is already sorted by the SQL database like this
 ORDER BY person_id, date_ DESC

Thanks everyone for responding and expanding my knowledge of R!


Best regards,
Andrew

Jorge Ivan Velez

Fri, Feb 27, 2009 3:03 PM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090227/e20d2f82/attachment-0001.pl>