Andrew, it makes it easier to help if you supply a typical
input and expected output along with your code. I tried
your code with the following input:
history
person_id date
1 Mary 1
2 Mary 2
3 Sue 3
4 Alex 4
5 Joe 5
6 Alex 6
7 Alex 7
8 Sue 8
9 Sue 9
10 Joe 10
made with the function
f<-function(n){
cached.rs <- .Random.seed
on.exit(.Random.seed<<-cached.rs)
set.seed(1)
data.frame(person_id=sample(c("Joe","Mary","Sue","Alex"),
size=n,replace=TRUE), date=seq_len(n))
}
and it failed because there was no column called 'order'.
The following function, f2, does what I think you are saying
you want. It sorts the data by person_id, breaking ties with
date, and then selects the rows where the person_id entry does
not match the person_id entry in the next row. It then sorts
the result by date. (I don't know if the last sort it needed
in your application.) It should be pretty quick for long
datasets with lots of distinct person_id values.
f2 <-function (history)
{
# assume history has, at least, columns called "person_id" and "date"
# Return rows containing the last entry (by date) for each person.
last <- function(x) c(x[-1]!=x[-length(x)], TRUE)
history <- history[with(history, order(person_id,date)),,drop=FALSE]
history <- history[last(history[,"person_id"]),,drop=FALSE]
history[order(history$date),,drop=FALSE]
}
f2(history)
person_id date
2 Mary 2
7 Alex 7
9 Sue 9
10 Joe 10
Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
---------------------------------------------------------------
[R] Optimize for loop / find last record for each person
Andrew Ziem ahz001 at gmail.com
Fri Feb 27 20:02:31 CET 2009
I want to find the last record for each person_id in a data frame
(from a SQL database) ordered by date. Is there a better way than
this for loop?
for (i in 2:length(history[,1])) {
if (history[i, "person_id"] == history[i - 1, "person_id"])
history[i, "order"] = history[i - 1, "order"] + 1 # same person
else
history[i, "order"] = 1 # new person
}
# ignore all records except the last for each con_id
history2 <- subset(history, order == 1)
Andrew
On Fri, Feb 27, 2009 at 2:10 PM, William Dunlap <wdunlap at tibco.com> wrote:
Andrew, it makes it easier to help if you supply a typical
input and expected output along with your code. ?I tried
your code with the following input:
I'll be careful to avoid these mistakes. Also, I should not have used
a reserved word for the variable history, and I should have mentioned
the data is sorted with the most recent dates first. Talk about a bad
day! :)
Originally I omitted this code before the for loop:
history["order"] <- NA
history[1,"order"] = 1
Here's a sample data set:
history_ <- data.frame(person_id=list(c(1,2,2)),date_=list(c("2009-01-01","2009-02-03","2009-02-02")),
x=list(c(0.01,0.05,0.06)) )
colnames(history_) <- c("person_id", "date_","x")
history_
Jorge's suggestion[1] works for me, and it seems much faster. I
simply adapted it by replacing Jorge's variable x with a sequential
identifier already in the database.
[1] https://stat.ethz.ch/pipermail/r-help/2009-February/189981.html
The following function, f2, does what I think you are saying
you want. It sorts the data by person_id, breaking ties with
date, and then selects the rows where the person_id entry does
My data is already sorted by the SQL database like this
ORDER BY person_id, date_ DESC
Thanks everyone for responding and expanding my knowledge of R!
Best regards,
Andrew