Read 2 rows in 1 dataframe for diff - longitudinal data

Hi ST,

In case you want to reduce the run time further:
library(data.table)
dt1<- data.table(df2) #using the same example as below
system.time({
 dt1<-dt1[,indx:=c(FALSE,diff(var1)!=0),by=subid]
res3<-subset(dt1,indx,select=1:3)
})
#   user  system elapsed 
#   0.32    0.00    0.32 
 head(res3)
#    subid year var1
#1:    30 1990    7
#2:    30 1990    1
#3:    30 1990    5
#4:    30 1990    7
#5:    30 1990    5
#6:    30 1990    7
 head(res2)
#   subid year var1
#1    30 1990    7
#2    30 1990    1
#3    30 1990    5
#4    30 1990    7
#5    30 1990    5
#6    30 1990    7

Since you mentioned a running time of more than half an hour, it would be good to check the structure of your data with:

 str()
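
As a rough illustration of what such a check might reveal, here is a minimal sketch on a made-up data frame (`chk` and its values are hypothetical, not the poster's data). It only shows how to see whether a column is stored as something other than plain numeric; whether that is what slows down the real dataset is an assumption.

```r
# Hypothetical stand-in for the real data: var1 is deliberately a factor
chk <- data.frame(
  subid = c(30, 30, 31),
  year  = c(1990, 1991, 1990),
  var1  = factor(c(1, 3, 1))
)

# Compact overview of the column types and first values
str(chk)

# Class of each column as a named character vector
col_classes <- sapply(chk, class)
col_classes

# If var1 turns out to be a factor, convert via character to keep the values
chk$var1 <- as.numeric(as.character(chk$var1))
```

The conversion goes through as.character() because as.numeric() applied directly to a factor returns the internal level codes rather than the original values.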
A.K.

----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: R help <r-help at r-project.org>
Cc: David Winsemius <dwinsemius at comcast.net>
Sent: Tuesday, June 4, 2013 1:18 PM
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data

Hi,

Comparing some of the solutions:
library(plyr) #needed for ddply()
 set.seed(25)
 subid<- sample(30:50,22e5,replace=TRUE)
set.seed(27)
year<- sample(1990:2012,22e5,replace=TRUE)
set.seed(35)
 var1<- sample(c(1,3,5,7),22e5,replace=TRUE)
df2<- data.frame(subid,year,var1)
df2<- df2[order(df2$subid,df2$year),]
system.time(res<-subset(ddply(df2,.(subid),mutate,delta=c(FALSE,var1[-1]!=var1[-length(var1)])),delta)[,-4]) 
#   user  system elapsed 
#  8.036   0.132   8.188 

system.time(res2<-df2[ as.logical( ave( df2$var1, df2$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ) ), ])
#   user  system elapsed 
#  1.220   0.000   1.222 
system.time(res3<-df2[with(df2,unlist(tapply(var1,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),])
#   user  system elapsed 
#  1.729   0.000   1.730 
identical(res2,res3)
#[1] TRUE

#reset the row names so that the ddply result can be compared
row.names(res)<-1:nrow(res)
 row.names(res2)<-1:nrow(res2)
 identical(res,res2)
#[1] TRUE

Judging by the timings above, half an hour seems far too long.

A.K.

David: 

6     47 1999    1

should not be included in the output list because we are trying
to detect changes within the subids. 1999 was the first year for
subject 47, and changes have to be detected after that year; hence we
were using ddply to group. Your solution ran very fast, as expected.
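
The grouping requirement described above can be sketched on a small made-up example (the data frame `toy` and its values are hypothetical). It uses the same ave()/diff() idea as the solutions in this thread: the first row of each subid is never flagged, even when its value differs from the last row of the previous subject.

```r
# Hypothetical toy data: two subjects, already sorted by subid and year
toy <- data.frame(
  subid = c(46, 46, 46, 47, 47, 47),
  year  = c(1998, 1999, 2000, 1999, 2000, 2001),
  var1  = c(3, 3, 5, 1, 1, 7)
)

# Within each subid, flag rows whose var1 differs from the previous row;
# the leading FALSE keeps the first row of every subject out of the result
changed <- as.logical(ave(toy$var1, toy$subid,
                          FUN = function(x) c(FALSE, diff(x) != 0)))

# Ungrouped comparison, shown only for contrast: it also flags row 4,
# the first year of subject 47
changed_ungrouped <- c(FALSE, diff(toy$var1) != 0)

toy[changed, ]
```

Here the grouped version keeps rows 3 and 6 only, while the ungrouped comparison would additionally keep row 4 (subject 47 in 1999).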

AK- I have a large dataset and your solution is taking too long.
As a matter of fact, I had to kill it after half an hour on a 22K-row dataset.

Thanks for the suggestions. 

-ST 

----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
To: arun <smartpink111 at yahoo.com>
Cc: R help <r-help at r-project.org>
Sent: Tuesday, June 4, 2013 11:13 AM
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
