Read 2 rows in 1 dataframe for diff - longitudinal data - R-help

arun

Mon, Jun 3, 2013 7:10 PM #

Hi,
May be this helps:
res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),]
?res1
#?? subid year var
#3???? 36 2003?? 3
#7???? 47 2001?? 3
#9???? 47 2005?? 1
#10??? 47 2007?? 3
#or
library(plyr)
?subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
#?? subid year var
#3???? 36 2003?? 3
#7???? 47 2001?? 3
#9???? 47 2005?? 1
#10??? 47 2007?? 3
A.K.



I need to output a dataframe whenever var changes a value. 

df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3)) 
? ?subid year var 
1 ? ? 36 1999 ? 1 
2 ? ? 36 2001 ? 1 
3 ? ? 36 2003 ? 3 
4 ? ? 36 2005 ? 3 
5 ? ? 36 2007 ? 3 
6 ? ? 47 1999 ? 1 
7 ? ? 47 2001 ? 3 
8 ? ? 47 2003 ? 3 
9 ? ? 47 2005 ? 1 
10 ? ?47 2007 ? 3

I need: 
36 2003 ? 3 
47 2001 ? 3 
47 2005 ? 1 
47 2007 ? 3 

I am trying to use ddply over subid and use the diff function, but it is not working quiet right.

? subid delta 
1 ? ?36 FALSE 
2 ? ?36 ?TRUE 
3 ? ?36 FALSE 
4 ? ?36 FALSE 
5 ? ?47 ?TRUE 
6 ? ?47 FALSE 
7 ? ?47 ?TRUE 
8 ? ?47 ?TRUE 

I would appreciate any help on this. 
Thank You! 
-ST

David Winsemius

Mon, Jun 3, 2013 9:37 PM #

On Jun 3, 2013, at 7:10 PM, arun wrote:

It's pretty simple with logical indexing:

subid year var
3     36 2003   3
6     47 1999   1
7     47 2001   3
9     47 2005   1
10    47 2007   3


When I count the number of changes in value of var is give me 5. Not sure why you are both leaving out row 6.

David.
> 
> 
> I need to output a dataframe whenever var changes a value. 
> 
> df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3)) 
>    subid year var 
> 1     36 1999   1 
> 2     36 2001   1 
> 3     36 2003   3 
> 4     36 2005   3 
> 5     36 2007   3 
> 6     47 1999   1 
> 7     47 2001   3 
> 8     47 2003   3 
> 9     47 2005   1 
> 10    47 2007   3 
>> 
> 
> I need: 
> 36 2003   3 
> 47 2001   3 
> 47 2005   1 
> 47 2007   3 
> 
> I am trying to use ddply over subid and use the diff function, but it is not working quiet right. 
> 
>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0) 
>> dd 
>   subid delta 
> 1    36 FALSE 
> 2    36  TRUE 
> 3    36 FALSE 
> 4    36 FALSE 
> 5    47  TRUE 
> 6    47 FALSE 
> 7    47  TRUE 
> 8    47  TRUE 
> 
> I would appreciate any help on this. 
> Thank You! 
> -ST
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

arun

Mon, Jun 3, 2013 9:51 PM #

If it is grouped by "subid" (that would be the difference in the number of changes)

subset(ddply(df1,.(subid),mutate,delta=c(FALSE,var[-1]!=var[-length(var)])),delta)[,-4]
#?? subid year var
#3???? 36 2003?? 3
#7???? 47 2001?? 3
#9???? 47 2005?? 1
#10??? 47 2007?? 3
A.K.


----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
To: arun <smartpink111 at yahoo.com>
Cc: R help <r-help at r-project.org>
Sent: Tuesday, June 4, 2013 12:37 AM
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data

On Jun 3, 2013, at 7:10 PM, arun wrote:

It's pretty simple with logical indexing:

?  subid year var
3? ?  36 2003?  3
6? ?  47 1999?  1
7? ?  47 2001?  3
9? ?  47 2005?  1
10? ? 47 2007?  3


When I count the number of changes in value of var is give me 5. Not sure why you are both leaving out row 6.

David.
> 
> 
> I need to output a dataframe whenever var changes a value. 
> 
> df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3)) 
>? ? subid year var 
> 1? ?  36 1999?  1 
> 2? ?  36 2001?  1 
> 3? ?  36 2003?  3 
> 4? ?  36 2005?  3 
> 5? ?  36 2007?  3 
> 6? ?  47 1999?  1 
> 7? ?  47 2001?  3 
> 8? ?  47 2003?  3 
> 9? ?  47 2005?  1 
> 10? ? 47 2007?  3 
>> 
> 
> I need: 
> 36 2003?  3 
> 47 2001?  3 
> 47 2005?  1 
> 47 2007?  3 
> 
> I am trying to use ddply over subid and use the diff function, but it is not working quiet right. 
> 
>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0) 
>> dd 
>?  subid delta 
> 1? ? 36 FALSE 
> 2? ? 36? TRUE 
> 3? ? 36 FALSE 
> 4? ? 36 FALSE 
> 5? ? 47? TRUE 
> 6? ? 47 FALSE 
> 7? ? 47? TRUE 
> 8? ? 47? TRUE 
> 
> I would appreciate any help on this. 
> Thank You! 
> -ST
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius
Alameda, CA, USA

David Winsemius

Tue, Jun 4, 2013 8:10 AM #

On Jun 3, 2013, at 9:51 PM, arun wrote:

Ah. I see. Then this looks simpler to my eyes:

df1[ ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ) , ]

David.


> 
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> To: arun <smartpink111 at yahoo.com>
> Cc: R help <r-help at r-project.org>
> Sent: Tuesday, June 4, 2013 12:37 AM
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> 
> 
> On Jun 3, 2013, at 7:10 PM, arun wrote:
> 
>> Hi,
>> May be this helps:
>> res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),]
>>   res1
>> #   subid year var
>> #3     36 2003   3
>> #7     47 2001   3
>> #9     47 2005   1
>> #10    47 2007   3
>> #or
>> library(plyr)
>>   subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
>> #   subid year var
>> #3     36 2003   3
>> #7     47 2001   3
>> #9     47 2005   1
>> #10    47 2007   3
>> A.K.
>> 
> It's pretty simple with logical indexing:
> 
>> df1[ c(FALSE, df1$var[-1]!=df1$var[-length(df1$var)]), ]
>    subid year var
> 3     36 2003   3
> 6     47 1999   1
> 7     47 2001   3
> 9     47 2005   1
> 10    47 2007   3
> 
> 
> When I count the number of changes in value of var is give me 5. Not sure why you are both leaving out row 6.
> 
> -- 
> David.
>> 
>> 
>> I need to output a dataframe whenever var changes a value. 
>> 
>> df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3)) 
>>     subid year var 
>> 1     36 1999   1 
>> 2     36 2001   1 
>> 3     36 2003   3 
>> 4     36 2005   3 
>> 5     36 2007   3 
>> 6     47 1999   1 
>> 7     47 2001   3 
>> 8     47 2003   3 
>> 9     47 2005   1 
>> 10    47 2007   3 
>>> 
>> 
>> I need: 
>> 36 2003   3 
>> 47 2001   3 
>> 47 2005   1 
>> 47 2007   3 
>> 
>> I am trying to use ddply over subid and use the diff function, but it is not working quiet right. 
>> 
>>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0) 
>>> dd 
>>    subid delta 
>> 1    36 FALSE 
>> 2    36  TRUE 
>> 3    36 FALSE 
>> 4    36 FALSE 
>> 5    47  TRUE 
>> 6    47 FALSE 
>> 7    47  TRUE 
>> 8    47  TRUE 
>> 
>> I would appreciate any help on this. 
>> Thank You! 
>> -ST
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius
> Alameda, CA, USA
> 

David Winsemius
Alameda, CA, USA

David Winsemius

Tue, Jun 4, 2013 8:13 AM #

On Jun 3, 2013, at 9:51 PM, arun wrote:

I'm not sure why the first one retruns integer values from the ave() call but the second version works:

subid year var
1      36 1999   1
1.1    36 1999   1
1.2    36 1999   1
1.3    36 1999   1

ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]))
 [1] 0 0 1 0 0 0 1 0 1 1

Perhaps one of the single item groups sabotaged my simple function.

subid year var
3     36 2003   3
7     47 2001   3
9     47 2005   1
10    47 2007   3

David.
> 
> 
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> To: arun <smartpink111 at yahoo.com>
> Cc: R help <r-help at r-project.org>
> Sent: Tuesday, June 4, 2013 12:37 AM
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> 
> 
> On Jun 3, 2013, at 7:10 PM, arun wrote:
> 
>> Hi,
>> May be this helps:
>> res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),]
>>   res1
>> #   subid year var
>> #3     36 2003   3
>> #7     47 2001   3
>> #9     47 2005   1
>> #10    47 2007   3
>> #or
>> library(plyr)
>>   subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
>> #   subid year var
>> #3     36 2003   3
>> #7     47 2001   3
>> #9     47 2005   1
>> #10    47 2007   3
>> A.K.
>> 
> It's pretty simple with logical indexing:
> 
>> df1[ c(FALSE, df1$var[-1]!=df1$var[-length(df1$var)]), ]
>    subid year var
> 3     36 2003   3
> 6     47 1999   1
> 7     47 2001   3
> 9     47 2005   1
> 10    47 2007   3
> 
> 
> When I count the number of changes in value of var is give me 5. Not sure why you are both leaving out row 6.
> 
> -- 
> David.
>> 
>> 
>> I need to output a dataframe whenever var changes a value. 
>> 
>> df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3)) 
>>     subid year var 
>> 1     36 1999   1 
>> 2     36 2001   1 
>> 3     36 2003   3 
>> 4     36 2005   3 
>> 5     36 2007   3 
>> 6     47 1999   1 
>> 7     47 2001   3 
>> 8     47 2003   3 
>> 9     47 2005   1 
>> 10    47 2007   3 
>>> 
>> 
>> I need: 
>> 36 2003   3 
>> 47 2001   3 
>> 47 2005   1 
>> 47 2007   3 
>> 
>> I am trying to use ddply over subid and use the diff function, but it is not working quiet right. 
>> 
>>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0) 
>>> dd 
>>    subid delta 
>> 1    36 FALSE 
>> 2    36  TRUE 
>> 3    36 FALSE 
>> 4    36 FALSE 
>> 5    47  TRUE 
>> 6    47 FALSE 
>> 7    47  TRUE 
>> 8    47  TRUE 
>> 
>> I would appreciate any help on this. 
>> Thank You! 
>> -ST
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius
> Alameda, CA, USA
> 

David Winsemius
Alameda, CA, USA

arun

Tue, Jun 4, 2013 10:18 AM #

Hi,

By comparing some of the solutions:
?set.seed(25)
?subid<- sample(30:50,22e5,replace=TRUE)
set.seed(27)
year<- sample(1990:2012,22e5,replace=TRUE)
set.seed(35)
?var1<- sample(c(1,3,5,7),22e5,replace=TRUE)
df2<- data.frame(subid,year,var1)
df2<- df2[order(df2$subid,df2$year),]
system.time(res<-subset(ddply(df2,.(subid),mutate,delta=c(FALSE,var1[-1]!=var1[-length(var1)])),delta)[,-4]) 
#? user? system elapsed 
?# 8.036?? 0.132?? 8.188 

system.time(res2<-df2[ as.logical( ave( df2$var1, df2$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ) ), ])
#? user? system elapsed 
?# 1.220?? 0.000?? 1.222 
system.time(res3<-df2[with(df2,unlist(tapply(var1,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),])
#? user? system elapsed 
?# 1.729?? 0.000?? 1.730 
identical(res2,res3)
#[1] TRUE

row.names(res)<-1:nrow(res)
?row.names(res2)<-1:nrow(res)
?identical(res,res2)
#[1] TRUE

I found half an hour a bit too extreme by comparing the above numbers.


A.K.


David: 

6 ? ? 47 1999 ? 1 

should not be included in the output list because, we are trying
 to detect changes within the subid's. ?1999 was the first year for 
subject 47 and changes have to be detected after that year - hence we 
were using ddply to group. Your solution ran very fast as expected. 

AK- I have a large dataset and your solution is taking too long -
 as a matter of fact i had to kill it afte 1/2 hr on a 22K row dataset. 

Thanks for the suggestions. 

-ST 


----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
To: arun <smartpink111 at yahoo.com>
Cc: R help <r-help at r-project.org>
Sent: Tuesday, June 4, 2013 11:13 AM
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data

On Jun 3, 2013, at 9:51 PM, arun wrote:

I'm not sure why the first one retruns integer values from the ave() call but the second version works:

? ? subid year var
1? ? ? 36 1999?  1
1.1? ? 36 1999?  1
1.2? ? 36 1999?  1
1.3? ? 36 1999?  1

ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]))
[1] 0 0 1 0 0 0 1 0 1 1

Perhaps one of the single item groups sabotaged my simple function.

?  subid year var
3? ?  36 2003?  3
7? ?  47 2001?  3
9? ?  47 2005?  1
10? ? 47 2007?  3

David.
> 
> 
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> To: arun <smartpink111 at yahoo.com>
> Cc: R help <r-help at r-project.org>
> Sent: Tuesday, June 4, 2013 12:37 AM
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> 
> 
> On Jun 3, 2013, at 7:10 PM, arun wrote:
> 
>> Hi,
>> May be this helps:
>> res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),]
>>?  res1
>> #?  subid year var
>> #3? ?  36 2003?  3
>> #7? ?  47 2001?  3
>> #9? ?  47 2005?  1
>> #10? ? 47 2007?  3
>> #or
>> library(plyr)
>>?  subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
>> #?  subid year var
>> #3? ?  36 2003?  3
>> #7? ?  47 2001?  3
>> #9? ?  47 2005?  1
>> #10? ? 47 2007?  3
>> A.K.
>> 
> It's pretty simple with logical indexing:
> 
>> df1[ c(FALSE, df1$var[-1]!=df1$var[-length(df1$var)]), ]
>? ? subid year var
> 3? ?  36 2003?  3
> 6? ?  47 1999?  1
> 7? ?  47 2001?  3
> 9? ?  47 2005?  1
> 10? ? 47 2007?  3
> 
> 
> When I count the number of changes in value of var is give me 5. Not sure why you are both leaving out row 6.
> 
> -- 
> David.
>> 
>> 
>> I need to output a dataframe whenever var changes a value. 
>> 
>> df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3)) 
>>? ?  subid year var 
>> 1? ?  36 1999?  1 
>> 2? ?  36 2001?  1 
>> 3? ?  36 2003?  3 
>> 4? ?  36 2005?  3 
>> 5? ?  36 2007?  3 
>> 6? ?  47 1999?  1 
>> 7? ?  47 2001?  3 
>> 8? ?  47 2003?  3 
>> 9? ?  47 2005?  1 
>> 10? ? 47 2007?  3 
>>> 
>> 
>> I need: 
>> 36 2003?  3 
>> 47 2001?  3 
>> 47 2005?  1 
>> 47 2007?  3 
>> 
>> I am trying to use ddply over subid and use the diff function, but it is not working quiet right. 
>> 
>>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0) 
>>> dd 
>>? ? subid delta 
>> 1? ? 36 FALSE 
>> 2? ? 36? TRUE 
>> 3? ? 36 FALSE 
>> 4? ? 36 FALSE 
>> 5? ? 47? TRUE 
>> 6? ? 47 FALSE 
>> 7? ? 47? TRUE 
>> 8? ? 47? TRUE 
>> 
>> I would appreciate any help on this. 
>> Thank You! 
>> -ST
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius
> Alameda, CA, USA
> 

David Winsemius
Alameda, CA, USA

arun

Tue, Jun 4, 2013 12:25 PM #

HI ST,

In case, you wanted to further decrease the time:
library(data.table)
dt1<- data.table(df2) #using the same example as below
system.time({
?dt1<-dt1[,indx:=c(FALSE,diff(var1)!=0),by=subid]
res3<-subset(dt1,indx,select=1:3)
})
# user? system elapsed 
#?? 0.32??? 0.00??? 0.32 
?head(res3)
#?? subid year var1
#1:??? 30 1990??? 7
#2:??? 30 1990??? 1
#3:??? 30 1990??? 5
#4:??? 30 1990??? 7
#5:??? 30 1990??? 5
#6:??? 30 1990??? 7
?head(res2)
#? subid year var1
#1??? 30 1990??? 7
#2??? 30 1990??? 1
#3??? 30 1990??? 5
#4??? 30 1990??? 7
#5??? 30 1990??? 5
#6??? 30 1990??? 7


Since you mentioned this > half-hour running time, it would be good to check your data.? 

?str()
A.K.



----- Original Message -----
From: arun <smartpink111 at yahoo.com>
To: R help <r-help at r-project.org>
Cc: David Winsemius <dwinsemius at comcast.net>
Sent: Tuesday, June 4, 2013 1:18 PM
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data



Hi,

By comparing some of the solutions:
?set.seed(25)
?subid<- sample(30:50,22e5,replace=TRUE)
set.seed(27)
year<- sample(1990:2012,22e5,replace=TRUE)
set.seed(35)
?var1<- sample(c(1,3,5,7),22e5,replace=TRUE)
df2<- data.frame(subid,year,var1)
df2<- df2[order(df2$subid,df2$year),]
system.time(res<-subset(ddply(df2,.(subid),mutate,delta=c(FALSE,var1[-1]!=var1[-length(var1)])),delta)[,-4]) 
#? user? system elapsed 
?# 8.036?? 0.132?? 8.188 

system.time(res2<-df2[ as.logical( ave( df2$var1, df2$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ) ), ])
#? user? system elapsed 
?# 1.220?? 0.000?? 1.222 
system.time(res3<-df2[with(df2,unlist(tapply(var1,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),])
#? user? system elapsed 
?# 1.729?? 0.000?? 1.730 
identical(res2,res3)
#[1] TRUE

row.names(res)<-1:nrow(res)
?row.names(res2)<-1:nrow(res)
?identical(res,res2)
#[1] TRUE

I found half an hour a bit too extreme by comparing the above numbers.


A.K.


David: 

6 ? ? 47 1999 ? 1 

should not be included in the output list because, we are trying
to detect changes within the subid's. ?1999 was the first year for 
subject 47 and changes have to be detected after that year - hence we 
were using ddply to group. Your solution ran very fast as expected. 

AK- I have a large dataset and your solution is taking too long -
as a matter of fact i had to kill it afte 1/2 hr on a 22K row dataset. 

Thanks for the suggestions. 

-ST 


----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
To: arun <smartpink111 at yahoo.com>
Cc: R help <r-help at r-project.org>
Sent: Tuesday, June 4, 2013 11:13 AM
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data

On Jun 3, 2013, at 9:51 PM, arun wrote:

I'm not sure why the first one retruns integer values from the ave() call but the second version works:

? ? subid year var
1? ? ? 36 1999?? 1
1.1? ? 36 1999?? 1
1.2? ? 36 1999?? 1
1.3? ? 36 1999?? 1

ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]))
[1] 0 0 1 0 0 0 1 0 1 1

Perhaps one of the single item groups sabotaged my simple function.

?? subid year var
3? ?? 36 2003?? 3
7? ?? 47 2001?? 3
9? ?? 47 2005?? 1
10? ? 47 2007?? 3

David.
> 
> 
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> To: arun <smartpink111 at yahoo.com>
> Cc: R help <r-help at r-project.org>
> Sent: Tuesday, June 4, 2013 12:37 AM
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> 
> 
> On Jun 3, 2013, at 7:10 PM, arun wrote:
> 
>> Hi,
>> May be this helps:
>> res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),]
>>?? res1
>> #?? subid year var
>> #3? ?? 36 2003?? 3
>> #7? ?? 47 2001?? 3
>> #9? ?? 47 2005?? 1
>> #10? ? 47 2007?? 3
>> #or
>> library(plyr)
>>?? subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
>> #?? subid year var
>> #3? ?? 36 2003?? 3
>> #7? ?? 47 2001?? 3
>> #9? ?? 47 2005?? 1
>> #10? ? 47 2007?? 3
>> A.K.
>> 
> It's pretty simple with logical indexing:
> 
>> df1[ c(FALSE, df1$var[-1]!=df1$var[-length(df1$var)]), ]
>? ? subid year var
> 3? ?? 36 2003?? 3
> 6? ?? 47 1999?? 1
> 7? ?? 47 2001?? 3
> 9? ?? 47 2005?? 1
> 10? ? 47 2007?? 3
> 
> 
> When I count the number of changes in value of var is give me 5. Not sure why you are both leaving out row 6.
> 
> -- 
> David.
>> 
>> 
>> I need to output a dataframe whenever var changes a value. 
>> 
>> df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3)) 
>>? ?? subid year var 
>> 1? ?? 36 1999?? 1 
>> 2? ?? 36 2001?? 1 
>> 3? ?? 36 2003?? 3 
>> 4? ?? 36 2005?? 3 
>> 5? ?? 36 2007?? 3 
>> 6? ?? 47 1999?? 1 
>> 7? ?? 47 2001?? 3 
>> 8? ?? 47 2003?? 3 
>> 9? ?? 47 2005?? 1 
>> 10? ? 47 2007?? 3 
>>> 
>> 
>> I need: 
>> 36 2003?? 3 
>> 47 2001?? 3 
>> 47 2005?? 1 
>> 47 2007?? 3 
>> 
>> I am trying to use ddply over subid and use the diff function, but it is not working quiet right. 
>> 
>>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0) 
>>> dd 
>>? ? subid delta 
>> 1? ? 36 FALSE 
>> 2? ? 36? TRUE 
>> 3? ? 36 FALSE 
>> 4? ? 36 FALSE 
>> 5? ? 47? TRUE 
>> 6? ? 47 FALSE 
>> 7? ? 47? TRUE 
>> 8? ? 47? TRUE 
>> 
>> I would appreciate any help on this. 
>> Thank You! 
>> -ST
>> 
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
> 
> David Winsemius
> Alameda, CA, USA
> 

David Winsemius
Alameda, CA, USA

William Dunlap

Tue, Jun 4, 2013 12:25 PM #

Since you have sorted the data.frame by 'subid', breaking ties with 'year',
doesn't the following do the same thing as the other solutions.
  f4 <- function(df) df[ c(TRUE,diff(df$var1)!=0) & c(FALSE,diff(df$subid)==0), ]
It gives the same answer for your df2 and is quicker than the others.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
Of arun
Sent: Tuesday, June 04, 2013 10:19 AM
To: R help
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data



Hi,

By comparing some of the solutions:
?set.seed(25)
?subid<- sample(30:50,22e5,replace=TRUE)
set.seed(27)
year<- sample(1990:2012,22e5,replace=TRUE)
set.seed(35)
?var1<- sample(c(1,3,5,7),22e5,replace=TRUE)
df2<- data.frame(subid,year,var1)
df2<- df2[order(df2$subid,df2$year),]
system.time(res<-subset(ddply(df2,.(subid),mutate,delta=c(FALSE,var1[-1]!=var1[-
length(var1)])),delta)[,-4])
#? user? system elapsed
?# 8.036?? 0.132?? 8.188

system.time(res2<-df2[ as.logical( ave( df2$var1, df2$subid, FUN=function(x) c( FALSE,
x[-1] != x[-length(x)]) ) ), ])
#? user? system elapsed
?# 1.220?? 0.000?? 1.222
system.time(res3<-df2[with(df2,unlist(tapply(var1,list(subid),FUN=function(x)
c(FALSE,diff(x)!=0)),use.names=FALSE)),])
#? user? system elapsed
?# 1.729?? 0.000?? 1.730
identical(res2,res3)
#[1] TRUE

row.names(res)<-1:nrow(res)
?row.names(res2)<-1:nrow(res)
?identical(res,res2)
#[1] TRUE

I found half an hour a bit too extreme by comparing the above numbers.


A.K.


David:

6 ? ? 47 1999 ? 1

should not be included in the output list because, we are trying
 to detect changes within the subid's. ?1999 was the first year for
subject 47 and changes have to be detected after that year - hence we
were using ddply to group. Your solution ran very fast as expected.

AK- I have a large dataset and your solution is taking too long -
 as a matter of fact i had to kill it afte 1/2 hr on a 22K row dataset.

Thanks for the suggestions.

-ST


----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
To: arun <smartpink111 at yahoo.com>
Cc: R help <r-help at r-project.org>
Sent: Tuesday, June 4, 2013 11:13 AM
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data


On Jun 3, 2013, at 9:51 PM, arun wrote:

If it is grouped by "subid" (that would be the difference in the number of changes)

subset(ddply(df1,.(subid),mutate,delta=c(FALSE,var[-1]!=var[-length(var)])),delta)[,-4]
#?  subid year var
#3? ?  36 2003?  3
#7? ?  47 2001?  3
#9? ?  47 2005?  1
#10? ? 47 2007?  3
A.K.

I'm not sure why the first one retruns integer values from the ave() call but the second
version works:

df1[ ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ), ]

? ? subid year var
1? ? ? 36 1999?  1
1.1? ? 36 1999?  1
1.2? ? 36 1999?  1
1.3? ? 36 1999?  1

ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]))
[1] 0 0 1 0 0 0 1 0 1 1

Perhaps one of the single item groups sabotaged my simple function.

df1[ as.logical( ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)])

) ), ]
?  subid year var
3? ?  36 2003?  3
7? ?  47 2001?  3
9? ?  47 2005?  1
10? ? 47 2007?  3

--
David.


----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
To: arun <smartpink111 at yahoo.com>
Cc: R help <r-help at r-project.org>
Sent: Tuesday, June 4, 2013 12:37 AM
Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data


On Jun 3, 2013, at 7:10 PM, arun wrote:

Hi,
May be this helps:
res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x)

c(FALSE,diff(x)!=0)),use.names=FALSE)),]

?  res1
#?  subid year var
#3? ?  36 2003?  3
#7? ?  47 2001?  3
#9? ?  47 2005?  1
#10? ? 47 2007?  3
#or
library(plyr)
?  subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
#?  subid year var
#3? ?  36 2003?  3
#7? ?  47 2001?  3
#9? ?  47 2005?  1
#10? ? 47 2007?  3
A.K.

It's pretty simple with logical indexing:

df1[ c(FALSE, df1$var[-1]!=df1$var[-length(df1$var)]), ]

? ? subid year var
3? ?  36 2003?  3
6? ?  47 1999?  1
7? ?  47 2001?  3
9? ?  47 2005?  1
10? ? 47 2007?  3


When I count the number of changes in value of var is give me 5. Not sure why you are

both leaving out row 6.

--
David.


I need to output a dataframe whenever var changes a value.

df1 <-

data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3
,3,1,3))

? ?  subid year var
1? ?  36 1999?  1
2? ?  36 2001?  1
3? ?  36 2003?  3
4? ?  36 2005?  3
5? ?  36 2007?  3
6? ?  47 1999?  1
7? ?  47 2001?  3
8? ?  47 2003?  3
9? ?  47 2005?  1
10? ? 47 2007?  3

I need:
36 2003?  3
47 2001?  3
47 2005?  1
47 2007?  3

I am trying to use ddply over subid and use the diff function, but it is not working quiet

right.

dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0)
dd

? ? subid delta
1? ? 36 FALSE
2? ? 36? TRUE
3? ? 36 FALSE
4? ? 36 FALSE
5? ? 47? TRUE
6? ? 47 FALSE
7? ? 47? TRUE
8? ? 47? TRUE

I would appreciate any help on this.
Thank You!
-ST

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.