Message-ID: <E66794E69CFDE04D9A70842786030B931C2FCF8F@PA-MBX01.na.tibco.com>
Date: 2013-06-04T19:25:47Z
From: William Dunlap
Subject: Read 2 rows in 1 dataframe for diff - longitudinal data
In-Reply-To: <1370366315.67454.YahooMailNeo@web142601.mail.bf1.yahoo.com>

Since you have sorted the data.frame by 'subid', breaking ties with 'year',
doesn't the following do the same thing as the other solutions?
  f4 <- function(df) df[ c(TRUE,diff(df$var1)!=0) & c(FALSE,diff(df$subid)==0), ]
It gives the same answer for your df2 and is quicker than the others.
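
As a quick sanity check (a sketch, not part of the timing run in the quoted message), f4 can be applied to the small df1 from the original question. The 'var' column is renamed to 'var1' here, an assumption made only so that f4 applies unchanged. The shortcut relies on the rows being sorted by 'subid' and on there being no NA values; under those assumptions it should select rows 3, 7, 9 and 10, the same rows as the grouped ave() solution.

```r
# Example data from the original question; the column is called 'var'
# there and is renamed 'var1' here (assumption) to match f4.
df1 <- data.frame(subid = rep(c(36, 47), each = 5),
                  year  = rep(seq(1999, 2007, 2), 2),
                  var1  = c(1, 1, 3, 3, 3, 1, 3, 3, 1, 3))

# Keep a row when var1 differs from the previous row AND the previous row
# has the same subid, so the first row of each subject is never selected.
f4 <- function(df) df[ c(TRUE, diff(df$var1) != 0) & c(FALSE, diff(df$subid) == 0), ]

res4 <- f4(df1)
print(res4)

# Grouped ave() version from the quoted thread, for comparison.
res2 <- df1[ as.logical(ave(df1$var1, df1$subid,
                            FUN = function(x) c(FALSE, x[-1] != x[-length(x)]))), ]
stopifnot(identical(res4, res2))
```

The second condition, c(FALSE, diff(df$subid) == 0), is what stands in for the grouping: it is FALSE on the first row of every subject, so a change in var1 across a subject boundary (row 6 in df1) is not reported.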

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of arun
> Sent: Tuesday, June 04, 2013 10:19 AM
> To: R help
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> 
> 
> 
> Hi,
> 
> By comparing some of the solutions:
> library(plyr)
> set.seed(25)
> subid<- sample(30:50,22e5,replace=TRUE)
> set.seed(27)
> year<- sample(1990:2012,22e5,replace=TRUE)
> set.seed(35)
> var1<- sample(c(1,3,5,7),22e5,replace=TRUE)
> df2<- data.frame(subid,year,var1)
> df2<- df2[order(df2$subid,df2$year),]
> system.time(res<-subset(ddply(df2,.(subid),mutate,delta=c(FALSE,var1[-1]!=var1[-length(var1)])),delta)[,-4])
> #   user  system elapsed
> #  8.036   0.132   8.188
> 
> system.time(res2<-df2[ as.logical( ave( df2$var1, df2$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ) ), ])
> #   user  system elapsed
> #  1.220   0.000   1.222
> system.time(res3<-df2[with(df2,unlist(tapply(var1,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),])
> #   user  system elapsed
> #  1.729   0.000   1.730
> identical(res2,res3)
> #[1] TRUE
> 
> row.names(res)<-1:nrow(res)
> row.names(res2)<-1:nrow(res2)
> identical(res,res2)
> #[1] TRUE
> 
> Judging by the timings above (on 2.2 million rows), half an hour for a 22K-row dataset seems too extreme.
> 
> 
> A.K.
> 
> 
> David:
> 
> 6     47 1999   1
> 
> should not be included in the output list because we are trying
> to detect changes within each subid. 1999 was the first year for
> subject 47, and changes have to be detected after that year; hence we
> were using ddply to group. Your solution ran very fast, as expected.
> 
> AK - I have a large dataset and your solution is taking too long;
> as a matter of fact, I had to kill it after 1/2 hr on a 22K-row dataset.
> 
> Thanks for the suggestions.
> 
> -ST
> 
> 
> ----- Original Message -----
> From: David Winsemius <dwinsemius at comcast.net>
> To: arun <smartpink111 at yahoo.com>
> Cc: R help <r-help at r-project.org>
> Sent: Tuesday, June 4, 2013 11:13 AM
> Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> 
> 
> On Jun 3, 2013, at 9:51 PM, arun wrote:
> 
> > If it is grouped by "subid" (that would be the difference in the number of changes)
> >
> > subset(ddply(df1,.(subid),mutate,delta=c(FALSE,var[-1]!=var[-length(var)])),delta)[,-4]
> > #   subid year var
> > #3     36 2003   3
> > #7     47 2001   3
> > #9     47 2005   1
> > #10    47 2007   3
> > A.K.
> 
> I'm not sure why the first one returns integer values from the ave() call but the second
> version works:
> 
> > df1[ ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ), ]
>     subid year var
> 1      36 1999   1
> 1.1    36 1999   1
> 1.2    36 1999   1
> 1.3    36 1999   1
> 
> ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]))
> [1] 0 0 1 0 0 0 1 0 1 1
> 
> Perhaps one of the single item groups sabotaged my simple function.
> 
> 
> > df1[ as.logical( ave( df1$var, df1$subid, FUN=function(x) c( FALSE, x[-1] != x[-length(x)]) ) ), ]
>    subid year var
> 3     36 2003   3
> 7     47 2001   3
> 9     47 2005   1
> 10    47 2007   3
> 
> --
> David.
> >
> >
> > ----- Original Message -----
> > From: David Winsemius <dwinsemius at comcast.net>
> > To: arun <smartpink111 at yahoo.com>
> > Cc: R help <r-help at r-project.org>
> > Sent: Tuesday, June 4, 2013 12:37 AM
> > Subject: Re: [R] Read 2 rows in 1 dataframe for diff - longitudinal data
> >
> >
> > On Jun 3, 2013, at 7:10 PM, arun wrote:
> >
> >> Hi,
> >> May be this helps:
> >> res1<-df1[with(df1,unlist(tapply(var,list(subid),FUN=function(x) c(FALSE,diff(x)!=0)),use.names=FALSE)),]
> >>  res1
> >> #   subid year var
> >> #3     36 2003   3
> >> #7     47 2001   3
> >> #9     47 2005   1
> >> #10    47 2007   3
> >> #or
> >> library(plyr)
> >>  subset(ddply(df1,.(subid),mutate,delta=c(FALSE,diff(var)!=0)),delta)[,-4]
> >> #   subid year var
> >> #3     36 2003   3
> >> #7     47 2001   3
> >> #9     47 2005   1
> >> #10    47 2007   3
> >> A.K.
> >>
> > It's pretty simple with logical indexing:
> >
> >> df1[ c(FALSE, df1$var[-1]!=df1$var[-length(df1$var)]), ]
> >    subid year var
> > 3     36 2003   3
> > 6     47 1999   1
> > 7     47 2001   3
> > 9     47 2005   1
> > 10    47 2007   3
> >
> >
> > When I count the number of changes in the value of var, it gives me 5. Not sure why you are
> > both leaving out row 6.
> >
> > --
> > David.
> >>
> >>
> >> I need to output a dataframe whenever var changes a value.
> >>
> >> df1 <- data.frame(subid=rep(c(36,47),each=5),year=rep(seq(1999,2007,2),2),var=c(1,1,3,3,3,1,3,3,1,3))
> >>    subid year var
> >> 1     36 1999   1
> >> 2     36 2001   1
> >> 3     36 2003   3
> >> 4     36 2005   3
> >> 5     36 2007   3
> >> 6     47 1999   1
> >> 7     47 2001   3
> >> 8     47 2003   3
> >> 9     47 2005   1
> >> 10    47 2007   3
> >>>
> >>
> >> I need:
> >> 36 2003   3
> >> 47 2001   3
> >> 47 2005   1
> >> 47 2007   3
> >>
> >> I am trying to use ddply over subid and use the diff function, but it is not working quite
> >> right.
> >>
> >>> dd <- ddply(df1,.(subid),summarize,delta=diff(var) != 0)
> >>> dd
> >>   subid delta
> >> 1    36 FALSE
> >> 2    36  TRUE
> >> 3    36 FALSE
> >> 4    36 FALSE
> >> 5    47  TRUE
> >> 6    47 FALSE
> >> 7    47  TRUE
> >> 8    47  TRUE
> >>
> >> I would appreciate any help on this.
> >> Thank You!
> >> -ST
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >
> > David Winsemius
> > Alameda, CA, USA
> >
> 
> David Winsemius
> Alameda, CA, USA
> 
> 