
Creating a new by variable in a dataframe

11 messages · ramoss, William Dunlap, Flavio Barros +1 more

#
Hello,

I have a dataframe w/ 3 variables of interest: transaction,date(tdate) &
time(event_tim).
How could I create a 4th variable (last_trans) that would flag the last
transaction of the day for each day?
In SAS I use:
proc sort data=all6;
by tdate event_tim;
run;
         /*Create last transaction flag per day*/
data all6;
  set all6;
  by tdate event_tim;
  last_trans=last.tdate;

Thanks ahead for any suggestions.



--
View this message in context: http://r.789695.n4.nabble.com/Creating-a-new-by-variable-in-a-dataframe-tp4646782.html
Sent from the R help mailing list archive at Nabble.com.
#
Suppose your data frame is
d <- data.frame(
     stringsAsFactors = FALSE,
     transaction = c("T01", "T02", "T03", "T04", "T05", "T06", 
        "T07", "T08", "T09", "T10"),
     date = c("2012-10-19", "2012-10-19", "2012-10-19", 
        "2012-10-19", "2012-10-22", "2012-10-23", 
        "2012-10-23", "2012-10-23", "2012-10-23", 
        "2012-10-23"),
     time = c("08:00", "09:00", "10:00", "11:00", "12:00", 
        "13:00", "14:00", "15:00", "16:00", "17:00"
        ))
(Convert the date and time to your favorite classes; it doesn't matter here.)

A general way to say if an item is the last of its group is:
  isLastInGroup <- function(...)  ave(logical(length(..1)), ..., FUN=function(x)seq_along(x)==length(x))
  is_last_of_dayA <- with(d, isLastInGroup(date))
If you know your data is sorted by date you could save a little time for large
datasets by using
  isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
  is_last_of_dayB <- isLastInRun(d$date)
The above d is sorted by date so you get the same results for both:
  > cbind(d, is_last_of_dayA, is_last_of_dayB)
     transaction       date  time is_last_of_dayA is_last_of_dayB
  1          T01 2012-10-19 08:00           FALSE           FALSE
  2          T02 2012-10-19 09:00           FALSE           FALSE
  3          T03 2012-10-19 10:00           FALSE           FALSE
  4          T04 2012-10-19 11:00            TRUE            TRUE
  5          T05 2012-10-22 12:00            TRUE            TRUE
  6          T06 2012-10-23 13:00           FALSE           FALSE
  7          T07 2012-10-23 14:00           FALSE           FALSE
  8          T08 2012-10-23 15:00           FALSE           FALSE
  9          T09 2012-10-23 16:00           FALSE           FALSE
  10         T10 2012-10-23 17:00            TRUE            TRUE
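
For comparison with the SAS last.tdate idiom, here is a minimal base R sketch. It is not from the post above, and it assumes the flag should mark the last row of each date once the rows are sorted by date and time. The column name last_trans comes from the original question; the data are the same example d, rebuilt with sprintf() and rep() for brevity.

```r
# Same example data as above, rebuilt compactly
d <- data.frame(
  stringsAsFactors = FALSE,
  transaction = sprintf("T%02d", 1:10),
  date = rep(c("2012-10-19", "2012-10-22", "2012-10-23"), times = c(4, 1, 5)),
  time = sprintf("%02d:00", 8:17)
)

# Sort by date, then time (the role PROC SORT plays in the SAS version)
d <- d[order(d$date, d$time), ]

# duplicated(..., fromLast = TRUE) is FALSE only for the last occurrence
# of each date, so negating it flags the last row of every day
d$last_trans <- !duplicated(d$date, fromLast = TRUE)
```

Like isLastInGroup(), this does not need the dates to be contiguous, but it does need the rows within each date to be in time order.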


Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Hi,

Maybe this helps you:

dat1<-read.table(text="
tdate  event_tim  transaction
1/10/2012   2   14
1/10/2012   4   28
1/10/2012   6   42
1/10/2012   8   14
2/10/2012   6   46
2/10/2012   9   64
2/10/2012   8   71
3/10/2012  3   85
3/10/2012   1   14
3/10/2012   4   28
9/10/2012   5   51
9/10/2012   9   66
9/20/2012  12   84
",sep="",header=TRUE,stringsAsFactors=FALSE)
dat2<-dat1[with(dat1,order(tdate,event_tim)),]
dat2$tdate<-as.Date(dat2$tdate,format="%m/%d/%Y")
dat3<-dat2
dat3$last_trans<-NA
library(plyr)
dat4<-merge(dat3,ddply(dat2,.(tdate),tail,1))
dat4$last_trans<-dat4$transaction
res<-merge(dat4,dat2,all=TRUE)
res
#        tdate event_tim transaction last_trans
#1  2012-01-10         2          14         NA
#2  2012-01-10         4          28         NA
#3  2012-01-10         6          42         NA
#4  2012-01-10         8          14         14
#5  2012-02-10         6          46         NA
#6  2012-02-10         8          71         NA
#7  2012-02-10         9          64         64
#8  2012-03-10         1          14         NA
#9  2012-03-10         3          85         NA
#10 2012-03-10         4          28         28
#11 2012-09-10         5          51         NA
#12 2012-09-10         9          66         66
#13 2012-09-20        12          84         84





----- Original Message -----
From: ramoss <ramine.mossadegh at finra.org>
To: r-help at r-project.org
Cc: 
Sent: Friday, October 19, 2012 1:51 PM
Subject: [R] Creating a new by variable in a dataframe

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
Thanks for all the help guys.

This worked for me:

library(plyr)  # arrange() and ddply() come from plyr
all6 <- arrange(all6, tdate, event_tim)
lt <- ddply(all6, .(tdate), tail, 1)
lt$last_trans <- 'Y'
all6 <- merge(all6, lt, by.x=c("tdate","event_tim"),
by.y=c("tdate","event_tim"), all.x=TRUE)




#
Hi,

In addition to merge(), you can also use join()
dat1<-read.table(text="
tdate  event_tim  transaction
1/10/2012   2   14
1/10/2012   4   28
1/10/2012   6   42
1/10/2012   8   14
2/10/2012   6   46
2/10/2012   9   64
2/10/2012   8   71
3/10/2012  3   85
3/10/2012   1   14
3/10/2012   4   28
9/10/2012   5   51
9/10/2012   9   66
9/20/2012  12   84
",sep="",header=TRUE,stringsAsFactors=FALSE)
dat2<-dat1[with(dat1,order(tdate,event_tim)),]
aggres<-aggregate(dat2[,-1],by=list(tdate=dat2$tdate),tail,1)
aggres$last_trans<-"Y"
library(plyr)

join(dat2,aggres,by=intersect(names(dat2),names(aggres)),type="full")
#       tdate event_tim transaction last_trans
#1  1/10/2012         2          14       <NA>
#2  1/10/2012         4          28       <NA>
#3  1/10/2012         6          42       <NA>
#4  1/10/2012         8          14          Y
#5  2/10/2012         6          46       <NA>
#6  2/10/2012         8          71       <NA>
#7  2/10/2012         9          64          Y
#8  3/10/2012         1          14       <NA>
#9  3/10/2012         3          85       <NA>
#10 3/10/2012         4          28          Y
#11 9/10/2012         5          51       <NA>
#12 9/10/2012         9          66          Y
#13 9/20/2012        12          84          Y


A.K.

#
Hi,
Here is a version that avoids ifelse(), using the same example dataset.
d <- data.frame(stringsAsFactors = FALSE, transaction = c("T01", "T02",
"T03", "T04", "T05", "T06", "T07", "T08", "T09", "T10"),date =
c("2012-10-19", "2012-10-19", "2012-10-19", "2012-10-19", "2012-10-22",
"2012-10-23", "2012-10-23", "2012-10-23", "2012-10-23", "2012-10-23"),time
= c("08:00", "09:00", "10:00", "11:00", "12:00", "13:00", "14:00", "15:00",
"16:00", "17:00"))

d$date <- as.Date(d$date,format="%Y-%m-%d")
d$time<-strptime(d$time,format="%H:%M")$hour
d$flag<-unlist(rbind(lapply(split(d,d$date),function(x) x[3]==max(x[3]))))
d$datetime<-as.POSIXct(paste(d$date,d$time," "),format="%Y-%m-%d %H")
d1<-d[,c(1,5,4)]
d1
#   transaction            datetime  flag
#1          T01 2012-10-19 08:00:00 FALSE
#2          T02 2012-10-19 09:00:00 FALSE
#3          T03 2012-10-19 10:00:00 FALSE
#4          T04 2012-10-19 11:00:00  TRUE
#5          T05 2012-10-22 12:00:00  TRUE
#6          T06 2012-10-23 13:00:00 FALSE
#7          T07 2012-10-23 14:00:00 FALSE
#8          T08 2012-10-23 15:00:00 FALSE
#9          T09 2012-10-23 16:00:00 FALSE
#10         T10 2012-10-23 17:00:00  TRUE

str(d1)
#'data.frame':    10 obs. of  3 variables:
# $ transaction: chr  "T01" "T02" "T03" "T04" ...
# $ datetime   : POSIXct, format: "2012-10-19 08:00:00" "2012-10-19 09:00:00" ...
# $ flag       : logi  FALSE FALSE FALSE TRUE TRUE FALSE ...

A.K.


----- Original Message -----
From: Flavio Barros <flaviomargarito at gmail.com>
To: William Dunlap <wdunlap at tibco.com>
Cc: "r-help at r-project.org" <r-help at r-project.org>; ramoss <ramine.mossadegh at finra.org>
Sent: Friday, October 19, 2012 4:24 PM
Subject: Re: [R] Creating a new by variable in a dataframe

I think I have a better solution

## Example data.frame
d <- data.frame(stringsAsFactors = FALSE, transaction = c("T01", "T02",
"T03", "T04", "T05", "T06", "T07", "T08", "T09", "T10"),date =
c("2012-10-19", "2012-10-19", "2012-10-19", "2012-10-19", "2012-10-22",
"2012-10-23", "2012-10-23", "2012-10-23", "2012-10-23", "2012-10-23"),time
= c("08:00", "09:00", "10:00", "11:00", "12:00", "13:00", "14:00", "15:00",
"16:00", "17:00"))

## Date transformation
d$date <- as.Date(d$date)
d$time <- strptime(d$time, format='%H')

library(reshape)

## Create factor to split the data
fdate <- factor(format(d$date, '%D'))

## Create a list with logical TRUE where the row is the last transaction
ex <- sapply(split(d, fdate), function(x)
ifelse(as.numeric(x[,'time'])==max(as.numeric(x[,'time'])),T,F))

## Coerce to logical vector
flag <- unlist(rbind(ex))

## With reshape we have the transform function, and we can add the flag column
d <- transform(d, flag = flag)
#
I think that line is unnecessarily complicated. lapply() returns a list
and rbind applied to one argument, L, mainly adds dimensions c(1, length(L))
to it (it also changes its names to column names).  unlist doesn't care about
the dimensions, so you may as well leave out the rbind.  The only difference
in the results with and without calling rbind is that the rbind version omits
the names from flag.  Use the more direct unname() on split's output or
unlist's output if that concerns you.
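
A small sketch of the point about names. The toy vectors time and date and the list L are hypothetical (not from the thread), but L is built the same way as in the line under discussion:

```r
# Hypothetical toy data: three transactions on one day, two on another
time <- c(8L, 9L, 10L, 12L, 13L)
date <- c("2012-10-19", "2012-10-19", "2012-10-19", "2012-10-22", "2012-10-22")

# One logical vector per day, TRUE where the time is the day's maximum
L <- lapply(split(time, date), function(x) x == max(x))

with_rbind    <- unlist(rbind(L))  # names are dropped
without_rbind <- unlist(L)         # names are built from the group labels

# The values agree; only the names differ
identical(with_rbind, unname(without_rbind))
is.null(names(with_rbind))
```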

Also, if you are interested in saving time and memory when the input, d, is large,
you will be better off applying split() to just the column of the data.frame
that you want split instead of to the entire data.frame.
   d$flag2 <- unlist(lapply(unname(split(d[[3]], d$date), function(x)x==max(x))))
(I used d[[3]] instead of the more readable d$time to follow your original more closely.)

You ought to check that the data is sorted by date: otherwise these give the
wrong answer.
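
To illustrate the sorting caveat, here is a minimal sketch on a deliberately unsorted toy data frame (hypothetical, not from the thread). It assumes unsplit() is an acceptable way to put the per-day results back into row order; is.unsorted() is used as a quick check.

```r
# Hypothetical toy data with the dates interleaved (not sorted)
d <- data.frame(
  stringsAsFactors = FALSE,
  date = c("2012-10-23", "2012-10-19", "2012-10-23", "2012-10-19"),
  time = c(13L, 8L, 17L, 11L)
)

# Quick check: TRUE here, so split() + unlist() cannot be trusted as-is
is.unsorted(d$date)

per_day <- lapply(split(d$time, d$date), function(x) x == max(x))

# unlist() returns the flags grouped by date, not in the original row order
wrong <- unlist(per_day, use.names = FALSE)

# unsplit() reverses split(), so each flag goes back to the row it came from
right <- unsplit(per_day, d$date)
```

Sorting d by date first (for example with order()) and then using the split()/unlist() line also avoids the problem.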

What result do you want when there are several transactions at the last time
in the day?
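
A short sketch of why the question matters, using a hypothetical single day with two transactions at the latest time. The two styles of flag used in this thread give different answers in that case:

```r
# Hypothetical times for one day, already sorted; the last two are tied
time <- c(9L, 17L, 17L)

# Comparing against the maximum flags every tied row
flag_by_max <- time == max(time)

# Comparing positions flags only the final row of the group
flag_by_position <- seq_along(time) == length(time)
```

Which one is right depends on whether "last transaction of the day" should mean exactly one row per day.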

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
Hi Bill,

Thanks for the reply.
It was unnecessarily complicated.
d$flag<-unlist(lapply(split(d,d$date),function(x) x[3]==max(x[3])),use.names=FALSE)
#or
d$flag<-unlist(lapply(split(d,d$date),function(x) x[3]==max(x[3])))
should have done the same job.
str(d)
#'data.frame':    10 obs. of  4 variables:
# $ transaction: chr  "T01" "T02" "T03" "T04" ...
# $ date       : Date, format: "2012-10-19" "2012-10-19" ...
# $ time       : int  8 9 10 11 12 13 14 15 16 17
# $ flag       : logi  FALSE FALSE FALSE TRUE TRUE FALSE ...

I am getting error messages with:
d$flag2 <- unlist(lapply(unname(split(d[[3]], d$date), function(x)x==max(x))))
Error in match.fun(FUN) : argument "FUN" is missing, with no default


A.K.





----- Original Message -----
From: William Dunlap <wdunlap at tibco.com>
To: arun <smartpink111 at yahoo.com>; Flavio Barros <flaviomargarito at gmail.com>
Cc: R help <r-help at r-project.org>; ramoss <ramine.mossadegh at finra.org>
Sent: Saturday, October 20, 2012 12:04 PM
Subject: RE: [R] Creating a new by variable in a dataframe
#
Hi Bill,

I figured it out.
d$flag2<-unlist(lapply(unname(split(d[[3]],d$date)),function(x) x==max(x)))
# [1] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE

The misplaced ")" caused the error.

A.K.




#
I'm sorry, I stuck in the unname() in the mail but did not run it - its closing
parenthesis should be after split's closing parenthesis, not at the end.
[1] TRUE

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com