
Problem with ddply in the plyr-package: surprising output of a date-column

9 messages · Christoph Jäckel, Peter Ehlers, William Dunlap +2 more

#
Hi all,

I have a problem with the ddply function from the plyr package and
would be very grateful for any help. I hope the example below is
precise enough for someone to identify the problem. In this step I
want to identify observations that are identical in terms of certain
identifiers (ID1, ID2, ID3) and save just those observations in a
separate data.frame, without deleting any rows or manipulating any
data. However, I get the warning message below and the Date column is
messed up. Interestingly, the Value column is handled correctly (its
type is factor here, but converting it with as.integer makes no
difference). Any idea what I am doing wrong?

df <- data.frame(cbind(ID1=c(1,2,2,3,3,4,4),
                       ID2=c('a','b','b','c','d','e','e'),
                       ID3=c("v1","v1","v1","v1","v2","v1","v1"),
                       Date=c("1985-05-1","1985-05-2","1985-05-3","1985-05-4",
                              "1985-05-5","1985-05-6","1985-05-7"),
                       Value=c(1,2,3,4,5,6,7)))
df[,1] <- as.character(df[,1])
df[,2] <- as.character(df[,2])
df$Date   <- strptime(df$Date,"%Y-%m-%d")

#Apparently there are two groups of observations that share the same IDs: ID1=2 and ID1=4
ddply(df,.(ID1,ID2,ID3),nrow)
#I want to save those IDs in a separate data.frame, so the desired output is:
df[c(2:3,6:7),]

#My idea: write a custom function that only returns groups with
#multiple rows.
#Seems to work, except that the Date column doesn't make any sense anymore.
#Warning message: In output[[var]][rng] <- df[[var]]: number of items
#to replace is not a multiple of replacement length
ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})

#Notice that it works perfectly if only one group has
#multiple rows
ddply(df[1:6,],.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})

Thanks in advance,

Christoph

Christoph Jäckel (Dipl.-Kfm.)
Research Assistant
Chair for Financial Management and Capital Markets | Lehrstuhl für
Finanzmanagement und Kapitalmärkte
TUM School of Management | Technische Universität München
Arcisstr. 21 | D-80333 München | Germany
#
On 4/25/2011 10:19 AM, Christoph Jäckel wrote:
Works for me:

 > df[c(2:3,6:7),]
   ID1 ID2 ID3      Date Value
2   2   b  v1 1985-05-2     2
3   2   b  v1 1985-05-3     3
6   4   e  v1 1985-05-6     6
7   4   e  v1 1985-05-7     7
 > ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
   ID1 ID2 ID3      Date Value
1   2   b  v1 1985-05-2     2
2   2   b  v1 1985-05-3     3
3   4   e  v1 1985-05-6     6
4   4   e  v1 1985-05-7     7
 > sessionInfo()
R version 2.13.0 (2011-04-13)
Platform: x86_64-pc-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_1.5.2

loaded via a namespace (and not attached):
[1] tools_2.13.0

A couple of things: there was just an update of plyr to 1.5.2; maybe 
that fixes what you are seeing?  Also, your df consists of only factors. 
  cbind-ing the data before turning it into a data.frame makes it a 
character matrix which gets converted to factors.

 > str(df)
'data.frame':   7 obs. of  5 variables:
  $ ID1  : Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4
  $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
  $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
  $ Date : Factor w/ 7 levels "1985-05-1","1985-05-2",..: 1 2 3 4 5 6 7
  $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7
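
The coercion described above can be reproduced in isolation. The following is a minimal sketch on a small made-up data set; the names m and d are illustrative and do not come from the original post.

```r
# cbind() on vectors of mixed type returns a matrix, and a matrix can hold
# only one type, so the numeric vectors are coerced to character.
m <- cbind(ID1 = c(1, 2), ID2 = c("a", "b"), Value = c(10, 20))
print(is.character(m))

# Passing the vectors to data.frame() directly keeps each column's own type.
d <- data.frame(ID1 = c(1, 2), ID2 = c("a", "b"), Value = c(10, 20))
print(sapply(d, class))
```

Whether the character columns then show up as factors depends on the stringsAsFactors default, which was TRUE before R 4.0.0; the numeric columns stay numeric either way when cbind() is left out.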

Maybe that has something to do with the odd "dates" since they are not 
really dates at all, just string representations of factor levels. 
Compare with:

DF <- data.frame(ID1=c(1,2,2,3,3,4,4),
	ID2=c('a','b','b','c','d','e','e'),
	ID3=c("v1","v1","v1","v1","v2","v1","v1"),
	Date=as.Date(c("1985-05-1","1985-05-2","1985-05-3",
		"1985-05-4","1985-05-5","1985-05-6","1985-05-7")),
	Value=c(1,2,3,4,5,6,7))
str(DF)
#'data.frame':   7 obs. of  5 variables:
# $ ID1  : num  1 2 2 3 3 4 4
# $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
# $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
# $ Date : Date, format: "1985-05-01" "1985-05-02" ...
# $ Value: num  1 2 3 4 5 6 7

This version also works for me.

ddply(DF,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
#  ID1 ID2 ID3       Date Value
#1   2   b  v1 1985-05-02     2
#2   2   b  v1 1985-05-03     3
#3   4   e  v1 1985-05-06     6
#4   4   e  v1 1985-05-07     7

  
    
#
On 2011-04-25 10:19, Christoph Jäckel wrote:
I would characterize your problem as:
a) using strptime - this is what gives ddply() fits;

b) not using str() to check whether R agrees with
    you with respect to your data;

c) using cbind() inside data.frame(). This isn't
    wrong, but is rarely (in my experience) useful.

If you use as.Date (or even nothing) on your Date
variable, you'll find that ddply does what you want.
To see why it doesn't work with strptime, check
str(df) and then ?POSIXlt. You've converted Date
values to lists.
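
The point about lists can be checked directly. This is a minimal sketch using two of the dates from the original post; the variable names lt, d and ct are illustrative, and tz = "UTC" is an assumption added only to make the result independent of the local time zone.

```r
# strptime() returns a POSIXlt object, which is a list of date-time
# components (sec, min, hour, mday, ...) rather than a plain vector.
lt <- strptime(c("1985-05-1", "1985-05-2"), "%Y-%m-%d", tz = "UTC")
print(class(lt))
print(is.list(unclass(lt)))
print(names(unclass(lt)))

# Date and POSIXct store one number per element, so they behave like
# ordinary atomic columns.
d  <- as.Date(lt)
ct <- as.POSIXct(lt)
print(is.numeric(unclass(d)))
print(is.numeric(unclass(ct)))
```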

My comment about cbind() is to warn you that your
Value variable, as you have constructed it, is
a factor.

Peter Ehlers
#
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
The OP's data.frame contained a POSIXlt (not factor) object
in the "Date" column
  > str(df)
  'data.frame':   7 obs. of  5 variables:
   $ ID1  : chr  "1" "2" "2" "3" ...
   $ ID2  : chr  "a" "b" "b" "c" ...
   $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
   $ Date : POSIXlt, format: "1985-05-01" "1985-05-02" ...
   $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7
and apparently plyr's equivalent of rbind doesn't support that class.

If you want to continue using POSIXlt objects you can get your
immediate result without ddply; subscripting will do the job:
  > nDups <- with(df, ave(rep(0,nrow(df)), ID1, ID2, ID3, FUN=length))
  > print(nDups)
  [1] 1 2 2 1 1 2 2
  > df[nDups>1, ]
    ID1 ID2 ID3       Date Value
  2   2   b  v1 1985-05-02     2
  3   2   b  v1 1985-05-03     3
  6   4   e  v1 1985-05-06     6
  7   4   e  v1 1985-05-07     7
  > str(.Last.value)
  'data.frame':   4 obs. of  5 variables:
   $ ID1  : chr  "2" "2" "4" "4"
   $ ID2  : chr  "b" "b" "e" "e"
   $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1
   $ Date : POSIXlt, format: "1985-05-02" "1985-05-03" ...
   $ Value: Factor w/ 7 levels "1","2","3","4",..: 2 3 6 7

If you need plyr for other tasks you ought to use a different
class for your date data (or wait until plyr can deal with
POSIXlt objects).

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
On 4/25/2011 11:55 AM, William Dunlap wrote:
Thanks, Bill. Somehow I missed that, despite the OP having it in his 
code; I even copied it into my testing window.  It was my error for not 
running it and noting it.
plyr uses rbind.fill primarily.  And it doesn't handle columns of 
POSIXlt based on testing that directly. (Although with only one 
argument, it just passes the data.frame back, which is why when there 
was just a single duplicate, it worked; that bypassed the code that 
couldn't handle POSIXlt's.)
If you do want to change classes, both Date and POSIXct are choices that 
will work with plyr.
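
A sketch of that class change, assuming the data from the original post rebuilt without cbind(): the strptime() result is converted to POSIXct before it is stored in the data frame. The tz = "UTC" argument and the name dups are assumptions made for illustration, and the ddply() call is only attempted if plyr happens to be installed.

```r
df <- data.frame(ID1 = c(1, 2, 2, 3, 3, 4, 4),
                 ID2 = c("a", "b", "b", "c", "d", "e", "e"),
                 ID3 = c("v1", "v1", "v1", "v1", "v2", "v1", "v1"),
                 Date = c("1985-05-1", "1985-05-2", "1985-05-3", "1985-05-4",
                          "1985-05-5", "1985-05-6", "1985-05-7"),
                 Value = c(1, 2, 3, 4, 5, 6, 7))

# Convert the POSIXlt result of strptime() to POSIXct before storing it,
# so the column is a plain numeric vector with a class attribute.
df$Date <- as.POSIXct(strptime(df$Date, "%Y-%m-%d", tz = "UTC"))
print(class(df$Date))

# Only runs if plyr is available; keeps groups with more than one row.
if (requireNamespace("plyr", quietly = TRUE)) {
  dups <- plyr::ddply(df, c("ID1", "ID2", "ID3"),
                      function(d) if (nrow(d) <= 1) NULL else d)
  print(dups)
}
```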

  
    
#
How do you get POSIXlt objects into a data frame?
'data.frame':	1 obs. of  1 variable:
 $ x: POSIXct, format: "2008-01-01"
'data.frame':	1 obs. of  1 variable:
 $ x: AsIs, format: "0"

Hadley
#
Hi together,

thank you so much for your help! The problem was indeed the
strptime-function. Replacing that with as.Date solves the problem,
both in the example I provided and in my actual data set.

I think this is a lesson for me to not use types I'm not really
familiar with (POSIXlt in this case).

Thanks again!

Christoph
On Mon, Apr 25, 2011 at 10:07 PM, Hadley Wickham <hadley at rice.edu> wrote:
#
On 4/25/2011 1:07 PM, Hadley Wickham wrote:
Assigning to a column after the data.frame creation step

 > df <- data.frame(x = as.POSIXlt(as.Date(c("2008-01-01"))))
 > str(df)
'data.frame':   1 obs. of  1 variable:
  $ x: POSIXct, format: "2008-01-01"
 > dput(df)
structure(list(x = structure(1199145600, class = c("POSIXct",
"POSIXt"), tzone = "UTC")), .Names = "x", row.names = c(NA, -1L
), class = "data.frame")
 > df$x <- as.POSIXlt(as.Date(c("2008-01-01")))
 > str(df)
'data.frame':   1 obs. of  1 variable:
  $ x: POSIXlt, format: "2008-01-01"
 > dput(df)
structure(list(x = structure(list(sec = 0, min = 0L, hour = 0L,
     mday = 1L, mon = 0L, year = 108L, wday = 2L, yday = 0L, isdst = 0L),
     .Names = c("sec", "min", "hour", "mday", "mon", "year", "wday",
     "yday", "isdst"), class = c("POSIXlt", "POSIXt"), tzone = "UTC")),
     .Names = "x", row.names = c(NA, -1L), class = "data.frame")

This is reminiscent of the 1d array problem; there are types that are 
coerced into other types when passed as part of a data.frame constructor 
(data.frame call), but are not coerced when assigned to a column.
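
One way to guard against this difference is to check for POSIXlt columns after assignment and convert them. The following is a sketch only; the name is_lt is illustrative, and tz = "UTC" is an assumption to keep the example independent of the local time zone.

```r
# The constructor coerces POSIXlt to POSIXct ...
df <- data.frame(x = as.POSIXlt("2008-01-01", tz = "UTC"))
print(class(df$x))

# ... but column assignment stores the POSIXlt object as it is.
df$x <- as.POSIXlt("2008-01-01", tz = "UTC")
print(class(df$x))

# Find any POSIXlt columns and convert them to POSIXct in place.
is_lt <- vapply(df, inherits, logical(1), what = "POSIXlt")
df[is_lt] <- lapply(df[is_lt], as.POSIXct)
print(class(df$x))
```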

Looking at help pages, calls to data.frame call as.data.frame on each 
argument; `[<-.data.frame` has a section on coercion which starts "The 
story over when replacement values are coerced is a complicated one, and 
one that has changed during R's development. This section is a guide 
only." which makes me think it is not all that well defined.

Digging more, there is a as.data.frame.POSIXlt, although the help page 
for it (DateTimeClasses in base) does not mention it or document it.  It 
is documented, though, in as.data.frame (which also has comments about 
coercing 1 dimensional arrays).

So, potentially, there could be differences with any class that has an 
as.data.frame method because it will be treated differently if passed to 
data.frame versus a column assignment with `[<-.data.frame`

 > methods("as.data.frame")
  [1] as.data.frame.aovproj*        as.data.frame.array
  [3] as.data.frame.AsIs            as.data.frame.character
  [5] as.data.frame.complex         as.data.frame.data.frame
  [7] as.data.frame.Date            as.data.frame.default
  [9] as.data.frame.difftime        as.data.frame.factor
[11] as.data.frame.ftable*         as.data.frame.function
[13] as.data.frame.idf*            as.data.frame.integer
[15] as.data.frame.list            as.data.frame.logical
[17] as.data.frame.logLik*         as.data.frame.matrix
[19] as.data.frame.model.matrix    as.data.frame.numeric
[21] as.data.frame.numeric_version as.data.frame.ordered
[23] as.data.frame.POSIXct         as.data.frame.POSIXlt
[25] as.data.frame.raw             as.data.frame.table
[27] as.data.frame.ts              as.data.frame.vector

So, I suppose it is working as documented.  Though I wonder how long ago 
it was that someone (who has been using R regularly for at least a year) 
actually read the entire help page for data.frame and/or as.data.frame. 
  It's one of those things you think you know and understand until you 
find out you don't.
#
On 2011-04-25 13:07, Hadley Wickham wrote:
To mimic the OP's code

   df <- data.frame(x = "2008-01-01")
   df$x <- as.POSIXlt(df$x, format = "%Y-%m-%d")
   str(df)
   #'data.frame':   1 obs. of  1 variable:
   # $ x: POSIXlt, format: "2008-01-01"

Peter Ehlers