Dear list,
I get some strange results with daply from the plyr package. In the
example below, the average age per municipality for employed and
unemployed people is calculated. If I do this using tapply (see code below) I
get the following result:
        no      yes
A       NA 36.94931
B 51.22505 34.24887
C 48.05759 51.00198
If I do this using daply:
municipality       no      yes
           A 36.94931 48.05759
           B 51.22505 51.00198
           C 34.24887       NA
daply generates the same numbers. However, these are not in the
correct cells. For example, in municipality A everybody is employed.
Therefore, the NA should be in the cell for unemployed in municipality
A.
Am I using daply incorrectly or is there indeed something wrong with
the output of daply?
Regards,
Jan
I am using version 1.1 of the plyr-package.
library(plyr)

# Generate some test data
data.test <- data.frame(
    municipality = rep(LETTERS[1:3], each = 10),
    employed = sample(c("yes", "no"), 30, replace = TRUE),
    age = runif(30, 20, 70))
# Make sure everybody is employed in municipality A
data.test$employed[data.test$municipality == "A"] <- "yes"
# Compare the output of tapply:
tapply(data.test$age, list(data.test$municipality, data.test$employed), mean)
# to that of daply:
daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
# results of ddply are the same as those of tapply
ddply(data.test, .(municipality, employed), function(d){mean(d$age)} )
Strange output daply with empty strata
4 messages · Dennis Murphy, Hadley Wickham, Jan van der Laan
daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
            employed
municipality       no      yes
           A 41.58759 44.67463
           B 55.57407 43.82545
           C 43.59330       NA
The .drop argument has a different meaning in daply. Some R functions have
an na.last argument, and it may be that somewhere in daply, there is a
function call that moves all NAs to the end. The means are in the right
order except for the first, where the NA is supposed to be, so everything is
offset in the table by 1. I've cc'ed Hadley on this.
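In the meantime, one way to cross-check the cell placement is to compute the means in long format and put each one into the table by its row and column name rather than by position. The following is only a minimal base-R sketch, not how daply works internally: `means_by_name` is a hypothetical helper, and the data frame is a small deterministic stand-in for the random test data, with the same column names.

```r
# Sketch: build the municipality x employed table by name, not by position.
# `means_by_name` is a hypothetical helper; it is not part of plyr or base R.
means_by_name <- function(df) {
  # Long-format means; empty strata simply do not appear as rows
  long <- aggregate(age ~ municipality + employed, data = df, FUN = mean)

  rows <- sort(unique(as.character(df$municipality)))
  cols <- sort(unique(as.character(df$employed)))

  # Start from an all-NA matrix so that empty strata stay NA
  out <- matrix(NA_real_, nrow = length(rows), ncol = length(cols),
                dimnames = list(rows, cols))

  # A two-column character matrix indexes cells by (row name, column name)
  idx <- cbind(as.character(long$municipality), as.character(long$employed))
  out[idx] <- long$age
  out
}

# Deterministic stand-in data (an assumption, not the data from the post)
data.test <- data.frame(
  municipality = rep(LETTERS[1:3], each = 10),
  employed = rep(c("yes", "no"), 15),
  age = seq(20, 49))
# Make sure everybody is employed in municipality A
data.test$employed[data.test$municipality == "A"] <- "yes"

res <- means_by_name(data.test)
print(res)
```

Because every mean is assigned through its own pair of names, an empty stratum cannot shift the remaining values into neighbouring cells; the result should agree with the tapply call from the original post.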
This is a bug, which I've fixed in the development version (hopefully to be released next week). In plyr 1.2:
daply(data.test, .(municipality, employed), function(d){mean(d$age)} )
            employed
municipality       no      yes
           A       NA 39.49980
           B 44.69291 51.63733
           C 57.38072 45.28978
Hadley
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/
OK, thank you both for your answers. I'll wait for the next version.
Regards,
Jan