aggregate(), tapply(): Why is the order of the grouping variables not kept?

2 messages · Marius Hofert, Peter Ehlers

#
Dear expeRts,

The question is rather simple: Why does aggregate() (and, similarly, tapply()) not keep the order of the grouping variable(s)?

Here is an example:

x <- data.frame(group = rep(LETTERS[1:2], each=10),
                year  = rep(rep(2001:2005, each=2), 2),
                value = rep(1:10, each=2))
## => sorted according to group, then year
aggregate(value ~ group + year, data=x, FUN=function(z) z[1])
## => sorted according to year, then group

I would rather have expected this to be the default:

aggregate(value ~ year + group, data=x, FUN=function(z) z[1])[,c(2,1,3)]
## => same order as input (grouping) variables

Same with tapply:

as.data.frame(as.table(tapply(x$value, list(x$group, x$year), FUN=function(z) z[1])))


Cheers,

Marius
#
On 2013-03-11 13:52, Marius Hofert wrote:
I'm no expeRt, but suppose that we change the setup slightly:

   xx <- x[sample(nrow(x)), ]

Now what would you like

  aggregate(value ~ group + year, data=xx, FUN=function(z) z[1])

to return?

Personally, I prefer to have R return the same thing regardless
of how the input data frame is sorted, i.e. the result should
depend only on the formula. You just have to know that the rows
are ordered so that the first factor varies most rapidly, then the
next, etc. I think that's documented somewhere, but I don't know where.
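
To make the point above concrete, here is a minimal sketch using the data from the original post. It checks that aggregate() gives the same rows in the same order for the shuffled and unshuffled data, that the first grouping variable varies fastest, and that an explicit order() call recovers the "group, then year" ordering. The names first, r1, r2 and r_sorted and the seed are only illustrative and do not come from the thread; the comparison holds here because the two values within each group/year cell are equal, so z[1] does not depend on row order.

```r
# Data from the original post
x <- data.frame(group = rep(LETTERS[1:2], each = 10),
                year  = rep(rep(2001:2005, each = 2), 2),
                value = rep(1:10, each = 2))

# Illustrative helper (hypothetical name): first element of each cell
first <- function(z) z[1]

# Shuffle the rows; the seed is arbitrary and only makes the sketch repeatable
set.seed(1)
xx <- x[sample(nrow(x)), ]

r1 <- aggregate(value ~ group + year, data = x,  FUN = first)
r2 <- aggregate(value ~ group + year, data = xx, FUN = first)

# Same rows in the same order, whatever the order of the input rows
stopifnot(as.character(r1$group) == as.character(r2$group),
          r1$year  == r2$year,
          r1$value == r2$value)

# The first grouping variable in the formula (group) varies fastest
stopifnot(as.character(r1$group) == rep(c("A", "B"), times = 5),
          r1$year == rep(2001:2005, each = 2))

# To get "group, then year" order, sort the result explicitly
r_sorted <- r1[order(r1$group, r1$year), ]
rownames(r_sorted) <- NULL
stopifnot(r_sorted$value == 1:10)
```

Sorting the result with order() keeps the columns in formula order, so it avoids swapping the formula terms and then re-indexing the columns with [,c(2,1,3)].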

Peter Ehlers