Warning: as.numeric reorders factor data
The behavior makes more sense now, but it needs clarification in the help files. Specifically, aggregate should mention that it converts its by arguments to character.

Factoring a numeric vector gives what you might expect: levels ordered numerically. So even though I knew the by variables were being factored, it seemed they should be okay. For instance:

    > factor(c(1:15))
    [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Ultimately, it occurred to me after much staring at the output that the factoring must be doing a character conversion first, and that is how I figured out my workaround. As it turns out, the workaround is in the FAQ (albeit filed under Miscellanea).

So the problem may not be in as.numeric after all. Since one needs to know something about the internal processing of aggregate to use it, this probably belongs in its help file.
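For anyone following along, the difference between factoring numbers and factoring their character form can be seen in a small sketch like the one below. The variable names are made up for illustration, and this is not the original data from the report.

```r
# Illustrative vector only; not the data from the original report.
x <- c(1:15)

f_num <- factor(x)                # levels come from sorting the numbers
f_chr <- factor(as.character(x))  # levels come from sorting the strings

levels(f_num)  # "1" "2" "3" ... "15"
levels(f_chr)  # string sort: "10" lands before "2"
```

The two factors print identical values, so the different level order is easy to miss until the factor is coerced back to a number.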
Thomas Lumley wrote:
On Sun, 8 Dec 2002, Bud Gibson wrote:
Thanks for the clarification. It's nice to know that there is some systematicity to the behavior. Is this documented anywhere? I did look at the help for as.numeric, and it makes no mention that it is coercing factors based on their level.
Well, the help page for as.numeric says
`as.numeric' for factors yields the codes underlying the factor
levels, not the numeric representation of the labels.
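A short sketch of what that sentence means in practice, using a made-up factor. The two conversions at the end are the usual idioms for recovering the printed numbers, and the first is presumably the workaround the FAQ describes.

```r
# Made-up values chosen so that codes and labels visibly differ.
f <- factor(c(30, 10, 20))

levels(f)                    # "10" "20" "30"
as.numeric(f)                # 3 1 2     (the underlying codes)

# Converting the labels instead of the codes:
as.numeric(as.character(f))  # 30 10 20
as.numeric(levels(f))[f]     # 30 10 20  (same result, via the levels)
```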
This may be obvious to those deeply immersed in R and its machinations, but to those who think the number they see on the screen should just become a number when it is coerced to one, it is disconcerting.
Yes it is. It might have been better if at the dawn of time codes() had been defined to do what as.numeric does and as.numeric to do what you expect. However, it's not completely obvious: what should as.numeric do with a factor of postal codes whose levels are "3163" "90210" and "OX1 3DP"?
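To make the postal-code question concrete, here is a small sketch; the variable name is just for illustration. Converting the labels works only where a label happens to parse as a number.

```r
postal <- factor(c("3163", "90210", "OX1 3DP"))

as.numeric(postal)  # the codes; always defined

# Converting the labels gives NA (with a warning) for "OX1 3DP".
# The warning is suppressed here only to keep the sketch quiet.
vals <- suppressWarnings(as.numeric(as.character(postal)))
vals                # 3163 90210 NA
```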
Further, if I just factor the same vector, and then coerce it back to numeric, the order I would have expected is preserved. I did not report that test because it seemed irrelevant. Why isn't aggregate just doing that?
Because when you have more than one `by' variable in aggregate it needs to make a factor of the combined levels, which it does by pasting them together as characters.
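The effect of pasting can be sketched as follows. This is only an illustration of the mechanism described above, not aggregate's actual source; the variable names and the "." separator are assumptions.

```r
# Two hypothetical grouping variables.
g1 <- c(1, 2, 10, 1)
g2 <- c("a", "a", "a", "b")

# Paste the keys together as characters, then factor the result.
key <- paste(g1, g2, sep = ".")
combined <- factor(key)

# The levels are sorted as strings, so "10.a" comes before "2.a"
# even though 10 > 2 numerically.
levels(combined)
```

Once the numeric key has been through paste, its numeric order is gone, which would explain why the groups come back in an unexpected order.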
My cut is that there should be some warning in the documentation, perhaps in aggregate, about the specific assumptions used in making implicit transformations and what one can expect.
It might be worth help(aggregate) mentioning that the variables are turned into factors.

-thomas