Dear all,
Today I discovered that there is a neat function called droplevels,
which, well, drops unused levels in a data frame. I tried the function
on some of my data sets, and it turned out that not only the unused
levels were dropped but also the contrasts I had set via "C". I had a
look at the code: this behaviour arises because droplevels simply calls
factor to drop the unused levels. The rebuilt factor carries no
"contrasts" attribute, so the defaults set by options("contrasts") apply.
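A minimal sketch of the behaviour described above, assuming a made-up factor with one declared but unobserved level (the object names f and g are for illustration only):

```r
# Hypothetical example data: level "d" is declared but never observed.
f <- factor(c("a", "b", "c"), levels = c("a", "b", "c", "d"))

# Attach sum-to-zero contrasts via C(); they are stored by name.
f <- C(f, sum)
attr(f, "contrasts")

# droplevels() rebuilds the factor, so the unused level goes away ...
g <- droplevels(f)
levels(g)

# ... and, as described above, so does the "contrasts" attribute.
attr(g, "contrasts")
```

Comparing attr(f, "contrasts") with attr(g, "contrasts") is the quickest way to check whether a given R version shows the same behaviour.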
I think this behaviour is annoying, because if one does not look
carefully enough, one loses the contrasts silently. Hence may I suggest
changing the code of droplevels to something like the following:
    droplevels <- function (x, except = NULL, ...) {
        ix <- vapply(x, is.factor, NA)
        if (!is.null(except))
            ix[except] <- FALSE
        co <- lapply(x[ix], function(fa) attr(fa, "contrasts"))
        x[ix] <- mapply(function(fa, co) {
            if (nlevels(factor(fa)) == 1) {
                factor(fa)
            } else {
                C(factor(fa), co)
            }
        }, x[ix], co, SIMPLIFY = FALSE)
        x
    }
which keeps the original contrasts AND drops the unused levels?
Similarly, droplevels.factor should be changed to
    droplevels.factor <- function (x, ...) {
        co <- attr(x, "contrasts")
        if (nlevels(factor(x)) == 1) {
            factor(x)
        } else {
            C(factor(x), co)
        }
    }
The nlevels check is necessary because C does not work if there are
fewer than two levels.
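A small sketch of why that guard is needed, assuming a made-up one-level factor (the names one and res are illustrative):

```r
# Hypothetical one-level factor.
one <- factor(c("a", "a"))

# C() sets contrasts through `contrasts<-`, which requires at least
# two levels, so this call is expected to signal an error.
# tryCatch() captures the error message instead of stopping the script.
res <- tryCatch(C(one, sum), error = function(e) conditionMessage(e))
res
```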
Any comments appreciated.
KR,
-Thorn
droplevels: drops contrasts as well
3 messages · Thaler, Thorn, LAUSANNE, Applied Mathematics, Thomas Lumley
3 days later
On Fri, Oct 21, 2011 at 5:57 AM, Thaler, Thorn, LAUSANNE, Applied
Mathematics <Thorn.Thaler at rdls.nestle.com> wrote:
> Dear all,
> Today I figured out that there is a neat function called droplevels,
> which, well, drops unused levels in a data frame. I tried the function
> with some of my data sets and it turned out that not only the unused
> levels were dropped but also the contrasts I set via "C". I had a look
> into the code, and this behaviour arises from the fact that droplevels
> uses simply factor to drop the unused levels, which uses the default
> contrasts as set by options("contrasts").
> I think this behaviour is annoying, because if one does not look
> carefully enough, one loses the contrasts silently. Hence may I suggest
> to change the code of droplevels to something like the following:
This silently changes the contrasts -- e.g., if the first level of the
factor is one of the empty levels, the reference level used by
contr.treatment() will change. Also, if the contrasts are a matrix
rather than a contrast function, the matrix will be invalid for the new
factor.

I think just having a warning would be better -- in general it's not
clear what (if anything) it means to have the same contrasts on factors
with different numbers of levels.

   -thomas
Thomas Lumley Professor of Biostatistics University of Auckland
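Both points raised in the reply above can be reproduced with a small sketch. The data and the names f, g, m and msg are made up for illustration, and factor(f) stands in for the level-dropping step:

```r
# Hypothetical factor whose FIRST level "a" is unused.
f <- factor(c("b", "c", "d"), levels = c("a", "b", "c", "d"))
g <- factor(f)   # drop the unused level

# Point 1: the baseline used by contr.treatment() is the first level,
# and that level differs before and after dropping.
levels(f)[1]
levels(g)[1]

# Point 2: a contrast matrix built for four levels fits f ...
m <- contr.sum(4)
contrasts(f) <- m

# ... but not the three-level factor g; the assignment is expected to fail.
msg <- tryCatch({
    contrasts(g) <- m
    "no error"
}, error = function(e) conditionMessage(e))
msg

# Contrasts given by name are re-computed for the new number of levels.
contrasts(g) <- "contr.sum"
dim(contrasts(g))
```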
>> I think this behaviour is annoying, because if one does not look
>> carefully enough, one loses the contrasts silently. Hence may I suggest
>> to change the code of droplevels to something like the following:
>
> This silently changes the contrasts -- e.g., if the first level of the
> factor is one of the empty levels, the reference level used by
> contr.treatment() will change. Also, if the contrasts are a matrix
> rather than a contrast function, the matrix will be invalid for the
> new factor.
Well, you are right. I am not so much concerned about the first issue
you outlined (the change in the baseline): if I decide to drop unused
levels, I am aware that a non-existent level cannot be the baseline any
more. The second point, however, is clearly an issue I overlooked.

> I think just having a warning would be better -- in general it's not
> clear what (if anything) it means to have the same contrasts on factors
> with different numbers of levels.
That would be an option, and I think it should be the minimum. Still, I
think a behaviour like the following would be more desirable:

1.) If the contrasts are defined as a matrix, issue a warning and use
    the default contrasts (that is, nothing changes compared to now,
    except that a warning is issued).
2.) If the contrasts are defined as a function, use that function to
    re-compute the contrasts.

Contrasts can also be seen as a general setting of how coefficients
should be interpreted (e.g. for a balanced data set with "sum"
contrasts, the intercept corresponds to the overall mean, beta1 to the
difference between group 1 and the overall mean, and so on), rather
than being read literally (e.g. "I want to compare level A vs. levels
B & C").

From the latter point of view I agree that the same contrasts on
factors with different numbers of levels are not really meaningful.
From the former point of view I still see a benefit: if I drop a level,
I may still be interested in comparing the overall mean with the group
means, bearing in mind that some groups may no longer be present in the
data set. Do you see my point?

However, it is not the biggest issue, as one can change the contrasts
rather easily oneself. I think at least some information or a warning
should be issued that the old contrasts are no longer used.

KR,
-Thorn
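The two rules proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the base R implementation: the function name droplevels_keep_contrasts is hypothetical, and it relies on contrasts being stored either by name (a character string) or as a numeric matrix.

```r
# Hypothetical helper, not part of base R.
droplevels_keep_contrasts <- function(x) {
    stopifnot(is.factor(x))
    co <- attr(x, "contrasts")   # NULL, a function name, or a matrix
    res <- factor(x)             # drops unused levels (and the contrasts)

    # Nothing to restore, or too few levels left to carry contrasts.
    if (is.null(co) || nlevels(res) < 2L)
        return(res)

    if (is.character(co)) {
        # Rule 2: contrasts given by function name are re-attached and
        # re-computed for the new number of levels when they are used.
        contrasts(res) <- co
    } else {
        # Rule 1: a matrix no longer matches the levels; warn and fall
        # back to the defaults from options("contrasts").
        warning("matrix contrasts dropped; default contrasts will be used")
    }
    res
}

# Usage with made-up data: named contrasts survive the level drop.
f <- C(factor(c("a", "b", "c"), levels = c("a", "b", "c", "d")), sum)
g <- droplevels_keep_contrasts(f)
attr(g, "contrasts")
```

Whether named contrasts should be carried over silently, or also with a message, is exactly the design question under discussion here.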