inconsistent handling of factor, character, and logical predictors in lm()
Functions like lm() treat logical predictors as factors, *not* as
numerical variables. Not quite. A factor with all elements the same causes lm() to give an error while a logical of all TRUEs or all FALSEs just omits it from the model (it gets a coefficient of NA). This is a fairly common situation when you fit models to subsets of a big data.frame. This is an argument for fixing the single-valued-factor problem, which would become more noticeable if logicals were treated as factors. > d <- data.frame(Age=c(2,4,6,8,10), Weight=c(878, 890, 930, 800, 750), Diseased=c(FALSE,FALSE,FALSE,TRUE,TRUE))
coef(lm(data=d, Weight ~ Age + Diseased))
(Intercept) Age DiseasedTRUE
877.7333 5.4000 -151.3333
coef(lm(data=d, Weight ~ Age + factor(Diseased)))
(Intercept) Age factor(Diseased)TRUE
877.7333 5.4000 -151.3333
coef(lm(data=d, Weight ~ Age + Diseased, subset=Age<7))
(Intercept) Age DiseasedTRUE
847.3333 13.0000 NA
coef(lm(data=d, Weight ~ Age + factor(Diseased), subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels
coef(lm(data=d, Weight ~ Age + factor(Diseased, levels=c(FALSE,TRUE)),
subset=Age<7)) Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels Bill Dunlap TIBCO Software wdunlap tibco.com
On Sat, Aug 31, 2019 at 8:54 AM Fox, John <jfox at mcmaster.ca> wrote:
Dear Abby,
On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdle.a at gmail.com> wrote:
I think that it would be better to handle factors, character
predictors, and logical predictors consistently.
"logical predictors" can be regarded as categorical or continuous (i.e.
0 or 1).
And the model matrix should be the same, either way.
I think that you're mistaking a coincidence for a principle. The coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE. Functions like lm() treat logical predictors as factors, *not* as numerical variables. That one would get the same coefficient in either case is a consequence of the coincidence and the fact that the default contrasts for unordered factors are contr.treatment(). For example, if you changed the contrasts option, you'd get a different estimate (though of course a model with the same fit to the data and an equivalent interpretation): ------------ snip --------------
options(contrasts=c("contr.sum", "contr.poly"))
m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
m3
Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
data = iris)
Coefficients:
(Intercept) Sepal.Width I(Species == "setosa")1
2.6672 0.9418 0.8898
head(model.matrix(m3))
(Intercept) Sepal.Width I(Species == "setosa")1 1 1 3.5 -1 2 1 3.0 -1 3 1 3.2 -1 4 1 3.1 -1 5 1 3.6 -1 6 1 3.9 -1
tail(model.matrix(m3))
(Intercept) Sepal.Width I(Species == "setosa")1 145 1 3.3 1 146 1 3.0 1 147 1 2.5 1 148 1 3.0 1 149 1 3.4 1 150 1 3.0 1
lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"),
data=iris)
Call:
lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species ==
"setosa"), data = iris)
Coefficients:
(Intercept) Sepal.Width
as.numeric(Species == "setosa")
3.5571 0.9418
-1.7797
-2*coef(m3)[3]
I(Species == "setosa")1
-1.779657
------------ snip --------------
I think the first question to be asked is, which is the best approach, categorical or continuous? The continuous approach seems simpler and more efficient to me, but output from the categorical approach may be more intuitive, for some people.
I think that this misses the point I was trying to make: lm() et al. treat logical variables as factors, not as numerical predictors. One could argue about what's the better approach but not about what lm() does. BTW, I prefer treating a logical predictor as a factor because the predictor is essentially categorical.
I note that the use factors and characters, doesn't necessarily produce consistent output, for $xlevels. (Because factors can have their levels re-ordered).
Again, this misses the point: Both factors and character predictors produce elements in $xlevels; logical predictors do not, even though they are treated in the model as factors. That factors have levels that aren't necessarily ordered alphabetically is a reason that I prefer using factors to using character predictors, but this has nothing to do with the point I was trying to make about $xlevels. Best, John ------------------------------------------------- John Fox, Professor Emeritus McMaster University Hamilton, Ontario, Canada Web: http::/socserv.mcmaster.ca/jfox
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel