inconsistent handling of factor, character, and logical predictors in lm() - R-devel

John Fox

Sat, Aug 31, 2019 10:42 AM

Dear Bill,

Thanks for pointing this difference out -- I was unaware of it.

I think that the difference occurs in model.matrix.default(), which coerces character variables but not logical variables to factors. Later it treats both factors and logical variables as "factors" in that it applies contrasts to both, but unused factor levels are dropped while an unused logical level is not.

I don't see why logical variables shouldn't be treated just as character variables are currently, both with respect to single levels (whether this is considered an error or as collinear with the intercept and thus gets an NA coefficient) and with respect to $levels.
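As a rough illustration of the point above, the following sketch builds the model matrix for the same two-valued predictor stored as a factor, a character vector, and a logical vector. The data frame `d0` and its column names are made up for illustration. The comparison only checks that all three storage types yield the same dummy column under the default treatment contrasts.

```r
# Hypothetical toy data: one two-valued predictor stored three ways.
d0 <- data.frame(
  f  = factor(c("a", "b", "a", "b", "a", "b")),
  ch = c("a", "b", "a", "b", "a", "b"),
  lg = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Pin the default contrasts so the comparison does not depend on session options.
old <- options(contrasts = c("contr.treatment", "contr.poly"))
mm_f  <- model.matrix(~ f,  data = d0)
mm_ch <- model.matrix(~ ch, data = d0)
mm_lg <- model.matrix(~ lg, data = d0)
options(old)  # restore the previous setting

# Column names differ by storage type, but the dummy column itself does not.
colnames(mm_f)
colnames(mm_ch)
colnames(mm_lg)

same_cols <- isTRUE(all.equal(unname(mm_f[, 2]), unname(mm_ch[, 2]))) &&
  isTRUE(all.equal(unname(mm_f[, 2]), unname(mm_lg[, 2])))
same_cols
```

The sketch deliberately avoids the single-level case, where the three types diverge as described in the message quoted below.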

Best,
 John

On Aug 31, 2019, at 1:21 PM, William Dunlap via R-devel <r-devel at r-project.org> wrote:

Functions like lm() treat logical predictors as factors, *not* as
numerical variables.

Not quite.  A factor with all elements the same causes lm() to give an
error while a logical of all TRUEs or all FALSEs just omits it from the
model (it gets a coefficient of NA).  This is a fairly common situation
when you fit models to subsets of a big data.frame.  This is an argument
for fixing the single-valued-factor problem, which would become more
noticeable if logicals were treated as factors.

d <- data.frame(Age=c(2,4,6,8,10), Weight=c(878, 890, 930, 800, 750),
                Diseased=c(FALSE,FALSE,FALSE,TRUE,TRUE))

coef(lm(data=d, Weight ~ Age + Diseased))

(Intercept)          Age DiseasedTRUE
   877.7333       5.4000    -151.3333

coef(lm(data=d, Weight ~ Age + factor(Diseased)))

        (Intercept)                  Age factor(Diseased)TRUE
           877.7333               5.4000            -151.3333

coef(lm(data=d, Weight ~ Age + Diseased, subset=Age<7))

(Intercept)          Age DiseasedTRUE
   847.3333      13.0000           NA

coef(lm(data=d, Weight ~ Age + factor(Diseased), subset=Age<7))

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
 contrasts can be applied only to factors with 2 or more levels

coef(lm(data=d, Weight ~ Age + factor(Diseased, levels=c(FALSE,TRUE)),
        subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
 contrasts can be applied only to factors with 2 or more levels

Bill Dunlap
TIBCO Software
wdunlap tibco.com


On Sat, Aug 31, 2019 at 8:54 AM Fox, John <jfox at mcmaster.ca> wrote:

Dear Abby,

On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdle.a at gmail.com> wrote:

I think that it would be better to handle factors, character
predictors, and logical predictors consistently.

"logical predictors" can be regarded as categorical or continuous (i.e.
0 or 1).
And the model matrix should be the same, either way.

I think that you're mistaking a coincidence for a principle. The
coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE.
Functions like lm() treat logical predictors as factors, *not* as numerical
variables.

That one would get the same coefficient in either case is a consequence of
the coincidence and the fact that the default contrasts for unordered
factors are contr.treatment(). For example, if you changed the contrasts
option, you'd get a different estimate (though of course a model with the
same fit to the data and an equivalent interpretation):

------------ snip --------------

options(contrasts=c("contr.sum", "contr.poly"))
m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
m3

Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
   data = iris)

Coefficients:
           (Intercept)              Sepal.Width  I(Species == "setosa")1
                2.6672                   0.9418                   0.8898

head(model.matrix(m3))

 (Intercept) Sepal.Width I(Species == "setosa")1
1           1         3.5                      -1
2           1         3.0                      -1
3           1         3.2                      -1
4           1         3.1                      -1
5           1         3.6                      -1
6           1         3.9                      -1

tail(model.matrix(m3))

   (Intercept) Sepal.Width I(Species == "setosa")1
145           1         3.3                       1
146           1         3.0                       1
147           1         2.5                       1
148           1         3.0                       1
149           1         3.4                       1
150           1         3.0                       1

lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"),
   data=iris)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species ==
   "setosa"), data = iris)

Coefficients:
                    (Intercept)                      Sepal.Width
                         3.5571                           0.9418
as.numeric(Species == "setosa")
                        -1.7797

-2*coef(m3)[3]

I(Species == "setosa")1
             -1.779657

------------ snip --------------
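The claim that changing the contrasts changes the estimate but not the fit can be checked directly. The sketch below refits the same iris model under treatment and sum contrasts, then compares the fitted values and the relation between the two coefficients. The object names `m_trt` and `m_sum` are illustrative and do not appear in the original session.

```r
# Fit the same model under two contrast settings, restoring options afterwards.
old <- options(contrasts = c("contr.treatment", "contr.poly"))
m_trt <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data = iris)

options(contrasts = c("contr.sum", "contr.poly"))
m_sum <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data = iris)
options(old)

# The two parametrizations describe the same fitted model.
same_fit <- isTRUE(all.equal(fitted(m_trt), fitted(m_sum)))

# With two levels, the treatment-contrast coefficient is -2 times the
# sum-contrast coefficient (the +1/-1 coding spans two units, reversed in sign).
slope_rel <- isTRUE(all.equal(unname(coef(m_trt)[3]),
                              unname(-2 * coef(m_sum)[3])))

c(same_fit = same_fit, slope_rel = slope_rel)
```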

I think the first question to be asked is, which is the best approach,
categorical or continuous?
The continuous approach seems simpler and more efficient to me, but
output from the categorical approach may be more intuitive, for some
people.

I think that this misses the point I was trying to make: lm() et al. treat
logical variables as factors, not as numerical predictors. One could argue
about what's the better approach but not about what lm() does. BTW, I
prefer treating a logical predictor as a factor because the predictor is
essentially categorical.

I note that the use of factors and characters doesn't necessarily
produce consistent output for $xlevels
(because factors can have their levels re-ordered).

Again, this misses the point: Both factors and character predictors
produce elements in $xlevels; logical predictors do not, even though they
are treated in the model as factors. That factors have levels that aren't
necessarily ordered alphabetically is a reason that I prefer using factors
to using character predictors, but this has nothing to do with the point I
was trying to make about $xlevels.
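A small sketch of the $xlevels point, using made-up data (`d1` and its columns are hypothetical): the same two-valued predictor is supplied as a factor, as a character vector, and as a logical vector, and the $xlevels component of each fit is inspected. Under the behaviour described in this thread the logical fit would have no entry; that is printed rather than asserted, since other R versions may behave differently.

```r
# Hypothetical toy data: one response and one predictor stored three ways.
d1 <- data.frame(
  y  = c(1.2, 2.3, 2.9, 4.1, 5.2, 5.8),
  f  = factor(c("a", "b", "a", "b", "a", "b")),
  ch = c("a", "b", "a", "b", "a", "b"),
  lg = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# $xlevels records the levels of categorical predictors used in the fit.
xl_f  <- lm(y ~ f,  data = d1)$xlevels
xl_ch <- lm(y ~ ch, data = d1)$xlevels
xl_lg <- lm(y ~ lg, data = d1)$xlevels

str(xl_f)   # entry for the factor predictor
str(xl_ch)  # entry for the character predictor
str(xl_lg)  # inspect what, if anything, is recorded for the logical predictor
```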

Best,
John

 -------------------------------------------------
 John Fox, Professor Emeritus
 McMaster University
 Hamilton, Ontario, Canada
 Web: http://socserv.mcmaster.ca/jfox

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
