Dear Jeff, I'm not sure that pursuing this further is sensible, since I don't think that we disagree about anything of consequence, but I'll take one more crack at explaining why I think lm()'s behaviour is reasonable, if not the only approach:
On 2022-07-28 12:50 p.m., Jeff Newmiller wrote:
Yes, I am familiar with linear algebra. I nod along with you until you get to "hence" and then you make a leap. The user made the decision (or accepted the default) that the rank problem be ignored... they have the option to reduce their model, but they wanted this coefficient (communicated via the model and the parameter) to behave as if it were zero. The issue is that that decision is getting hidden behind this additional implementation decision to exclude certain columns in the model matrix _even after the model has been solved_.
But the model is "solved" by reducing the columns of the model matrix to a linearly independent set, providing a basis for the column space of the original rank-deficient model matrix.
If they try to use the coefficients themselves, then they have to apply this odd interpretation of NA that is embedded in predict.glm but not apparent in their model, instead of using a simple newdata %*% coefs.
One could do newdata %*% coef(model, complete=FALSE).
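To make that concrete, here is a minimal sketch with made-up collinear data (not the demoDat example from this thread): the reduced-coefficient route via coef(..., complete = FALSE) and the NA-as-zero route give the same predictions, and both reproduce fitted().

```r
## Toy sketch (hypothetical data): x2 is exactly 2*x1, so it is aliased
## and its coefficient is reported as NA.
set.seed(1)
dat <- data.frame(x1 = 1:10)
dat$x2 <- 2 * dat$x1
dat$y  <- 1 + dat$x1 + rnorm(10)
fit <- lm(y ~ x1 + x2, data = dat)
coef(fit)   # coefficient for x2 is NA

## Route 1: drop the aliased column, use the reduced coefficient vector.
X.reduced <- model.matrix(fit)[, !is.na(coef(fit))]
p1 <- drop(X.reduced %*% coef(fit, complete = FALSE))

## Route 2: keep the full model matrix, treat the NA as zero.
b <- coef(fit)
b[is.na(b)] <- 0
p2 <- drop(model.matrix(fit) %*% b)

all.equal(unname(p1), unname(fitted(fit)))  # TRUE
all.equal(unname(p1), unname(p2))           # TRUE
```

(The complete argument to coef() requires R >= 3.5.0.)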
It may be easier to implement summary.glm with an NA there as a hint to the rank deficiency, but at the cost of mathematical inconsistency in the reporting of coefficients. IMO either the NA should always be presented to the user as if it were a zero or the rank deficiency should be recorded separately.
There are advantages and disadvantages to reducing the model matrix to full column rank. The strategy you advocate -- to retain the deficient-rank model matrix but constrain the coefficients by setting one arbitrarily to 0 -- is coherent, and is similar, if I remember right, to what's implemented in the linear model procedure in SAS. This is really six of one and a half-dozen of the other. Best, John
On July 28, 2022 8:46:13 AM PDT, John Fox <jfox at mcmaster.ca> wrote:
Dear Jeff, On 2022-07-28 11:12 a.m., Jeff Newmiller wrote:
No, in this case I think I needed the "obvious" breakdown. Still digesting, though... I would prefer that, if an arbitrary selection has been made, it be made explicit: the NA should be replaced with zero if the singular.ok argument is TRUE, rather than making that interpretation in predict.glm.
That's one way to think about it, but another is that the model matrix X has 10 columns but is of rank 9. Thus 9 basis vectors are needed to span the column space of X, and a simple way to provide a basis is to eliminate a redundant column, hence the NA. The fitted values y-hat in a linear model are the orthogonal projection of y onto the space spanned by the columns of X, and are thus independent of the basis chosen. A GLM is a little more complicated, but it's still the column space of X that's important. Best, John
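The basis-independence is easy to check with a toy example (hypothetical data, not demoDat): listing the collinear pair in either order changes which coefficient is aliased, but not the column space, and so not the fitted values.

```r
## Sketch with hypothetical collinear data: x2 is exactly 2*x1.
dat <- data.frame(x1 = 1:10)
dat$x2 <- 2 * dat$x1
dat$y  <- c(0, 0, 1, 0, 1, 0, 1, 1, 0, 1)   # made-up binary response

## Either order gives the same column space; only the aliased (NA)
## coefficient differs.
f1 <- glm(y ~ x1 + x2, family = binomial, data = dat)  # x2 aliased
f2 <- glm(y ~ x2 + x1, family = binomial, data = dat)  # x1 aliased

all.equal(fitted(f1), fitted(f2))  # TRUE: same projection target
names(which(is.na(coef(f1))))      # "x2"
names(which(is.na(coef(f2))))      # "x1"
```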
On July 28, 2022 5:45:35 AM PDT, John Fox <jfox at mcmaster.ca> wrote:
Dear Jeff, On 2022-07-28 1:31 a.m., Jeff Newmiller wrote:
But "disappearing" is not what NA is supposed to do normally. Why is it being treated that way here?
NA has a different meaning here than in data. By default, in glm() the argument singular.ok is TRUE, and so estimates are provided even when there are singularities, and even though the singularities are resolved arbitrarily. In this model, the columns of the model matrix labelled LifestageL1 and TrtTime:LifestageL1 are perfectly collinear -- the second is 12 times the first (both have 0s in the same rows and either 1 or 12 in three of the rows) -- and thus both can't be estimated simultaneously, but the model can be estimated by eliminating one or the other (effectively setting its coefficient to 0), or by taking any linear combination of the two regressors (i.e., using any regressor with 0s and some other value). The fitted values under the model are invariant with respect to this arbitrary choice. My apologies if I'm stating the obvious and misunderstand your objection. Best, John
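One way to see this kind of redundancy directly (again with a made-up stand-in for the LifestageL1 vs. TrtTime:LifestageL1 pair, since demoDat isn't reproduced here) is to compare the number of columns of the model matrix with its rank:

```r
## Toy check (hypothetical data): x2 is an exact multiple of x1, standing
## in for the perfectly collinear pair in the demoDat model.
dat <- data.frame(x1 = 1:10)
dat$x2 <- 12 * dat$x1
dat$y  <- c(0, 1, 0, 1, 1, 0, 1, 0, 1, 1)
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

X <- model.matrix(fit)
ncol(X)       # 3 columns ...
qr(X)$rank    # ... but rank 2: one column is redundant
names(which(is.na(coef(fit))))  # "x2", the arbitrarily eliminated column
```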
On July 27, 2022 7:04:20 PM PDT, John Fox <jfox at mcmaster.ca> wrote:
Dear Rolf, The coefficient of TrtTime:LifestageL1 isn't estimable (as you explain) and by setting it to NA, glm() effectively removes it from the model. An equivalent model is therefore
fit2 <- glm(cbind(Dead, Alive) ~ TrtTime + Lifestage +
                I((Lifestage == "Egg + L1")*TrtTime) +
                I((Lifestage == "L1 + L2")*TrtTime) +
                I((Lifestage == "L3")*TrtTime),
            family=binomial, data=demoDat)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
cbind(coef(fit, complete=FALSE), coef(fit2))
                                  [,1]         [,2]
(Intercept)                -0.91718302  -0.91718302
TrtTime                     0.88846195   0.88846195
LifestageEgg + L1         -45.36420974 -45.36420974
LifestageL1                14.27570572  14.27570572
LifestageL1 + L2           -0.30332697  -0.30332697
LifestageL3                -3.58672631  -3.58672631
TrtTime:LifestageEgg + L1   8.10482459   8.10482459
TrtTime:LifestageL1 + L2    0.05662651   0.05662651
TrtTime:LifestageL3         1.66743472   1.66743472
There is no problem computing fitted values for the model, specified either way. That the fitted values when Lifestage == "L1" all round to 1 on the probability scale is coincidental -- that is, a consequence of the data.
I hope this helps,
 John
On 2022-07-27 8:26 p.m., Rolf Turner wrote:
I have a data frame with a numeric ("TrtTime") and a categorical
("Lifestage") predictor.
Level "L1" of Lifestage occurs only with a single value of TrtTime,
explicitly 12, whence it is not possible to estimate a TrtTime "slope"
when Lifestage is "L1".
Indeed, when I fitted the model
fit <- glm(cbind(Dead,Alive) ~ TrtTime*Lifestage, family=binomial,
data=demoDat)
I got:
as.matrix(coef(fit))
[,1]
(Intercept) -0.91718302
TrtTime 0.88846195
LifestageEgg + L1 -45.36420974
LifestageL1 14.27570572
LifestageL1 + L2 -0.30332697
LifestageL3 -3.58672631
TrtTime:LifestageEgg + L1 8.10482459
TrtTime:LifestageL1 NA
TrtTime:LifestageL1 + L2 0.05662651
TrtTime:LifestageL3 1.66743472
That is, TrtTime:LifestageL1 is NA, as expected. I would have thought that fitted or predicted values corresponding to Lifestage = "L1" would thereby be NA, but this is not the case:
predict(fit)[demoDat$Lifestage=="L1"]
26 65 131
24.02007 24.02007 24.02007
fitted(fit)[demoDat$Lifestage=="L1"]
26 65 131
1 1 1
That is, the predicted values on the scale of the linear predictor are
large and positive, rather than being NA.
What this amounts to, it seems to me, is saying that if the linear
predictor in a Binomial glm is NA, then "success" is a certainty.
This strikes me as being a dubious proposition. My gut feeling is that
misleading results could be produced.
Can anyone explain to me a rationale for this behaviour pattern?
Is there some justification for it that I am not currently seeing?
Any other comments? (Please omit comments to the effect of "You are as
thick as two short planks!". :-) )
I have attached the example data set in a file "demoDat.txt", should
anyone want to experiment with it. The file was created using dput() so
you should access it (if you wish to do so) via something like
demoDat <- dget("demoDat.txt")
Thanks for any enlightenment.
cheers,
Rolf Turner
______________________________________________ R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
John Fox, Professor Emeritus McMaster University Hamilton, Ontario, Canada web: https://socialsciences.mcmaster.ca/jfox/