I have a data frame with a numeric ("TrtTime") and a categorical
("Lifestage") predictor.
Level "L1" of Lifestage occurs only with a single value of TrtTime,
explicitly 12, whence it is not possible to estimate a TrtTime "slope"
when Lifestage is "L1".
Indeed, when I fitted the model
fit <- glm(cbind(Dead,Alive) ~ TrtTime*Lifestage, family=binomial,
data=demoDat)
I got:
That is, TrtTime:LifestageL1 is NA, as expected.
I would have thought that fitted or predicted values corresponding to
Lifestage = "L1" would thereby be NA, but this is not the case:
That is, the predicted values on the scale of the linear predictor are
large and positive, rather than being NA.
What this amounts to, it seems to me, is saying that if the linear
predictor in a Binomial glm is NA, then "success" is a certainty.
This strikes me as being a dubious proposition. My gut feeling is that
misleading results could be produced.
Can anyone explain to me a rationale for this behaviour pattern?
Is there some justification for it that I am not currently seeing?
Any other comments? (Please omit comments to the effect of "You are as
thick as two short planks!". :-) )
I have attached the example data set in a file "demoDat.txt", should
anyone want to experiment with it. The file was created using dput() so
you should access it (if you wish to do so) via something like
demoDat <- dget("demoDat.txt")
Thanks for any enlightenment.
cheers,
Rolf Turner
Hi Rolf,
that's an interesting observation.
I agree that it is counter-intuitive that the fitted values / predictions are not NA.
However, I don't agree with your comment that <<if the linear predictor in a Binomial glm is NA, then "success" is a certainty>> - that seems to be a peculiarity of these data - note for these three observations there are thousands dead and zero alive and that you get similar outcomes if you omit TrtTime altogether ...
Cheers,
Andrew
--
Andrew Robinson
Chief Executive Officer, CEBRA and Professor of Biosecurity,
School/s of BioSciences and Mathematics & Statistics
University of Melbourne, VIC 3010 Australia
Tel: (+61) 0403 138 955
Email: apro at unimelb.edu.au
Website: https://researchers.ms.unimelb.edu.au/~apro at unimelb/
I acknowledge the Traditional Owners of the land I inhabit, and pay my respects to their Elders.
On 28 Jul 2022, 10:27 AM +1000, Rolf Turner <r.turner at auckland.ac.nz>, wrote:
External email: Please exercise caution
I have a data frame with a numeric ("TrtTime") and a categorical
("Lifestage") predictor.
Level "L1" of Lifestage occurs only with a single value of TrtTime,
explicitly 12, whence it is not possible to estimate a TrtTime "slope"
when Lifestage is "L1".
Indeed, when I fitted the model
fit <- glm(cbind(Dead,Alive) ~ TrtTime*Lifestage, family=binomial,
data=demoDat)
I got:
as.matrix(coef(fit))
[,1]
(Intercept) -0.91718302
TrtTime 0.88846195
LifestageEgg + L1 -45.36420974
LifestageL1 14.27570572
LifestageL1 + L2 -0.30332697
LifestageL3 -3.58672631
TrtTime:LifestageEgg + L1 8.10482459
TrtTime:LifestageL1 NA
TrtTime:LifestageL1 + L2 0.05662651
TrtTime:LifestageL3 1.66743472
That is, TrtTime:LifestageL1 is NA, as expected.
I would have thought that fitted or predicted values corresponding to
Lifestage = "L1" would thereby be NA, but this is not the case:
predict(fit)[demoDat$Lifestage=="L1"]
26 65 131
24.02007 24.02007 24.02007
fitted(fit)[demoDat$Lifestage=="L1"]
26 65 131
1 1 1
That is, the predicted values on the scale of the linear predictor are
large and positive, rather than being NA.
What this amounts to, it seems to me, is saying that if the linear
predictor in a Binomial glm is NA, then "success" is a certainty.
This strikes me as being a dubious proposition. My gut feeling is that
misleading results could be produced.
Can anyone explain to me a rationale for this behaviour pattern?
Is there some justification for it that I am not currently seeing?
Any other comments? (Please omit comments to the effect of "You are as
thick as two short planks!". :-) )
I have attached the example data set in a file "demoDat.txt", should
anyone want to experiment with it. The file was created using dput() so
you should access it (if you wish to do so) via something like
demoDat <- dget("demoDat.txt")
Thanks for any enlightenment.
cheers,
Rolf Turner
--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
Time is often used in this sort of problem, but really time is not relevant. A better choice is accumulated thermal units. The insect will molt when X thermal units have been accumulated. This is often expressed as degree days, but could as easily be other units like degree seconds. However, I suspect that fine time units exceeds the accuracy of the measurement system. A growth chamber might maintain 28 C, but the temperature the insect experiences might be somewhat different thereby introducing additional variability in the outcome. No thermal units accumulated, no development, so that is not an issue.
This approach allows one to predict life stage over a large temperature range. Accuracy can be improved if one knows the lower development threshold. At high temperatures development stops, and a mortality function can be added.
Tim
-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Rolf Turner
Sent: Wednesday, July 27, 2022 8:26 PM
To: r-help <r-help at r-project.org>
Subject: [R] Predicted values from glm() when linear predictor is NA.
[External Email]
I have a data frame with a numeric ("TrtTime") and a categorical
("Lifestage") predictor.
Level "L1" of Lifestage occurs only with a single value of TrtTime, explicitly 12, whence it is not possible to estimate a TrtTime "slope"
when Lifestage is "L1".
Indeed, when I fitted the model
fit <- glm(cbind(Dead,Alive) ~ TrtTime*Lifestage, family=binomial,
data=demoDat)
I got:
That is, TrtTime:LifestageL1 is NA, as expected.
I would have thought that fitted or predicted values corresponding to Lifestage = "L1" would thereby be NA, but this is not the case:
That is, the predicted values on the scale of the linear predictor are large and positive, rather than being NA.
What this amounts to, it seems to me, is saying that if the linear predictor in a Binomial glm is NA, then "success" is a certainty.
This strikes me as being a dubious proposition. My gut feeling is that misleading results could be produced.
Can anyone explain to me a rationale for this behaviour pattern?
Is there some justification for it that I am not currently seeing?
Any other comments? (Please omit comments to the effect of "You are as thick as two short planks!". :-) )
I have attached the example data set in a file "demoDat.txt", should anyone want to experiment with it. The file was created using dput() so you should access it (if you wish to do so) via something like
demoDat <- dget("demoDat.txt")
Thanks for any enlightenment.
cheers,
Rolf Turner
--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
I have a data frame with a numeric ("TrtTime") and a categorical
("Lifestage") predictor.
Level "L1" of Lifestage occurs only with a single value of TrtTime,
explicitly 12, whence it is not possible to estimate a TrtTime "slope"
when Lifestage is "L1".
Indeed, when I fitted the model
fit <- glm(cbind(Dead,Alive) ~ TrtTime*Lifestage, family=binomial,
data=demoDat)
I got:
That is, TrtTime:LifestageL1 is NA, as expected.
I would have thought that fitted or predicted values corresponding to
Lifestage = "L1" would thereby be NA, but this is not the case:
That is, the predicted values on the scale of the linear predictor are
large and positive, rather than being NA.
What this amounts to, it seems to me, is saying that if the linear
predictor in a Binomial glm is NA, then "success" is a certainty.
This strikes me as being a dubious proposition. My gut feeling is that
misleading results could be produced.
The NA is most likely caused by aliasing, so some other combination of
factors a perfect surrogate for every case with that level of the
interaction. The `predict.glm` function always requires a complete set
of values to construct a case. Whether apparent incremental linear
prediction of that interaction term is large or small will depend on the
degree of independent contribution of the surrogate levels of other
variables..
David.
Can anyone explain to me a rationale for this behaviour pattern?
Is there some justification for it that I am not currently seeing?
Any other comments? (Please omit comments to the effect of "You are as
thick as two short planks!". :-) )
I have attached the example data set in a file "demoDat.txt", should
anyone want to experiment with it. The file was created using dput() so
you should access it (if you wish to do so) via something like
demoDat <- dget("demoDat.txt")
Thanks for any enlightenment.
cheers,
Rolf Turner
Hi Rolf,
that's an interesting observation.
I agree that it is counter-intuitive that the fitted values /
predictions are not NA.
That gives me some comfort!
Does anyone from R-core feel like commenting?
However, I don't agree with your comment that <<if the linear
predictor in a Binomial glm is NA, then "success" is a certainty>> -
that seems to be a peculiarity of these data - note for these three
observations there are thousands dead and zero alive and that you get
similar outcomes if you omit TrtTime altogether ...
Hmm. Yeah. I was probably jumping the gun a bit there. I'll think on
this a bit more.
This issue arose in a more complicated context (which I won't try to go
into) in which I really need some insight into how to deal with missing
values of the linear predictor, and what the resulting fitted values
actually mean. Psigh!
cheers,
Rolf
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
Time is often used in this sort of problem, but really time is not
relevant. A better choice is accumulated thermal units. The insect
will molt when X thermal units have been accumulated. This is often
expressed as degree days, but could as easily be other units like
degree seconds. However, I suspect that fine time units exceeds the
accuracy of the measurement system. A growth chamber might maintain
28 C, but the temperature the insect experiences might be somewhat
different thereby introducing additional variability in the outcome.
No thermal units accumulated, no development, so that is not an
issue. This approach allows one to predict life stage over a large
temperature range. Accuracy can be improved if one knows the lower
development threshold. At high temperatures development stops, and a
mortality function can be added.
Very cogent comments in respect of dealing with the underlying
practical problem, but I am not so much concerned with the practical
problem at the moment but rather with the workings of the software that
I am using.
cheers,
Rolf
P.S. I am at several removes from the data set(s) that I am messing
about with. But if my understanding is correct (always an assumption
of which to be sceptical!) these data were collected with the
temperature being held *constant*, whence time and accumulated thermal
units would be equivalent. Is it not so?
R.
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
David Winsemius <dwinsemius at comcast.net> wrote:
<SNIP>
The NA is most likely caused by aliasing, so some other combination
of factors a perfect surrogate for every case with that level of the
interaction.
<SNIP>
No, I think it's much simpler than that. Essentially it boils down to
the fact that if y is 1:10 and x is rep(1,10) then
lm(y ~ x)
gives a slope estimate of NA. As it clearly should.
cheers,
Rolf
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276
Sadly, I see that your practical options are limited. In regards to time and thermal units being equivalent if temperature is constant. The practical answer is yes. The technical answer is no because the growth chamber is not able to keep temperature constant and the insect can moderate the experienced temperature by moving to different parts of the plant (assuming this used plants and not artificial diet or leaf disks). Temperature might be more constant if these are mosquito larvae in vats of water. The water would have thermal mass to even out temperature fluctuations in the growth chamber.
Often people assume that a growth chamber set to 25C is always 25C. Data show this is not the case even in high end or custom made growth chambers.
Using time as a surrogate for accumulated thermal units under constant temperature will introduce some error in time due to non-uniform and non-constant temperatures within the growth chamber. Mostly the data are not present to document this let alone include this in the model. So we take the biologists view of statistics and ignore the problem that we cannot solve.
If I measure time in days, then it makes sense that I can have egg hatch at time 0. However, it is not biologically possible to have anything happen in zero time. If my accuracy in time measurement is "days" then maybe I should consider introducing some small time value for egg hatch. Say eggs hatch at time 0.2 days. That is below the resolution of my data but reconciles the biological impossibility of anything happening in zero time. I note that time is integer in the data set.
Another odd thing in the data. I assume that the values for "alive" represent the number of living individuals at different time intervals. At the end of the "Alive" data I note that there are a few time intervals where none are alive followed by a living individual. I am not a fan of zombies, or raising the dead (insects). Either the definition of dead is more Monty Python (https://www.youtube.com/watch?v=Jdf5EXo6I68), or I don't quite understand the data.
Tim
-----Original Message-----
From: Rolf Turner <r.turner at auckland.ac.nz>
Sent: Wednesday, July 27, 2022 10:10 PM
To: Ebert,Timothy Aaron <tebert at ufl.edu>
Cc: r-help <r-help at r-project.org>
Subject: Re: [R] Predicted values from glm() when linear predictor is NA.
[External Email]
On Thu, 28 Jul 2022 00:42:51 +0000
"Ebert,Timothy Aaron" <tebert at ufl.edu> wrote:
Time is often used in this sort of problem, but really time is not
relevant. A better choice is accumulated thermal units. The insect
will molt when X thermal units have been accumulated. This is often
expressed as degree days, but could as easily be other units like
degree seconds. However, I suspect that fine time units exceeds the
accuracy of the measurement system. A growth chamber might maintain
28 C, but the temperature the insect experiences might be somewhat
different thereby introducing additional variability in the outcome.
No thermal units accumulated, no development, so that is not an issue.
This approach allows one to predict life stage over a large
temperature range. Accuracy can be improved if one knows the lower
development threshold. At high temperatures development stops, and a
mortality function can be added.
Very cogent comments in respect of dealing with the underlying practical problem, but I am not so much concerned with the practical problem at the moment but rather with the workings of the software that I am using.
cheers,
Rolf
P.S. I am at several removes from the data set(s) that I am messing about with. But if my understanding is correct (always an assumption of which to be sceptical!) these data were collected with the temperature being held *constant*, whence time and accumulated thermal units would be equivalent. Is it not so?
R.
--
Honorary Research Fellow
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276