Hi David,
I wrote that package, so yes, I am very familiar with it :) :)
buildbam is based on the explained deviance. A term is included if the
explained deviance with the term is higher than the explained deviance
without it. This means that buildbam will favor models that contain (too?)
many effects, as effects that model only noise but still explain perhaps
0.0001% more deviance due to chance will still be included. To be honest
with you, I have come to believe that the buildbam function was a mistake
in the first place. I had coded buildgam already and buildbam was an easy
extension, and it was only later that I realized that bam using PQL meant
that only the explained deviance would be a valid model-comparison
criterion. But note that the explained deviance is not actually a _formal_
criterion, like the likelihood-ratio test is (which uses a well-known
result that differences in log-likelihood are asymptotically
chi-square-distributed) or like AIC or BIC (which are respectively based on
information theory and on Bayesian inference). I had wanted to remove
buildbam for a long time, but this would be wrong of me to do because now
that it's out there people may be using it, and given the provided caveat
that it is can only use the explained deviance, they may actually have a
use case for it. Your message makes me think very seriously about
officially deprecating this function in buildmer's next release, or at
least issue a warning that explained deviance is not a formal criterion and
will favor overfitted models. In fact, I should deprecate that as well -- I
had only put it in for buildbam's use.
If buildbam and bam(select=TRUE) conflict, I would definitely trust
bam(select=TRUE) over buildbam, as mgcv's automatic smoothness selection is
a method derived from first principles by extending the approach used to
fit smooth terms anyway in a very straightforward way. On the other hand,
the explained deviance is completely ad hoc and has no formal
justification. I'll just bite the bullet and deprecate buildbam, advising
people to use buildgam or select=TRUE instead.
Thanks for helping me bite that bullet,
Cesko
Op 4-2-2020 om 10:51 schreef David Villegas R?os:
Thanks Cesko.
I tried your suggestion (using the select argument) and got to a
reasonable optimal model in which I keep all the main effects and
interactions except for one main effect.
I tried, however, to do model selection using the buildbam function
(buildmer library) and I get a different optimal model, where all effects
are highly significant. I?m not sure why. According to the buildmer package
documentation:
"As bam uses PQL, only crit='deviance' is supported."
where
crit=Character string or vector determining the criterion used to test
terms for elimination. Possible options are 'LRT' (likelihood-ratio test;
this is the default), 'LL' (use the raw -2 log likelihood), 'AIC' (Akaike
Information Criterion), 'BIC' (Bayesian Information Criterion), and
'deviance' (explained deviance ? note that this is not a formal test)
I?m not sure if you are familiar with buildmer package though.
Thanks,
David
El lun., 3 feb. 2020 a las 13:09, Voeten, C.C. (<
c.c.voeten at hum.leidenuniv.nl <mailto:c.c.voeten at hum.leidenuniv.nl>>)
escribi?:
Hi David,
Please keep the mailing list in cc.
Yes, gam() can indeed be prohibitively slow, hence why I suggested
the select=TRUE approach. I hope the below explanation helps.
Smooth terms are composed of a null space and a range space, which
respectively represent the completely smooth part of the effect and the
wiggly part of the effect. In the mathematical representation, these are
equivalent to fixed effects and random effects, respectively. As such, the
range space is subject to penalization and can shrink to zero. In GAM
terms, this would correspond to a smoothing parameter tending to infinity,
resulting in a completely smooth effect: you would be left only with the
null space. What select=TRUE does is add an additional penalty to all null
spaces, so that these too can shrink towards zero. Effects which are shrunk
to (near) zero in this way can then be removed from the model by the user.
The degree of shrinkage can be seen by looking at the edf. Smooths
that are (near) zero will also use (near) zero edf. In practice, when I do
model selection using select=TRUE, I always remove terms whose edf < 1.
Then I fit the model again with select=TRUE to ensure that no further
reduction is necessary, and then I fit a final model with select=FALSE.
Best,
Cesko
-----Oorspronkelijk bericht-----
Van: David Villegas R?os <chirleu at gmail.com <mailto:
Verzonden: maandag 3 februari 2020 11:28
Aan: Voeten, C.C. <c.c.voeten at hum.leidenuniv.nl <mailto:
c.c.voeten at hum.leidenuniv.nl>>
Onderwerp: Re: [R-sig-ME] bam model selection with 3 million data
Thanks Cesko, really appreciate your answer.
In my particular case however, I cannot run gam models (they take
forever to run) so I?m still struggling on how to perform model selection
in bam. I tried select=TRUE and the summary changed a bit, but I don?t
really know what this argument is doing behind the scenes, or whether I
should trust the summary from the model using select=TRUE rather than the
default select=FALSE.
Best wishes,
David
El s?b., 1 feb. 2020 a las 20:39, Voeten, C.C. (<mailto:
c.c.voeten at hum.leidenuniv.nl <mailto:c.c.voeten at hum.leidenuniv.nl>>)
escribi?:
Hi David,
1) You cannot perform likelihood-based model comparisons with bam
models, or -- for completeness' sake -- with gam models that were fitted
using performance iteration or the EFS optimizer. All of these are based on
PQL (penalized quasi-likelihood), which makes the log-likelihood (and hence
LRT, AIC, BIC, etc) invalid for comparison purposes. See Wood
(2017:149-151). gam() with the default outer iteration should be fine,
though. Have you tried fitting your full model using bam with the
select=TRUE argument to turn on mgcv's automatic smooth-term selection?
2) I am unsure if the deviance explained is or is not suitable for
indicating effect size, so I can't comment on this question. I might,
however, have an alternative suggestion: have you considered partial eta
squared or partial omega squared? You should be able to calculate those
based on the ANOVA table.
3) I agree with you that the warning suggests complete separation,
but in my experience this doesn't automatically have to be a problem. Have
you checked the summary for extremely large beta values, and also have you
run gam.check() to see if your fit looks reasonable? If neither indicates a
problem I wouldn't be too concerned about it.
Hope this helps,
Cesko
P.S.: please send messages in plain text only, as you can see the
formatting of your message was slightly screwed up because the mailing list
automatically strips HTML markup
-----Oorspronkelijk bericht-----
Van: R-sig-mixed-models <mailto:
r-sig-mixed-models-bounces at r-project.org <mailto:
r-sig-mixed-models-bounces at r-project.org>> Namens David Villegas R?os
Verzonden: zaterdag 1 februari 2020 19:57
Aan: r-sig-mixed-models <mailto:r-sig-mixed-models at r-project.org
<mailto:r-sig-mixed-models at r-project.org>>
Onderwerp: [R-sig-ME] bam model selection with 3 million data
Dear list,
I?m investigating the effect of three variables (X, Y, Z) on the
probability that an animal uses a particular habitat A. I have a time
series of relocations for each animal (>300 individuals), with one
relocation every 30 minutes. There are only two options for the response
variable: 1=present in habitat A, 0=not present in habitat A. The
effects of the three variables are expected to be non-linear so I?m using
gam models. My dataset is very large, with >3 million data points so I?m
using the bam function from the mgcv library in R. In my models I include a
random effect ?individual ID?, and a temporal autocorrelation term that
corrects much but not all of the autocorrelation in the models.
*Question 1.*
When I run a model with the three main effects (X, Y, Z) and the
three double interactions (X:Y, X:Z, Y:Z), I get that all terms are highly
significant, except for one interaction. If I remove it, then everything is
highly significant. However, I also wanted to run simpler models with only
one interaction, no interactions, only two main effects and only one main
effect. Then, if I compare all these models with AIC or BIC, I get that the
best model (by far) is the one with only main effects.
AIC(codcoaAR2,codcoaAR2.1,codcoaAR2.2,codcoaAR2.3,codcoaAR2.4,codcoaAR2.5,codcoaAR2.6,codcoaAR2.7,codcoaAR2.8,codcoaAR2.9,codcoaAR)
df AIC
codcoaAR2 306.1310 -1442543
codcoaAR2.1 293.1608 -1440642
codcoaAR2.2 292.9615 -1438219
codcoaAR2.3 294.3657 -1435346
codcoaAR2.4 284.0026 -1434286
codcoaAR2.5 280.3472 -1396765
codcoaAR2.6 279.6380 -1435862
codcoaAR2.7 269.4968 -1377806
codcoaAR2.8 269.0480 -1393897
codcoaAR2.9 281.8584 -1214270
codcoaAR 271.7066 -2353481 # model with only main effects
I wonder how this is possible if two of the interactions are highly
So my underlying question is: *for a model like this in which sample
size is huge, should I make model selection looking at the significance of
the different terms in the model, or should I rather look at AIC/BIC?*
*Question 2.*
Let?s assume the model with only main effects is indeed the optimal
Then I?d like to get the effect size of each explanatory variable.
It?s not clear to me how to do it even after reading some post on this and
other forums, but I tried to figure it out by sequentially running the
model without one explanatory variable at a time, and then comparing the
deviance explained in the optimal model with X, Y, Z with the deviance
explained with the reduced model with only Y and Z, for instance. Assuming
that the difference would the variance explained by X. *Is this correct?
*Looking at the results, the deviance explained by each variable X, Y, Z is
quite low, but if the three main effects explain so little variance, who is
explaining the rest?
Model
Deviance explained
X, Z, Y
69.3%
Y, Z
68.5%
X, Z
69.3%
X, Y
60.5%
*Question 3.*
In my models I usually get this error message:
Warning message:
In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef =
fitted probabilities numerically 0 or 1 occurred
which seems to indicate that there is perfect separation in my
logistic regression. I?m not sure this is the case in my data, how could I
check it and correct for it if needed? Should it be always corrected?
Thanks for your help,
David
[[alternative HTML version deleted]]