Hi David,
1) You cannot perform likelihood-based model comparisons with bam models, or -- for completeness' sake -- with gam models that were fitted using performance iteration or the EFS optimizer. All of these are based on PQL (penalized quasi-likelihood), which makes the log-likelihood (and hence LRT, AIC, BIC, etc) invalid for comparison purposes. See Wood (2017:149-151). gam() with the default outer iteration should be fine, though. Have you tried fitting your full model using bam with the select=TRUE argument to turn on mgcv's automatic smooth-term selection?
2) I am unsure if the deviance explained is or is not suitable for indicating effect size, so I can't comment on this question. I might, however, have an alternative suggestion: have you considered partial eta squared or partial omega squared? You should be able to calculate those based on the ANOVA table.
3) I agree with you that the warning suggests complete separation, but in my experience this doesn't automatically have to be a problem. Have you checked the summary for extremely large beta values, and also have you run gam.check() to see if your fit looks reasonable? If neither indicates a problem I wouldn't be too concerned about it.
Hope this helps,
Cesko
P.S.: please send messages in plain text only, as you can see the formatting of your message was slightly screwed up because the mailing list automatically strips HTML markup
-----Oorspronkelijk bericht-----
Van: R-sig-mixed-models <r-sig-mixed-models-bounces at r-project.org> Namens David Villegas R?os
Verzonden: zaterdag 1 februari 2020 19:57
Aan: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Onderwerp: [R-sig-ME] bam model selection with 3 million data
Dear list,
I?m investigating the effect of three variables (X, Y, Z) on the probability that an animal uses a particular habitat A. I have a time series of relocations for each animal (>300 individuals), with one relocation every 30 minutes. There are only two options for the response
variable: 1=present in habitat A, 0=not present in habitat A. The effects of the three variables are expected to be non-linear so I?m using gam models. My dataset is very large, with >3 million data points so I?m using the bam function from the mgcv library in R. In my models I include a random effect ?individual ID?, and a temporal autocorrelation term that corrects much but not all of the autocorrelation in the models.
*Question 1.*
When I run a model with the three main effects (X, Y, Z) and the three double interactions (X:Y, X:Z, Y:Z), I get that all terms are highly significant, except for one interaction. If I remove it, then everything is highly significant. However, I also wanted to run simpler models with only one interaction, no interactions, only two main effects and only one main effect. Then, if I compare all these models with AIC or BIC, I get that the best model (by far) is the one with only main effects.
AIC(codcoaAR2,codcoaAR2.1,codcoaAR2.2,codcoaAR2.3,codcoaAR2.4,codcoaAR2.5,codcoaAR2.6,codcoaAR2.7,codcoaAR2.8,codcoaAR2.9,codcoaAR)
df AIC
codcoaAR2 306.1310 -1442543
codcoaAR2.1 293.1608 -1440642
codcoaAR2.2 292.9615 -1438219
codcoaAR2.3 294.3657 -1435346
codcoaAR2.4 284.0026 -1434286
codcoaAR2.5 280.3472 -1396765
codcoaAR2.6 279.6380 -1435862
codcoaAR2.7 269.4968 -1377806
codcoaAR2.8 269.0480 -1393897
codcoaAR2.9 281.8584 -1214270
codcoaAR 271.7066 -2353481 # model with only main effects
I wonder how this is possible if two of the interactions are highly significant.
So my underlying question is: *for a model like this in which sample size is huge, should I make model selection looking at the significance of the different terms in the model, or should I rather look at AIC/BIC?*
*Question 2.*
Let?s assume the model with only main effects is indeed the optimal one.
Then I?d like to get the effect size of each explanatory variable. It?s not clear to me how to do it even after reading some post on this and other forums, but I tried to figure it out by sequentially running the model without one explanatory variable at a time, and then comparing the deviance explained in the optimal model with X, Y, Z with the deviance explained with the reduced model with only Y and Z, for instance. Assuming that the difference would the variance explained by X. *Is this correct? *Looking at the results, the deviance explained by each variable X, Y, Z is quite low, but if the three main effects explain so little variance, who is explaining the rest?
Model
Deviance explained
X, Z, Y
69.3%
Y, Z
68.5%
X, Z
69.3%
X, Y
60.5%
*Question 3.*
In my models I usually get this error message:
Warning message:
In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef, :
fitted probabilities numerically 0 or 1 occurred
which seems to indicate that there is perfect separation in my logistic regression. I?m not sure this is the case in my data, how could I check it and correct for it if needed? Should it be always corrected?
Thanks for your help,
David
_______________________________________________
R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models