bam model selection with 3 million data

Hi David,

1) You cannot perform likelihood-based model comparisons with bam models, or -- for completeness' sake -- with gam models that were fitted using performance iteration or the EFS optimizer. All of these are based on PQL (penalized quasi-likelihood), which makes the log-likelihood (and hence LRT, AIC, BIC, etc.) invalid for comparison purposes. See Wood (2017:149-151). gam() with the default outer iteration should be fine, though. Have you tried fitting your full model using bam with the select=TRUE argument to turn on mgcv's automatic smooth-term selection?

2) I am unsure whether the deviance explained is suitable for indicating effect size, so I can't comment on this question. I might, however, have an alternative suggestion: have you considered partial eta squared or partial omega squared? You should be able to calculate those based on the ANOVA table.

3) I agree with you that the warning suggests complete separation, but in my experience this doesn't automatically have to be a problem. Have you checked the summary for extremely large beta values, and also have you run gam.check() to see if your fit looks reasonable? If neither indicates a problem I wouldn't be too concerned about it.
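
For what it's worth, points 1) and 3) could look roughly like the sketch below. It runs on simulated data, not yours: the data frame dat, the response used, the grouping factor id and the 1e-8 threshold are all placeholders I made up for illustration, and the temporal autocorrelation term is left out. Treat it as a starting point rather than a recipe.

```r
library(mgcv)

# Simulated stand-in for the real data (all names here are hypothetical)
set.seed(1)
n <- 2000
dat <- data.frame(
  X = runif(n), Y = runif(n), Z = runif(n),
  id = factor(sample(1:20, n, replace = TRUE))
)
# Simulated truth: X and Y matter, Z is pure noise, no interactions
dat$used <- rbinom(n, 1, plogis(sin(2 * pi * dat$X) + 2 * (dat$Y - 0.5)))

# Point 1: fit the full model once and let select = TRUE add an extra
# penalty to each smooth, so that unneeded terms can be shrunk towards zero
full <- bam(
  used ~ s(X) + s(Y) + s(Z) + ti(X, Y) + ti(X, Z) + ti(Y, Z) + s(id, bs = "re"),
  family = binomial, data = dat, select = TRUE
)
print(summary(full)$s.table)  # terms with edf close to 0 were penalized away

# Point 3: look for symptoms of separation
b  <- coef(full)
se <- sqrt(diag(vcov(full)))
print(range(b))   # extremely large coefficients on the logit scale?
print(max(se))    # exploding standard errors?

p <- fitted(full)
sep_share <- mean(p < 1e-8 | p > 1 - 1e-8)  # share of fitted values at 0 or 1
print(sep_share)

gam.check(full)   # convergence information, basis dimension check, residual plots
```

With your real data you would replace dat and the formula with your own, and then read the edf column of the summary: smooths that select=TRUE has shrunk to (nearly) zero edf are the ones the data do not support.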

Hope this helps,

Cesko

P.S.: Please send messages in plain text only. As you can see, the formatting of your message was slightly screwed up because the mailing list automatically strips HTML markup.

-----Original message-----
From: R-sig-mixed-models <r-sig-mixed-models-bounces at r-project.org> On behalf of David Villegas Ríos
Sent: Saturday, 1 February 2020 19:57
To: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Subject: [R-sig-ME] bam model selection with 3 million data

Dear list,

I'm investigating the effect of three variables (X, Y, Z) on the probability that an animal uses a particular habitat A. I have a time series of relocations for each animal (>300 individuals), with one relocation every 30 minutes. There are only two options for the response variable: 1=present in habitat A, 0=not present in habitat A. The effects of the three variables are expected to be non-linear so I'm using gam models. My dataset is very large, with >3 million data points so I'm using the bam function from the mgcv library in R. In my models I include a random effect "individual ID", and a temporal autocorrelation term that corrects much but not all of the autocorrelation in the models.

*Question 1.*

When I run a model with the three main effects (X, Y, Z) and the three double interactions (X:Y, X:Z, Y:Z), I get that all terms are highly significant, except for one interaction. If I remove it, then everything is highly significant. However, I also wanted to run simpler models with only one interaction, no interactions, only two main effects and only one main effect. Then, if I compare all these models with AIC or BIC, I get that the best model (by far) is the one with only main effects.
