bam model selection with 3 million data

Hi David,

Please keep the mailing list in cc.

Yes, gam() can indeed be prohibitively slow, hence why I suggested the select=TRUE approach. I hope the below explanation helps.

Smooth terms are composed of a null space and a range space, which respectively represent the completely smooth part of the effect and the wiggly part of the effect. In the mathematical representation, these are equivalent to fixed effects and random effects, respectively. As such, the range space is subject to penalization and can shrink to zero. In GAM terms, this would correspond to a smoothing parameter tending to infinity, resulting in a completely smooth effect: you would be left only with the null space. What select=TRUE does is add an additional penalty to all null spaces, so that these too can shrink towards zero. Effects which are shrunk to (near) zero in this way can then be removed from the model by the user.

The degree of shrinkage can be seen by looking at the edf. Smooths that are (near) zero will also use (near) zero edf. In practice, when I do model selection using select=TRUE, I always remove terms whose edf < 1. Then I fit the model again with select=TRUE to ensure that no further reduction is necessary, and then I fit a final model with select=FALSE.

Best,
Cesko

-----Oorspronkelijk bericht-----
Van: David Villegas R?os <chirleu at gmail.com> 
Verzonden: maandag 3 februari 2020 11:28
Aan: Voeten, C.C. <c.c.voeten at hum.leidenuniv.nl>
Onderwerp: Re: [R-sig-ME] bam model selection with 3 million data

Thanks Cesko, really appreciate your answer.
In my particular case however, I cannot run gam models (they take forever to run) so I?m still struggling on how to perform model selection in bam. I tried select=TRUE and the summary changed a bit, but I don?t really know what this argument is doing behind?the scenes, or whether I should trust the summary from the model using select=TRUE rather than the default select=FALSE.
Best wishes,?
David

El s?b., 1 feb. 2020 a las 20:39, Voeten, C.C. (<mailto:c.c.voeten at hum.leidenuniv.nl>) escribi?:
Hi David,

1) You cannot perform likelihood-based model comparisons with bam models, or -- for completeness' sake -- with gam models that were fitted using performance iteration or the EFS optimizer. All of these are based on PQL (penalized quasi-likelihood), which makes the log-likelihood (and hence LRT, AIC, BIC, etc) invalid for comparison purposes. See Wood (2017:149-151). gam() with the default outer iteration should be fine, though. Have you tried fitting your full model using bam with the select=TRUE argument to turn on mgcv's automatic smooth-term selection?

2) I am unsure if the deviance explained is or is not suitable for indicating effect size, so I can't comment on this question. I might, however, have an alternative suggestion: have you considered partial eta squared or partial omega squared? You should be able to calculate those based on the ANOVA table.

3) I agree with you that the warning suggests complete separation, but in my experience this doesn't automatically have to be a problem. Have you checked the summary for extremely large beta values, and also have you run gam.check() to see if your fit looks reasonable? If neither indicates a problem I wouldn't be too concerned about it.

Hope this helps,

Cesko

P.S.: please send messages in plain text only, as you can see the formatting of your message was slightly screwed up because the mailing list automatically strips HTML markup

-----Oorspronkelijk bericht-----
Van: R-sig-mixed-models <mailto:r-sig-mixed-models-bounces at r-project.org> Namens David Villegas R?os
Verzonden: zaterdag 1 februari 2020 19:57
Aan: r-sig-mixed-models <mailto:r-sig-mixed-models at r-project.org>
Onderwerp: [R-sig-ME] bam model selection with 3 million data

Dear list,

I?m investigating the effect of three variables (X, Y, Z) on the probability that an animal uses a particular habitat A. I have a time series of relocations for each animal (>300 individuals), with one relocation every 30 minutes. There are only two options for the response
variable: 1=present in habitat A, 0=not present in habitat A. The effects of the three variables are expected to be non-linear so I?m using gam models. My dataset is very large, with >3 million data points so I?m using the bam function from the mgcv library in R. In my models I include a random effect ?individual ID?, and a temporal autocorrelation term that corrects much but not all of the autocorrelation in the models.

*Question 1.*

When I run a model with the three main effects (X, Y, Z) and the three double interactions (X:Y, X:Z, Y:Z), I get that all terms are highly significant, except for one interaction. If I remove it, then everything is highly significant. However, I also wanted to run simpler models with only one interaction, no interactions, only two main effects and only one main effect. Then, if I compare all these models with AIC or BIC, I get that the best model (by far) is the one with only main effects.