
bam model selection with 3 million data

Hi David,

I've never used compareML myself, but after taking a quick look at the code it seems that it just performs a likelihood-ratio test, which means that the usual caveats apply. So if my understanding of this function is correct, it can be used with gam models fitted using outer iteration with method="ML" or method="REML" (which are based on penalized (restricted) maximum likelihood), but not with gam models fitted using GCV/UBRE/Cp (which are not likelihood-based), gam models fitted using non-default optimizers (these use PQL), or bam models (also PQL). What I mean by this is that, while numerically comparing the optimized log-likelihood values may very well work for selecting the best model, a penalized quasi-likelihood is not a true likelihood, so you cannot fall back on Wilks's theorem that twice the difference in log-likelihoods is chi-square-distributed. Analogous reasoning applies to AIC and BIC. But that doesn't mean that such comparisons are useless -- remember that all models are wrong, but some are useful.
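For what it's worth, here is a minimal sketch of the kind of comparison I mean, on simulated data (compareML is from the itsadug package; the by-hand likelihood-ratio test at the end is only meaningful under the ML/REML caveat above, and the variable names are just made up for illustration):

```r
library(mgcv)
library(itsadug)

set.seed(1)
dat <- gamSim(1, n = 400, verbose = FALSE)  # simulated example data from mgcv

# Two nested models fitted by penalized maximum likelihood, not GCV/UBRE:
m0 <- gam(y ~ s(x1),         data = dat, method = "ML")
m1 <- gam(y ~ s(x1) + s(x2), data = dat, method = "ML")

# itsadug's comparison of the two ML scores:
compareML(m0, m1)

# Roughly the same idea done by hand; this chi-square reference only makes
# sense because both models were fitted with a (penalized) true likelihood:
lr <- 2 * (logLik(m1) - logLik(m0))
df <- attr(logLik(m1), "df") - attr(logLik(m0), "df")
pchisq(as.numeric(lr), df = df, lower.tail = FALSE)
```

With PQL-based fits (non-default optimizers, or bam), the same code would run, but the resulting "p-value" would have no chi-square justification behind it.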

Note that the only thing select=TRUE does is add extra penalization to the smooths; it is up to you to decide how much shrinkage warrants removing a term from the model. So if you feel that your one main effect is important, you can always choose to leave it in. However, if you go this route, might I suggest taking a look at the parsimonious mixed models paper (https://arxiv.org/abs/1506.04967), whose bottom line is: if you have reasons to expect a term to be important or unimportant, why even bother with selection procedures instead of just fitting the model that you believe best represents your data? (In fact, I personally would only use stepwise-selection methods if I weren't sure whether a term is or is not important, particularly with respect to achieving convergence...)
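To make the select=TRUE point concrete, a small sketch (again on simulated data; the extra "noise" covariate and what counts as "shrunk enough" are my own illustrative choices, not anything mgcv decides for you):

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400, verbose = FALSE)
dat$noise <- runif(nrow(dat))  # a covariate unrelated to the response

# select = TRUE adds a shrinkage penalty to the null space of each smooth,
# so terms with no real effect can be penalized towards zero:
m <- gam(y ~ s(x2) + s(noise), data = dat, method = "REML", select = TRUE)

summary(m)   # approximate p-values and edf per smooth
pen.edf(m)   # effective degrees of freedom associated with each penalty

# An edf near zero for s(noise) suggests it has been shrunk away, but
# nothing removes the term from the model automatically -- that's your call.
```

The point is that select=TRUE changes the penalization, not the model formula; dropping a term remains a decision you make after inspecting the fit.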

Best,
Cesko

On 4-2-2020 at 12:30, David Villegas Ríos wrote: