
gam variable selection

On Wed, 2011-09-28 at 10:44 +0100, Rebecca Ross wrote:
Forward selection/backward elimination should be avoided at all costs.
By doing this sort of inclusion/elimination you are introducing
selection bias into the "coefficients" for the terms in the models. You
are explicitly setting some to zero by excluding them from the model,
and this biases the other "coefficients".

Many quantitative ecologists bemoan the continued use of step-wise
feature selection procedures by ecologists. See for example Whittingham
et al. (2006).

Marra and Wood (2011) look at a backwards elimination strategy as part
of their comparison of variable selection methods for GAMs. IIRC, it
performs badly.
By restricting the dimension of the basis functions to such low levels,
you are making an explicit statement about the forms of model that the
GAM can fit. This is fine if you have knowledge to guide this process,
say from previous work etc. that suggests such forms for the fitted
smooths, but if not, you are forcing a very restrictive set of models
that can be fitted.
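To make this concrete, here is a minimal sketch (simulated data via mgcv's gamSim(); the variable names are whatever that simulator produces, not anything from your data) of how the basis dimension k constrains the shapes a smooth can take:

```r
## Sketch: effect of restricting the basis dimension k of a smooth.
## Uses mgcv's built-in simulator, so everything here is illustrative.
library(mgcv)
set.seed(1)
dat <- gamSim(1, n = 200, verbose = FALSE)

## k = 3 permits only very simple shapes; the default for s() (k = 10)
## lets the smoothness penalty decide how much flexibility is used
m_low <- gam(y ~ s(x2, k = 3), data = dat)
m_def <- gam(y ~ s(x2),        data = dat)

## gam.check() reports, per smooth, whether k looks too low
gam.check(m_low)
```

The point is that the penalty shrinks unneeded flexibility away anyway, so a generous k plus penalisation is usually safer than hand-picking a small k without prior knowledge.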
This is an important point - known as concurvity in additive models.
mgcv has a function to compute some measures that might indicate the
presence of concurvity, but this is more involved than just looking for
correlated variables - note that linear correlation is not much use when
you are allowing for non-linear relationships between variables.
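The function referred to is concurvity() (available in reasonably recent versions of mgcv); a minimal sketch on simulated data:

```r
## Sketch: checking for concurvity among smooth terms with mgcv's
## concurvity(). Data are simulated, so the terms are illustrative.
library(mgcv)
set.seed(2)
dat <- gamSim(1, n = 200, verbose = FALSE)
m <- gam(y ~ s(x0) + s(x1) + s(x2), data = dat)

## Measures lie in [0, 1]; values near 1 suggest a smooth is well
## approximated by some combination of the other terms in the model
concurvity(m, full = TRUE)

## full = FALSE gives pairwise measures between individual smooths
concurvity(m, full = FALSE)
```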
The idea here is that one can average the predictions from the set of
best candidate models - not use it as a means to find the best set of
predictors for a single model.
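One common way to do this averaging is via Akaike weights; a minimal sketch (the candidate set and data here are made up for illustration):

```r
## Sketch: AIC-weighted averaging of predictions from candidate GAMs.
## The candidate models below are arbitrary examples.
library(mgcv)
set.seed(3)
dat <- gamSim(1, n = 200, verbose = FALSE)

cand <- list(gam(y ~ s(x0) + s(x1), data = dat),
             gam(y ~ s(x1) + s(x2), data = dat),
             gam(y ~ s(x0) + s(x2), data = dat))

aic <- sapply(cand, AIC)
w <- exp(-0.5 * (aic - min(aic)))
w <- w / sum(w)                          # Akaike weights, sum to 1

preds <- sapply(cand, predict, newdata = dat)
avg_pred <- drop(preds %*% w)            # weighted-average prediction
```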

The paper by Marra and Wood (2011), whilst being somewhat technical in
places, is an excellent resource for comparing the various means
available for doing feature selection in GAMs. Whilst theirs is but one
study, the general result appears to be that adding an extra penalty to
the penalised regression solved by mgcv::gam(), which allows variables
to be shrunk out of the model entirely, is a robust and powerful means
of identifying important features.

Couple this with fitting via REML or ML (not the default GCV), as GCV
can overfit, and we now have very good guides as to how to perform
feature selection in, and fit, GAMs via the penalised regression
approach of Simon Wood as implemented in his mgcv package.
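Putting those two recommendations together, a sketch of the double-penalty approach (here on mgcv's simulated data, where x3 has no true effect):

```r
## Sketch: double-penalty shrinkage selection in mgcv, fitted by REML.
## gamSim(1) generates y from x0, x1, x2 only; x3 is a nuisance term.
library(mgcv)
set.seed(4)
dat <- gamSim(1, n = 400, verbose = FALSE)

## select = TRUE adds the extra penalty so whole terms can be shrunk
## to (effectively) zero; method = "REML" avoids GCV's tendency to
## undersmooth/overfit
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat,
         select = TRUE, method = "REML")

summary(m)  # expect the EDF for s(x3) to be shrunk towards zero
```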

G

Refs:

Marra, G. and Wood, S.N. (2011) Practical variable selection for
generalized additive models. Computational Statistics and Data Analysis
55: 2372-2387

Whittingham, M.J., Stephens, P.A., Bradbury, R.B. and Freckleton, R.P.
(2006) Why do we still use stepwise modelling in ecology and behaviour?
Journal of Animal Ecology 75: 1182-1189