
GAM model selection and dropping terms based on GCV

2 messages · aditya gangadharan, Simon Wood

Hello,
I have a question regarding model selection and dropping of terms for GAMs fitted with package mgcv. I am following the approach suggested in Wood (2001), Wood and Augustin (2002).
 
I fitted a saturated model, and I find from the plots that for two of the covariates,
1. The confidence interval includes 0 almost everywhere
2. The degrees of freedom are NOT close to 1
3. The partial residuals from plot.gam don't show much pattern visually (to me)
4. When I drop either or both of the terms, the GCV score increases.

This is my main problem: how much of an increase in GCV is 'acceptable' when terms are dropped? In the above case, the GCV score increases by 0.03, 0.06 and 0.11 when I drop covariate A, covariate B, and both, respectively, from the full model.
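(For concreteness, the comparison described above might look something like the following R sketch. The response `y`, the covariates `A`, `B`, `x1`, `x2` and the data frame `dat` are placeholders, not names from the original post.)

```r
## Hypothetical sketch: refit nested models and compare GCV scores.
library(mgcv)

m.full <- gam(y ~ s(A) + s(B) + s(x1) + s(x2), data = dat)
m.noA  <- gam(y ~        s(B) + s(x1) + s(x2), data = dat)
m.noB  <- gam(y ~ s(A) +        s(x1) + s(x2), data = dat)
m.none <- gam(y ~               s(x1) + s(x2), data = dat)

## the GCV/UBRE score of each fit is stored in the $gcv.ubre component
sapply(list(full = m.full, noA = m.noA, noB = m.noB, none = m.none),
       function(m) m$gcv.ubre)
```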
I would be very grateful for any advice on this.

Thank you
Best Wishes
Aditya
On Monday 04 December 2006 12:30, aditya gangadharan wrote:
- I'm not sure that there is really an answer to this. GCV is based on
minimizing an approximation to the expected prediction error of the model.
So to answer the question you'd need to decide how much of an increase over
the 'optimal' prediction error you would be prepared to tolerate. I think
it is not all that easy to come up with a nice way of blending
prediction-error-based approaches to model selection with approaches based
on finding a model that is somehow the simplest model consistent with the
data (but perhaps other people will comment on this).

- That said, there is certainly an issue arising from the fact that the GCV
score (or AIC, in fact) is rather asymmetric, so that random variability in
the score tends to lead more readily to overfitting than to underfitting.
This suggests that prediction error performance at finite sample
sizes may in fact be improved by shrinking the smoothing parameters themselves. With
`mgcv::gam` you can do this by increasing the `gamma` parameter above its
default value, which favours smoother models by making each model degree of
freedom count as `gamma` degrees of freedom in the GCV score (or AIC/UBRE). It
is possible to choose `gamma` by e.g. 10-fold cross-validation, but that
requires some coding.
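(The coding referred to could look roughly like the sketch below: evaluate out-of-sample prediction error on a grid of `gamma` values and keep the minimiser. The formula, `y`, `x1`, `x2` and `dat` are hypothetical placeholders; this is an illustration of the idea, not code from the post.)

```r
## Rough sketch of choosing gamma by 10-fold cross-validation.
library(mgcv)

cv.score <- function(gamma, data, k = 10) {
  fold <- sample(rep(1:k, length.out = nrow(data)))  # random fold labels
  err <- 0
  for (i in 1:k) {
    train <- data[fold != i, ]
    test  <- data[fold == i, ]
    fit <- gam(y ~ s(x1) + s(x2), data = train, gamma = gamma)
    err <- err + sum((test$y - predict(fit, newdata = test))^2)
  }
  err / nrow(data)  # mean squared prediction error
}

## evaluate a grid of gamma values and pick the one minimising CV error
gammas <- seq(1, 2, by = 0.2)
scores <- sapply(gammas, cv.score, data = dat)
gammas[which.min(scores)]
```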

- There are more discussions of GAM model selection in various mgcv help files 
and my book. See help("mgcv-package") for details of which pages, and the 
reference. 

My bottom line on model selection is to use things like GCV, AIC, confidence
interval coverage and approximate p-values for guidance, but not as the basis
for rules... modelling context has to play a part as well.

Sorry if that's all a bit vague.

Simon