gam variable selection
Hi Marco,

Having recently worked with GAMs myself, I would suggest building your model with a forward stepwise approach first. Run an individual GAM for each of your variables and select the significant variable with the best (lowest) AIC as your first term. Then try each of the other variables in turn as the second term, keep the combination with the best AIC, and repeat until the AIC no longer improves.

I found it advisable to always first run each GAM with smooth functions applied to all variables, and with the flexibility of each smooth restricted to avoid overfitting, using the term k=4, e.g. gam(x~s(y,k=4)+s(z,k=4), family=gaussian). (Strictly, for the default thin plate basis k sets the basis dimension, an upper limit on the wiggliness of the smooth, rather than a literal number of knots.) Then check the plots for each of your variables and rerun each model with linear terms wherever the plots advise it.

Also remember to throw out significantly correlated variables once one of the correlates has been selected. A backwards stepwise build can then be run as a check on the forwards build, starting from a global model that excludes the discarded correlates.

Also worth knowing, but not worth relying on, is a function called "dredge" (in the MuMIn package), which runs through your global model and lists the potential models ranked by an information criterion (AICc by default). It is a variable selection algorithm, but it does not take correlates or significance into account, so it is best used only as advice and as another check on a longhand build.
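The forward build described above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the original message: the data are simulated with mgcv's gamSim(), the helper fit_gam() and the variable names are hypothetical, and method = "ML" is chosen here so that AIC values are comparable between models with different terms. The significance check, the switch to linear terms and the removal of correlates are left out and would still need to be done by hand.

```r
library(mgcv)

set.seed(1)
# Stand-in data (assumption): gamSim() returns a response y and covariates x0..x3
dat <- gamSim(eg = 1, n = 200, dist = "normal", scale = 2, verbose = FALSE)

candidates <- c("x0", "x1", "x2", "x3")
selected   <- character(0)

# Hypothetical helper: Gaussian GAM with one k = 4 smooth per chosen variable
fit_gam <- function(vars) {
  rhs <- if (length(vars) == 0) {
    "1"
  } else {
    paste0("s(", vars, ", k = 4)", collapse = " + ")
  }
  gam(as.formula(paste("y ~", rhs)), data = dat,
      family = gaussian, method = "ML")
}

# Start from the intercept-only model
best_aic <- AIC(fit_gam(selected))

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break

  # AIC of the current model plus each remaining variable in turn
  aics <- sapply(remaining, function(v) AIC(fit_gam(c(selected, v))))

  # Stop as soon as no addition improves the AIC
  if (min(aics) >= best_aic) break

  selected <- c(selected, names(which.min(aics)))
  best_aic <- min(aics)
}

final <- fit_gam(selected)
print(selected)
print(summary(final))
```

plot(final, pages = 1) then shows the fitted smooths, which is where near-linear terms would be spotted, and mgcv's concurvity(final) gives one possible check for strongly related smooth terms.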
All the best,
Bex

Research Assistant
University of Plymouth

-----Original Message-----
From: r-sig-ecology-bounces at r-project.org [mailto:r-sig-ecology-bounces at r-project.org] On Behalf Of r-sig-ecology-request at r-project.org
Sent: 27 September 2011 11:00
To: r-sig-ecology at r-project.org
Subject: R-sig-ecology Digest, Vol 42, Issue 16

Today's Topics:

1. gam variable selection (Marco Helbich)
2. Re: gam variable selection (Gavin Simpson)

----------------------------------------------------------------------

Message: 1
Date: Tue, 27 Sep 2011 08:54:52 +0200
From: Marco Helbich <marco.helbich at gmx.at>
To: r-sig-ecology at r-project.org
Subject: [R-sig-eco] gam variable selection
Message-ID: <4E81733C.8090700 at gmx.at>
Content-Type: text/plain; charset=ISO-8859-15; format=flowed

Dear list,

I am studying the influence of several environmental factors (numeric & dummies) on species densities (= numeric) using the gam() function with a Gaussian family in the mgcv package. As stated in Wood (2006), there is no variable selection algorithm. Is it an appropriate (iterative) approach to drop the least significant predictor (e.g. p > 0.05), refit the model, compare the GCV/AIC score, and so forth? Should I first focus on the smooth terms or the fixed effects? Or is such a distinction not important at all?

Perhaps someone has more experience with GAMs and can give me a helping hand? Thanks in advance!
Best
Marco

--
Marco Helbich
Department of Geography
University of Heidelberg

------------------------------

Message: 2
Date: Tue, 27 Sep 2011 10:40:27 +0100
From: Gavin Simpson <gavin.simpson at ucl.ac.uk>
To: Marco Helbich <marco.helbich at gmx.at>
Cc: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] gam variable selection
Message-ID: <1317116427.2714.3.camel at chrysothemis.geog.ucl.ac.uk>
Content-Type: text/plain; charset="UTF-8"
You could do that, but I would be sceptical of the results. Marra and Wood (2011, Computational Statistics & Data Analysis 55: 2372-2387) compare various approaches to feature selection in GAMs. IIRC, they concluded that an additional penalty term in the smoothness selection procedure gave the best results. This can be activated in mgcv::gam() by setting the argument `select = TRUE`.

HTH

G
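A minimal sketch of the `select = TRUE` route mentioned above, for readers who want to try it. The data are simulated with mgcv's gamSim() purely as a stand-in, and the choice of method = "REML" is an assumption made for this example rather than something stated in the reply.

```r
library(mgcv)

set.seed(2)
# Stand-in data (assumption): in gamSim(eg = 1), x3 has no true effect on y
dat <- gamSim(eg = 1, n = 400, dist = "normal", scale = 2, verbose = FALSE)

# select = TRUE adds an extra penalty to each smooth so that a whole
# term can be shrunk towards zero during smoothness selection
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3),
         data = dat, family = gaussian,
         method = "REML", select = TRUE)

# Effective degrees of freedom per smooth; terms that have been
# penalized out of the model show values close to zero
edf <- summary(m)$s.table[, "edf"]
print(round(edf, 2))
```

Terms whose effective degrees of freedom are shrunk close to zero are the candidates for dropping, which replaces the refit-and-compare loop with a single fit.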
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

------------------------------

_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology

End of R-sig-ecology Digest, Vol 42, Issue 16