gam variable selection - R-SIG-ecology

Mon, Sep 26, 2011 11:54 PM #

Dear list,

I am studying the influence of several environmental factors (numeric and
dummy variables) on species densities (numeric) using the gam()
function with a Gaussian family in the mgcv package. As stated in
Wood (2006), there is no variable selection algorithm.

Is it an appropriate (iterative) approach to drop the least significant
predictor (e.g. p > 0.05), refit the model, compare the GCV/AIC
score, and so forth? Should I first focus on the smooth terms or the
fixed effects? Or is such a distinction not important at all?

Perhaps someone has more experience with GAMs and can give me a helping
hand? Thanks in advance!

Best
Marco

Marco Helbich
Department of Geography
University of Heidelberg

Gavin Simpson

Tue, Sep 27, 2011 2:40 AM #

On Tue, 2011-09-27 at 08:54 +0200, Marco Helbich wrote:

You could do that, but I would be sceptical of the results.

Marra and Wood (2011, Computational Statistics and Data Analysis 55;
2372-2387) compare various approaches for feature selection in GAMs.
IIRC, they concluded that an additional penalty term in the smoothness
selection procedure gave the best results. This can be activated in
mgcv::gam() by using the `select = TRUE` argument/setting.
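
A minimal sketch of what this looks like in practice. The simulated data, the variable names, and the choice of `method = "REML"` are assumptions made for illustration and do not come from the thread:

```r
library(mgcv)

set.seed(1)
n <- 300

# Hypothetical simulated data: only x2 and x4 influence y
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n), x4 = runif(n))
dat$y <- sin(2 * pi * dat$x2) + 2 * dat$x4^2 + rnorm(n, sd = 0.3)

# select = TRUE adds an extra penalty to each smooth, so that a term
# can be shrunk towards zero during smoothness selection
m_sel <- gam(y ~ s(x1) + s(x2) + s(x3) + s(x4),
             data = dat, family = gaussian(),
             method = "REML",   # assumption: REML instead of the default GCV
             select = TRUE)

summary(m_sel)
```

Terms that carry no signal would be expected to show an edf close to zero in the summary table, while x2 and x4 keep a clearly non-zero edf.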

HTH

G

%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

Marco Helbich

Tue, Sep 27, 2011 4:42 AM #

Gavin,

thank you for your reply, I appreciate it!

After consulting the proposed paper, I tried your suggestion of
setting "select = T", which raises another question:

If the p-value is "NA", does this mean that the smooth term has been dropped
(or shrunk to zero)? Independent of its high edf, is this predictor
(e.g. s(x1)) not relevant for explaining y?

E.g.:
            edf    Ref.df     F p-value
s(x1) 7.521e-09 1.402e-08 0.000      NA
s(x2) 5.408e+00 6.448e+00 3.049 0.00462 **
s(x3) 6.287e-09 1.217e-08 0.000      NA
s(x4) 2.152e+00 2.754e+00 5.037 0.00248 **

Best
Marco



Gavin Simpson

Tue, Sep 27, 2011 4:50 AM #

On Tue, 2011-09-27 at 13:42 +0200, Marco Helbich wrote:

Those NA terms are ones that have effectively been penalised out of the
model: the EDF are effectively zero for these terms and they explain no
variance in the response. The predictors in s(x1) and s(x3) appear to
have no relationship with y.
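
One way to pick out such terms programmatically is to read the edf column of the summary table. This is only a sketch: the simulated data and the 0.1 cut-off (`edf_cutoff`) are hypothetical choices, not something the thread or mgcv prescribes.

```r
library(mgcv)

set.seed(42)
n <- 300

# Hypothetical data: x1 is pure noise, x2 drives y
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x2) + rnorm(n, sd = 0.3)

m <- gam(y ~ s(x1) + s(x2), data = dat, method = "REML", select = TRUE)

# Per-term effective degrees of freedom from the summary table
edf <- summary(m)$s.table[, "edf"]
print(round(edf, 3))

# Hypothetical cut-off: treat terms with edf below 0.1 as shrunk out
edf_cutoff <- 0.1
shrunk_out <- names(edf)[edf < edf_cutoff]
print(shrunk_out)
```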

You should also check whether there is concurvity, the analogue of
multicollinearity for additive models. The concurvity() function in
mgcv reports whether this is a problem.
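
A small sketch of such a check with `mgcv::concurvity()`. The data are simulated, and x2 is deliberately constructed (as an assumption for illustration) to be nearly a smooth function of x1:

```r
library(mgcv)

set.seed(7)
n  <- 300
x1 <- runif(n)
x2 <- x1^2 + rnorm(n, sd = 0.05)   # nearly a smooth function of x1
x3 <- runif(n)                     # unrelated to x1 and x2
dat <- data.frame(x1, x2, x3)
dat$y <- sin(2 * pi * dat$x1) + rnorm(n, sd = 0.3)

m <- gam(y ~ s(x1) + s(x2) + s(x3), data = dat, method = "REML")

# full = TRUE: concurvity of each term with all other terms combined.
# Rows are three indices ("worst", "observed", "estimate") on a 0-1 scale;
# values near 1 suggest a term is largely reproducible from the others.
conc <- concurvity(m, full = TRUE)
print(round(conc, 2))
```

Calling `concurvity(m, full = FALSE)` instead returns pairwise matrices, which can help locate which pair of terms is responsible.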

HTH

G


Marco Helbich

Tue, Sep 27, 2011 5:40 AM #

Thank you for clarifying.
So I can remove them all at once.

best
marco


Gavin Simpson

Tue, Sep 27, 2011 10:10 AM #

On Tue, 2011-09-27 at 14:40 +0200, Marco Helbich wrote:

Given that their effects have already been removed, you could just work with
the model *as is*. If you refit, be careful to ensure that the same model
(and the same smooth complexities) is selected when the redundant
variables take no part in the fitting.

In other words, check that the model with and without the redundant terms
really is the same.
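
A sketch of how that comparison might be done, again on hypothetical simulated data: fit the model with and without the redundant term, then compare the edf of the retained smooth and the fitted values.

```r
library(mgcv)

set.seed(3)
n <- 300

# Hypothetical data: x1 is redundant, x2 drives y
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x2) + rnorm(n, sd = 0.3)

m_full <- gam(y ~ s(x1) + s(x2), data = dat, method = "REML", select = TRUE)
m_red  <- gam(y ~ s(x2),         data = dat, method = "REML", select = TRUE)

# Complexity of the retained smooth in each fit
edf_full <- summary(m_full)$s.table["s(x2)", "edf"]
edf_red  <- summary(m_red)$s.table["s(x2)", "edf"]
print(c(full = edf_full, reduced = edf_red))

# Largest absolute difference between the two sets of fitted values
max_diff <- max(abs(fitted(m_full) - fitted(m_red)))
print(max_diff)

# Side-by-side AIC as a further check
AIC(m_full, m_red)
```

If the edf values and fitted values differ noticeably, the refit has not selected the same model, and the original fit is probably the safer one to report.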

G
