Stepwise algorithm for GAM
3 messages · ARISTIDES LOPEZ, Zoltan Botta-Dukat, Gavin Simpson
On Tue, 2011-05-24 at 07:45 +0200, Zoltan Botta-Dukat wrote:
Hi,

There is no automatic variable selection in the mgcv package. You should remove the superfluous terms manually. You can choose them using an ML (likelihood-ratio) test, by comparing AIC values, or by using the plot function. An example:

library(mgcv)
set.seed(3)
n <- 200
## simulate data
dat <- gamSim(1, n = n, scale = .15, dist = "poisson")
str(dat)
## spurious predictors
dat$x4 <- runif(n, 0, 1)
dat$x5 <- runif(n, 0, 1)
## full model
b1 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5), data = dat, family = poisson)
summary(b1) # you can choose superfluous predictors based on this output
## reduced model without x5
b2 <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4), data = dat, family = poisson)
anova(b1, b2, test = "Chisq") # comparing the two models
plot(b1, pages = 1) # the smooth function is a nearly horizontal line for superfluous predictors

Setting select = TRUE may give a clearer pattern; however, in my toy example the difference is small.
Small difference in terms of the selected model perhaps, but I think the difference between manual selection and the penalisation using `select = TRUE` is vast. In the former you have used the training data to inform model selection - the standard errors and p-values know nothing of this selection and are thus biased. In the latter, with `select = TRUE`, an additional penalty term in the smoothness selection is optimised over during model fitting. The p-values, whilst still approximate, are at least interpretable in that case.

G
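A minimal sketch of the penalised alternative discussed above, reusing the simulated data from Zoltan's example. The choice of method = "REML", the verbose = FALSE argument, and the object names b_sel and edf_by_term are assumptions made for illustration; they are not taken from the thread.

```r
library(mgcv)  # provides gam() and gamSim()

set.seed(3)
n <- 200
# Simulated Poisson data as in the example above; x4 and x5 are pure noise
dat <- gamSim(1, n = n, scale = 0.15, dist = "poisson", verbose = FALSE)
dat$x4 <- runif(n, 0, 1)
dat$x5 <- runif(n, 0, 1)

# select = TRUE adds an extra penalty to each smooth, so that a whole term
# can be shrunk towards zero while the model is being fitted
# (method = "REML" is an assumption, not something the thread specifies)
b_sel <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3) + s(x4) + s(x5),
             data = dat, family = poisson, method = "REML", select = TRUE)

# Effective degrees of freedom per smooth term; terms that have been
# penalised out of the model show values close to zero
edf_by_term <- summary(b_sel)$s.table[, "edf"]
print(round(edf_by_term, 2))
```

Terms whose effective degrees of freedom end up near zero are the candidates that manual selection would have dropped, but here no refitting on the same data is involved.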
Best wishes,
Zoltan

On 2011.05.24 at 5:21, ARISTIDES LOPEZ wrote:
Hello all,

Just a question: I'm trying to fit my model through stepwise selection. At this point (with the valuable help of Gavin and Ben) my models look like this:

model 1<-gam(Young (No. ind)~s(Lat, k=6)+s(Long, k=6)+s(Deep, k=6)+s(Area (km2),k=6)+as.factor (year),family=poisson,data=L. synagris)

I have 4 species * 3 groups (young, adult and total) * 5 explanatory variables (Lat, Lon, Deep, Area, Year), so I'm looking for a stepwise algorithm that helps me select the best model. I tried step() in the stats package, but R gives me the following error message:

"Error in glm.control(irls.reg = 0, epsilon = 1e-06, maxit = 100, trace = FALSE, : unused argument(s) (irls.reg = 0, mgcv.tol = 1e-07, mgcv.half = 15,..............."

Any suggestions?

Cheers

Date: Wed, 18 May 2011 10:53:41 -0500
From: ARISTIDES LOPEZ<aristideslpz at gmail.com>
To: r-sig-ecology at r-project.org
Subject: [R-sig-eco] Error message in GAM
Message-ID:<BANLkTikz-dQ=jV9YkfTGgEYO5uBWmcUsMw at mail.gmail.com>
Content-Type: text/plain

Dear list members,

I'm trying to build a model to describe the distribution of demersal fishes in the Colombian Caribbean Sea. I have a data set of n = 56, and the model is like this: Density (ind/km2) ~ s(Lat) + s(Long) + s(deep). The problem is that R gives me the error message *"Model has more coefficients than data"*. Does anybody know how I can avoid this?

Faithfully.
--
Aristides López-Peña

Date: Wed, 18 May 2011 17:48:04 +0100
From: Gavin Simpson<gavin.simpson at ucl.ac.uk>
To: ARISTIDES LOPEZ<aristideslpz at gmail.com>
Cc: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] Error message in GAM
Message-ID:<1305737284.25148.15.camel at prometheus.geog.ucl.ac.uk>
Content-Type: text/plain; charset="UTF-8"

On Wed, 2011-05-18 at 10:53 -0500, ARISTIDES LOPEZ wrote:
Dear list members, I'm trying to build a model to describe the distribution of demersal fishes in the Colombian Caribbean Sea. I have a data set of n = 56, and the model is like this: Density (ind/km2) ~ s(Lat) + s(Long) + s(deep). The problem is that R gives me the error message *"Model has more coefficients than data"*. Does anybody know how I can avoid this? Faithfully.
Each of your smooths will be using k = 10 degrees of freedom, so that is 30 degrees of freedom already, which is a lot for a data set of 56 observations.

Are all the data unique? I.e. do you have 56 unique density values, 56 unique lats, 56 unique lons etc.? If not, it might be that the unique information in the data is not sufficient to support the complexity of the smooths.

My money would be on you having done something you haven't actually told us, so that there are more smooths in the model than you say and they are using more degrees of freedom than it appears to us.

The easy way to try to solve the problem will be to restrict the complexity of the individual smooths:

response ~ s(Lat, k = 6) + s(Long, k = 6) + s(deep, k = 6)

for example.

You could probably model these data as a Poisson with an offset term for the km2 covered by each sample, rather than treating these as a density.

HTH,

G

--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%

------------------------------
Message: 9
Date: Wed, 18 May 2011 17:16:10 -0500
From: ARISTIDES LOPEZ<aristideslpz at gmail.com>
To: r-sig-ecology at r-project.org, gavin.simpson at ucl.ac.uk
Subject: Re: [R-sig-eco] Error message in GAM
Message-ID:<BANLkTimUQhNjhdOX9LNNDdT60gSiWNX38w at mail.gmail.com>
Content-Type: text/plain

Dear Dr. Gavin,

Thank you very much for your help. All my data are unique (because I have 56 different stations). As you suggested, I restricted the complexity of the individual smooths:

response ~ s(Lat, k = 9) + s(Long, k = 9) + s(deep, k = 9)

Problem solved.
Now I am trying to fit another model:

modelo2<-gam(Density~s(year, k=6)+s(Month, k=6)+s(rainfall, k=6), family=Gamma, data=at)

The "new" problem is that R gives me the following error: *"Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom"*. Does anybody know what this means?

Regards.
--
Aristides López-Peña
------------------------------
Message: 10
Date: Wed, 18 May 2011 18:28:20 -0400
From: Ben Bolker<bbolker at gmail.com>
To: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] Error message in GAM
Message-ID:<4DD44804.1020705 at gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
On 05/18/2011 06:16 PM, ARISTIDES LOPEZ wrote:
Dear Dr. Gavin, Thank you very much for your help. All my data are unique (because I have 56 different stations). As you suggested, I restricted the complexity of the individual smooths: response ~ s(Lat, k = 9) + s(Long, k = 9) + s(deep, k = 9). Problem solved.

Now I am trying to fit another model: modelo2<-gam(Density~s(year, k=6)+s(Month, k=6)+s(rainfall, k=6), family=Gamma, data=at). The "new" problem is that R gives me the following error: *"Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom"*. Does anybody know what this means? Regards.
It means you're pushing your data too hard: how about being
old-fashioned and fitting quadratic models [e.g. poly(Lat,2)] for each
of your predictor variables? (This of course ignores interactions, which
you might want to worry about in some cases -- but you probably
can't.) In principle, gam() in the mgcv package (which is what I assume
you are using) tries to adjust the degree of complexity of your model
downward as appropriate, but it may be having a hard time doing so; can
you set k lower? For the models that do succeed, I would suspect that
the effective degrees of freedom fitted are much lower than the k values
you are specifying, so you could afford to reduce them (see ?choose.k).
Remember the rule of thumb that you should not be trying to fit more
than *at most* N/10 parameters, where N is your number of points. With
N = 56 that is about 5 or 6 parameters, so quadratic models of 3
independent predictors (= 7 parameters, intercept + 2 for each predictor
variable) would already be overfitting slightly.
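The parameter count in the rule of thumb above can be checked with a short sketch. The data frame d below is a hypothetical stand-in filled with random numbers (the poster's data are not available), and lm() is used only to keep the illustration simple; neither is taken from the thread.

```r
set.seed(1)
N <- 56  # number of stations reported in the thread

# Hypothetical stand-in data; the values are random and carry no meaning
d <- data.frame(Lat  = runif(N, 9, 13),
                Long = runif(N, -77, -72),
                deep = runif(N, 10, 300))
d$Density <- rnorm(N, mean = 50, sd = 10)

# Quadratic model in each predictor, without interactions
m_quad <- lm(Density ~ poly(Lat, 2) + poly(Long, 2) + poly(deep, 2), data = d)

n_par  <- length(coef(m_quad))  # intercept + 2 per predictor = 7
budget <- N / 10                # rule-of-thumb ceiling = 5.6

cat("parameters:", n_par, " N/10:", budget, "\n")
```

Because 7 exceeds 5.6, even this simple model is slightly over the suggested budget for 56 observations.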
cheers
Ben Bolker
_______________________________________________ R-sig-ecology mailing list R-sig-ecology at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-ecology
------------------------------
Message: 11
Date: Thu, 19 May 2011 07:35:39 +0100
From: Gavin Simpson<gavin.simpson at ucl.ac.uk>
To: ARISTIDES LOPEZ<aristideslpz at gmail.com>
Cc: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] Error message in GAM
Message-ID:<1305786939.2773.3.camel at chrysothemis.geog.ucl.ac.uk>
Content-Type: text/plain; charset="UTF-8"
On Wed, 2011-05-18 at 17:16 -0500, ARISTIDES LOPEZ wrote:
Dear Dr. Gavin, Thank you very much for your help. All my data are unique (because I have 56 different stations). As you suggested, I restricted the complexity of the individual smooths: response ~ s(Lat, k = 9) + s(Long, k = 9) + s(deep, k = 9). Problem solved.

Now I am trying to fit another model: modelo2<-gam(Density~s(year, k=6)+s(Month, k=6)+s(rainfall, k=6), family=Gamma, data=at). The "new" problem is that R gives me the following error: *"Error in smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) : A term has fewer unique covariate combinations than specified maximum degrees of freedom"*.
It means exactly what it says. One of the terms in the model:
* s(year, k = 6)
* s(Month, k = 6)
* s(rainfall, k = 6)
has *fewer* than 6 unique values. Look at the outputs from
with(at, table(year))
with(at, table(Month))
with(at, table(rainfall))
to see which one(s) it is.
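The same check can be sketched programmatically by counting the unique values of each covariate and comparing the count with k. The data frame at below is a hypothetical stand-in with random values (only its column names come from the thread), and n_unique and too_few are illustrative names.

```r
set.seed(2)
# Hypothetical stand-in for the poster's data frame `at`;
# year deliberately takes only four distinct values
at <- data.frame(year     = sample(2006:2009, 40, replace = TRUE),
                 Month    = sample(1:12, 40, replace = TRUE),
                 rainfall = round(runif(40, 0, 300), 1))

k <- 6  # basis dimension requested for each smooth in the thread

# Number of distinct values per covariate
n_unique <- sapply(at[c("year", "Month", "rainfall")],
                   function(x) length(unique(x)))

# Covariates that cannot support a smooth with this k
too_few <- names(n_unique)[n_unique < k]

print(n_unique)
print(too_few)
```

For any covariate listed in too_few, one option is to lower k for that term to no more than its number of unique values; another is to enter it as a factor or a linear term instead of a smooth.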
G