Hi, I am fairly new to GAM and started using package mgcv. I like the fact that optimal smoothing is chosen automatically (i.e. the df are not determined a priori but estimated by the gam procedure). But the mgcv manual warns that the p-value for a smooth can be underestimated when the df are estimated by the model. Most of the time my p-values are so small that even doubling them would not bring them close to the P=0.05 threshold, but I have one case with P=0.033.

I thought, probably naively, that I could get around this by running a second model with fixed df, using the df value found by the first model. I could not achieve this with mgcv: its gam function does not seem to accept fractional values of df (in my case 8.377). So I used the gam package and fixed df at 8.377. The P-value I obtained was slightly larger than with mgcv (0.03655 instead of 0.03328), but it is still < 0.05. Was this a correct way to get around the "underestimated P-level"?

Furthermore, although the gam.check function of the mgcv package suggests that the gaussian family (and identity link) are adequate for my data, I must say the instructions in the R help for "family" and in Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models are too technical for me. If someone knows a reference that explains how to choose the model and link, i.e. what tests to run on your data before choosing, I would really appreciate it.

Thanks in advance,
Denis Chabot
p-level in packages mgcv and gam
11 messages · Denis Chabot, Yves Magliulo, Peter Dalgaard +2 more
On Mon, 26 Sep 2005, Denis Chabot wrote:
But the mgcv manual warns that the p-value for a smooth can be underestimated when the df are estimated by the model. Most of the time my p-values are so small that even doubling them would not bring them close to the P=0.05 threshold, but I have one case with P=0.033. I thought, probably naively, that I could get around this by running a second model with fixed df, using the df value found by the first model. I could not achieve this with mgcv: its gam function does not seem to accept fractional values of df (in my case 8.377).
No, this won't work. The problem is the usual one with model selection: the p-value is calculated as if the df had been fixed, when really it was estimated. It is likely to be quite hard to get an honest p-value out of something that does adaptive smoothing. -thomas
1 day later
I only got one reply to my message:
No, this won't work. The problem is the usual one with model
selection: the p-value is calculated as if the df had been fixed,
when really it was estimated.
It is likely to be quite hard to get an honest p-value out of
something that does adaptive smoothing.
-thomas
I do not understand this: it seems that a lot of people choose df=4 for no particular reason, and their p-values are correct. But if I choose df=8 because a previous model estimated this to be the optimal df, the p-values are no good because the df were estimated?

Furthermore, shouldn't packages gam and mgcv give similar results when the same data and df are used? I tried this:

library(gam)
data(kyphosis)
kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis)
kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis)
kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis)
anova.gam(kyp1)
anova.gam(kyp2)
anova.gam(kyp3)

detach(package:gam)
library(mgcv)
kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=TRUE), family=binomial, data=kyphosis)
kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=TRUE), family=binomial, data=kyphosis)
kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=TRUE), family=binomial, data=kyphosis)
anova.gam(kyp4)
anova.gam(kyp5)
anova.gam(kyp6)

P-values for these models, by pair:
kyp1 vs kyp4: p = 0.083 and 0.068 respectively (not too bad)
kyp2 vs kyp5: p = 0.445 and 0.03 (wow!)
kyp3 vs kyp6: p = 0.053 and 0.008 (wow again)

Also, if you plot all of these, you find that the mgcv fits are smoother than the gam fits, even though the same df are used throughout. I am really confused now!

Denis
Begin forwarded message:
From: Denis Chabot <chabotd at globetrotter.net>
Date: 26 September 2005 12:25:04 EDT
To: r-help at stat.math.ethz.ch
Subject: p-level in packages mgcv and gam
Hi Yves,

On 2005-09-28 at 11:05, Yves Magliulo wrote:
Hi, I'll try to help you. I sent a mail about this subject last week and did not get any response...
Sorry, I did not see your message last week.
I'm using gam from package mgcv.

1) How to interpret the significance of smooth terms is hard for me to understand perfectly. Using UBRE, you fix the df, and p-values are estimated from a chi-squared distribution. Using GCV, the best df are estimated by gam (which is what I want), and p-values are estimated from an F distribution. But in that case ?summary.gam says "use at your own risk". You can also look at the chi.sq statistic, but I don't know how to choose a criterion for it the way you do for p-values; for me, chi.sq shows which predictor is best in a model, but it's hard to reject one with it.

So as far as I'm concerned, I use the GCV method and apply a 5% threshold on the null hypothesis (p-value) to select significant predictors. Afterwards I look at my smooths, and if the parametrization looks fine to me, I validate. Generally, for p-values smaller than 0.001 you can be confident; above 0.001, you have to check.
I think I follow you, but how do you "validate"? My fit goes very nicely through the middle of the data points and appears fine. In most cases p is much smaller than 0.001, but I have one case that is bimodal in shape and noisier, where p is only 0.03. How do I validate it; how do I check?
2) As for the difference between packages gam and mgcv, I sent a mail about this one year ago; here's the response:

"- Package gam is based very closely on the GAM approach presented in Hastie and Tibshirani's "Generalized Additive Models" book. Estimation is by back-fitting, and model selection is based on step-wise regression methods using approximate distributional results. A particular strength of this approach is that local regression smoothers (`lo()' terms) can be included in GAM models.

- gam in package mgcv represents GAMs using penalized regression splines. Estimation is by direct penalized likelihood maximization with integrated smoothness estimation via GCV or related criteria (there is also an alternative `gamm' function based on a mixed model approach). Strengths of this approach are that s() terms can be functions of more than one variable, and that tensor product smooths are available via te() terms; these are useful when different degrees of smoothness are appropriate for different arguments of a smooth. (...)

Basically, if you want integrated smoothness selection, an underlying parametric representation, or smooth interactions in your models, then mgcv is probably worth a try (but I would say that). If you want to use local regression smoothers and/or prefer the stepwise selection approach, then package gam is for you."
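To make the quoted description concrete, here is a minimal sketch of the s() and te() terms it mentions. This is illustrative only: it assumes mgcv's gamSim() test-data generator, whose eg=2 example returns a list with a $data component holding covariates x, z and response y.

```r
## Sketch: multivariate smooths in mgcv, as described in the quote above.
## gamSim(2) generates a bivariate test surface (assumed interface).
library(mgcv)

dat <- gamSim(2, n = 400)$data

## An isotropic smooth of two variables: appropriate when x and z
## are measured on comparable scales.
b1 <- gam(y ~ s(x, z), data = dat)

## A tensor product smooth: allows a different degree of smoothness
## in each argument, which is the point made in the quoted text.
b2 <- gam(y ~ te(x, z), data = dat)

summary(b2)
```
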
It is hard for me to evaluate explanations based on the algorithm used to fit the data, but it seems to me that the answers, in terms of the significance of the smooth, should be at least very similar. Otherwise, what do you do when an author cites one package? Do you wonder whether the fit would have been significant using the other package?
I think the difference in p-values between gam and mgcv arises because you don't have the same number of iteration steps: with mgcv's gam the number of steps is chosen automatically, while with gam's gam you have to choose it. Hope it helps, and that someone gives us more details. Yves
Again, I can see that the p-values could differ a bit given the differences between the two packages. But when the differences are huge and lead to contradictory conclusions, I have a problem. Like you, I hope more help is forthcoming. Denis
On Wed, 28 Sep 2005, Denis Chabot wrote:
I do not understand this: it seems that a lot of people choose df=4 for no particular reason, and their p-values are correct. But if I choose df=8 because a previous model estimated this to be the optimal df, the p-values are no good because the df were estimated?
Yes. I know this sounds strange initially, but it really does make sense if you think about it carefully.

Suppose that Alice and Bob are kyphosis researchers, and that Alice always chooses 4 df for smoothing Age. We would all agree that her p-values are correct [in fact we wouldn't, but that is a separate issue]. Bob, on the other hand, chooses the amount of smoothing depending on the data. When a 4 df smooth fits best, he ends up with the same model as Alice and the same p-value. When some other df fits best, he ends up with a different model and a *smaller* p-value than Alice. In particular, this is still true under the null hypothesis that Age has no effect [if Alice and Bob are interested in p-values, the null hypothesis must be plausible].

If Bob's p-values are always less than or equal to Alice's p-values under the null hypothesis, and Alice's p-values are less than 0.05 5% of the time, then Bob's p-values are less than 0.05 more than 5% of the time.

-thomas
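The Alice/Bob comparison above can be sketched as a small simulation in mgcv. This is a sketch, not a definitive check: the replicate count, seed, and basis dimension are arbitrary choices, and it assumes summary.gam exposes the smooth-term p-values in its s.pv component.

```r
## Sketch: null-hypothesis behaviour of a fixed-df smooth ("Alice")
## versus a GCV-selected smooth ("Bob"). y is pure noise, unrelated to x.
library(mgcv)

set.seed(1)
nsim <- 200
p.alice <- p.bob <- numeric(nsim)
for (i in 1:nsim) {
  x <- runif(100)
  y <- rnorm(100)
  p.alice[i] <- summary(gam(y ~ s(x, k = 5, fx = TRUE)))$s.pv  # df fixed a priori
  p.bob[i]   <- summary(gam(y ~ s(x)))$s.pv                    # df chosen by GCV
}
mean(p.alice < 0.05)   # expected to be close to the nominal 0.05
mean(p.bob < 0.05)     # tends to exceed 0.05, as argued above
```
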
______________________________________________ R-help at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Thomas Lumley Assoc. Professor, Biostatistics tlumley at u.washington.edu University of Washington, Seattle
Thomas Lumley <tlumley at u.washington.edu> writes:
Bob, on the other hand, chooses the amount of smoothing depending on the data. When a 4 df smooth fits best he ends up with the same model as Alice and the same p-value. When some other df fits best he ends up with a different model and a *smaller* p-value than Alice.
This doesn't actually follow, unless the p-value (directly or indirectly) found its way into the definition of "best fit". It does show the danger, though.
O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen  Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)          FAX: (+45) 35327907
But what about another analogy, that of polynomials? You may not be sure what degree of polynomial to use, and you have not decided before analysing your data. You fit polynomials of different degrees to your data, checking whether added degrees increase r2 sufficiently by doing F-tests.

I thought it was the same thing with GAMs. You can fit a model with 4 df, and in some cases it is of interest to see whether this is a better fit than a linear fit. But why can't you also check whether 7 df is better than 4 df? And if you used mgcv first and it told you that 7 df is better than 4 df, why bother repeating the comparison of 7 df against 4 df? Why not just take the p-value for the model with 7 df (fixed)?

Denis
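The polynomial procedure described here can be written out directly; a sketch with made-up data follows. The caveat from the earlier replies applies here too, since the p-value of the finally chosen degree is computed as if that degree had been fixed in advance.

```r
## Sketch: choosing a polynomial degree by sequential F-tests.
## Data are simulated for illustration only.
set.seed(2)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

fit1 <- lm(y ~ x)            # linear
fit2 <- lm(y ~ poly(x, 2))   # quadratic
fit3 <- lm(y ~ poly(x, 3))   # cubic

anova(fit1, fit2)   # does adding the quadratic term help?
anova(fit2, fit3)   # does adding the cubic term help?
```
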
On Wed, 28 Sep 2005, Denis Chabot wrote:
But what about another analogy, that of polynomials? You may not be sure what degree of polynomial to use, and you have not decided before analysing your data. You fit polynomials of different degrees to your data, checking whether added degrees increase r2 sufficiently by doing F-tests.
Yes, you can. And this procedure gives you incorrect p-values.
They may not be very incorrect; it depends on how much model selection you do, and how strongly the feature you are selecting on is related to the one you are testing. For example, using step() to choose a polynomial in x even when x is unrelated to y and z inflates the Type I error rate by giving a biased estimate of the residual mean squared error:
once <- function() {
  y <- rnorm(50); x <- runif(50); z <- rep(0:1, 25)
  summary(step(lm(y ~ z),
               scope = list(lower = ~z, upper = ~z + x + I(x^2) + I(x^3) + I(x^4)),
               trace = 0))$coef["z", 4]
}
p <- replicate(1000, once())
mean(p < 0.05)
[1] 0.072

which is significantly higher than you would expect from an honest level-0.05 test.

-thomas
Yves Magliulo said the following on 2005-09-28 17:05:
Hi, I'll try to help you. I sent a mail about this subject last week and did not get any response... I'm using gam from package mgcv. 1) How to interpret the significance of smooth terms is hard for me to understand perfectly. Using UBRE, you fix the df, and p-values are estimated from a chi-squared distribution. Using GCV, the best df are estimated by gam (which is what I want), and p-values
This is not correct. The df are estimated in both cases (i.e. UBRE and GCV), but the scale parameter is fixed in the UBRE case. Hence, by default UBRE is used for family = binomial or poisson since the scale parameter is assumed to be 1. Similarly, GCV is the default for family = gaussian since we most often want the scale (usually denoted sigma^2) to be estimated.
are estimated from an F distribution. But in that case ?summary.gam says "use at your own risk".
The warning applies in both cases. The p-values are conditional on the smoothing parameters, and the uncertainty of the smooths is not taken into account when computing the p-values.
So you can also look at the chi.sq, but I don't know how to choose a
No...
criterion for it the way you do for p-values... For me, chi.sq shows which predictor is best in a model, but it's hard to reject one with it.
Which version of mgcv do you use? The confusion probably stems from earlier versions of mgcv (< 1.3-5): the summary and anova methods used to have a column labelled Chi.sq even when the displayed statistic was computed as F. Recent versions of mgcv have
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.9150 0.1049 75.44 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Est.rank F p-value
s(x0) 5.173 9.000 3.785 0.000137 ***
s(x1) 2.357 9.000 34.631 < 2e-16 ***
s(x2) 8.517 9.000 84.694 < 2e-16 ***
s(x3) 1.000 1.000 0.444 0.505797
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.726 Deviance explained = 73.7%
GCV score = 4.611 Scale est. = 4.4029 n = 400
If we assume that the scale is known and fixed at 4.4029, we get
> summary(b, dispersion = 4.4029)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)
Parametric coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 7.9150 0.1049 75.44 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Est.rank Chi.sq p-value
s(x0) 5.173 9.000 34.067 8.7e-05 ***
s(x1) 2.357 9.000 311.679 < 2e-16 ***
s(x2) 8.517 9.000 762.255 < 2e-16 ***
s(x3) 1.000 1.000 0.444 0.505
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.726 Deviance explained = 73.7%
GCV score = 4.611 Scale est. = 4.4029 n = 400
Note that t/F changed into z/Chi.sq.
So as far as I'm concerned, I use the GCV method and apply a 5% threshold on the null hypothesis (p-value) to select significant predictors. Afterwards I look at my smooths, and if the parametrization looks fine to me, I validate. Generally, for p-values smaller than 0.001 you can be confident; above 0.001, you have to check. 2) As for the difference between packages gam and mgcv, I sent a mail about this one year ago...

The underlying algorithms are very different.

HTH, Henric