
p-level in packages mgcv and gam

11 messages · Denis Chabot, Yves Magliulo, Peter Dalgaard +2 more

#
Hi,

I am fairly new to GAM and started using package mgcv. I like the  
fact that optimal smoothing is automatically used (i.e. df are not  
determined a priori but calculated by the gam procedure).

But the mgcv manual warns that p-level for the smooth can be  
underestimated when df are estimated by the model. Most of the time  
my p-levels are so small that even doubling them would not result in  
a value close to the P=0.05 threshold, but I have one case with P=0.033.

I thought, probably naively, that I could run a second model with fixed  
df, using the df value found by the first model. I could not  
achieve this with mgcv: its gam function does not seem to accept  
fractional values of df (in my case 8.377).

So I used the gam package and fixed df to 8.377. The P-value I  
obtained was slightly larger than with mgcv (0.03655 instead of  
0.03328), but it is still < 0.05.

Was this a correct way to get around the "underestimated P-level"?
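A minimal sketch (simulated data, not from this thread) of what fixing the df looks like in mgcv: setting fx = TRUE in s() removes the penalty and fixes the smooth at exactly k - 1 df, which is why only integer values are possible and 8.377 is rejected.

```r
library(mgcv)

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

## fx = TRUE fixes the smooth at exactly k - 1 df (basis dimension k
## minus one centering constraint); a fractional value like 8.377 must
## be rounded, e.g. k = 9 gives an 8-df regression spline
m_fixed <- gam(y ~ s(x, k = 9, fx = TRUE))
summary(m_fixed)$edf
```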

Furthermore, although the gam.check function of the mgcv package  
suggests to me that the gaussian family (and identity link) are  
adequate for my data, I must say the instructions in R help for  
"family" and in Hastie, T. and Tibshirani, R. (1990) Generalized  
Additive Models are too technical for me. If someone knows a  
reference that explains how to choose model and link, i.e. what tests  
to run on your data before choosing, I would really appreciate it.

Thanks in advance,

Denis Chabot
#
On Mon, 26 Sep 2005, Denis Chabot wrote:
No, this won't work.  The problem is the usual one with model selection: 
the p-value is calculated as if the df had been fixed, when really they 
were estimated.

It is likely to be quite hard to get an honest p-value out of something 
that does adaptive smoothing.

 	-thomas
1 day later
#
I only got one reply to my message:
I do not understand this: it seems that a lot of people choose df=4  
for no particular reason, yet their p-levels are considered correct. If  
instead I choose df=8 because a previous model estimated this to be the  
optimal df, the p-levels are no good because the df were estimated?

Furthermore, shouldn't packages gam and mgcv give similar results  
when the same data and df are used? I tried this:

library(gam)
data(kyphosis)
kyp1 <- gam(Kyphosis ~ s(Age, 4), family = binomial, data = kyphosis)
kyp2 <- gam(Kyphosis ~ s(Number, 4), family = binomial, data = kyphosis)
kyp3 <- gam(Kyphosis ~ s(Start, 4), family = binomial, data = kyphosis)
anova.gam(kyp1)
anova.gam(kyp2)
anova.gam(kyp3)

detach(package:gam)
library(mgcv)
kyp4 <- gam(Kyphosis ~ s(Age, k = 4, fx = TRUE), family = binomial, data = kyphosis)
kyp5 <- gam(Kyphosis ~ s(Number, k = 4, fx = TRUE), family = binomial, data = kyphosis)
kyp6 <- gam(Kyphosis ~ s(Start, k = 4, fx = TRUE), family = binomial, data = kyphosis)
anova.gam(kyp4)
anova.gam(kyp5)
anova.gam(kyp6)
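One caveat about this comparison (my note, not from the thread): in the gam package, s(Age, 4) asks for roughly 4 df, while in mgcv s(Age, k = 4, fx = TRUE) produces a smooth with exactly k - 1 = 3 df, so the paired fits may not be using the same df at all. A quick check on simulated data:

```r
library(mgcv)

set.seed(3)
x <- runif(100)
y <- rnorm(100)

## with fx = TRUE the smooth uses exactly k - 1 df:
## the centering (identifiability) constraint absorbs one df
b <- gam(y ~ s(x, k = 4, fx = TRUE))
summary(b)$edf  # 3, not 4
```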


P-levels for these models, by pair:

kyp1 vs kyp4: p = 0.083 and 0.068 respectively (not too bad)
kyp2 vs kyp5: p = 0.445 and 0.03 (wow!)
kyp3 vs kyp6: p = 0.053 and 0.008 (wow again)

Also, if you plot all these models, you find that the mgcv fits are  
smoother than the gam fits, even though the same df are used every time.

I am really confused now!

Denis

Begin forwarded message:
#
Hi Yves,
On 2005-09-28, at 11:05, Yves Magliulo wrote:
Sorry, I did not see your message last week.
I think I follow you, but how do you "validate"? My fit goes very  
nicely in the middle of the data points and appears fine. In most  
cases p is way smaller than 0.001. I have one case that is bimodal in  
shape and more noisy, and p is only 0.03. How do I validate it, how  
do I check?
It is hard for me to evaluate explanations based on the algorithm used  
to fit the data, but it seems to me that the answers, in terms of the  
significance of the smooth, should at least be very similar.  
Otherwise, what do you do when an author cites one package? Do you  
wonder whether the fit would have been significant using the other package?
Again, I can see that the p-values could differ a bit considering the  
differences between the 2 packages. But when the differences are huge  
and result in contradictory conclusions, I have a problem. Like you I  
hope more help is forthcoming.

Denis
#
On Wed, 28 Sep 2005, Denis Chabot wrote:

Yes. I know this sounds strange initially, but it really does make sense 
if you think about it carefully.

Suppose that Alice and Bob are kyphosis researchers, and that Alice always 
chooses 4 df for smoothing Age.  We would all agree that her p-values are 
correct [in fact we wouldn't, but that is a separate issue].

Bob, on the other hand, chooses the amount of smoothing depending on the 
data. When a 4 df smooth fits best he ends up with the same model as Alice 
and the same p-value.  When some other df fits best he ends up with a 
different model and a *smaller* p-value than Alice.

In particular, this is still true under the null hypothesis that Age has 
no effect [If Alice and Bob are interested in p-values, the null 
hypothesis must be plausible.]

If Bob's p-values are always less than or equal to Alice's p-values under 
the null hypothesis, and Alice's p-values are less than 0.05 5% of the 
time, then Bob's p-values are less than 0.05 more than 5% of the time.


 	-thomas
Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
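Thomas's argument can be checked empirically. A crude sketch (mine, not from the thread): simulate a null relationship many times, let mgcv select the df by GCV in each replicate, and inspect the resulting p-values for the smooth:

```r
library(mgcv)

set.seed(2)
## under the null hypothesis: x is unrelated to y
pvals <- replicate(200, {
  y <- rnorm(50)
  x <- runif(50)
  summary(gam(y ~ s(x)))$s.table[1, "p-value"]
})

## with the df chosen from the data, this rejection rate
## need not equal the nominal 0.05
mean(pvals < 0.05)
```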
#
Thomas Lumley <tlumley at u.washington.edu> writes:
This doesn't actually follow, unless the p-value (directly or
indirectly) found its way into the definition of "best fit". It does
show the danger, though.
#
But what about another analogy, that of polynomials? You may not be  
sure what degree of polynomial to use, and you have not decided before  
analysing your data. You fit different polynomials to your data,  
checking whether added degrees increase R² sufficiently by doing F-tests.

I thought it was the same thing with GAMs. You can fit a model with 4  
df, and in some cases it is of interest to see if this is a better  
fit than a linear fit. But why can't you also check if 7df is better  
than 4df? And if you used mgcv first and it tells you that 7df is  
better than 4df, why bother repeating the comparison 7df against 4df,  
why not just take the p-value for the model with 7df (fixed)?

Denis

On 2005-09-28, at 12:04, Peter Dalgaard wrote:
#
On Wed, 28 Sep 2005, Denis Chabot wrote:

Yes, you can. And this procedure gives you incorrect p-values.

They may not be very incorrect -- it depends on how much model selection 
you do, and how strongly the feature you are selecting on is related to 
the one you are testing.

For example, using step() to choose a polynomial in x even when x is 
unrelated to y and z inflates the Type I error rate by giving a biased 
estimate of the residual mean squared error:

once <- function() {
  y <- rnorm(50); x <- runif(50); z <- rep(0:1, 25)
  summary(step(lm(y ~ z),
               scope = list(lower = ~z, upper = ~z + x + I(x^2) + I(x^3) + I(x^4)),
               trace = 0))$coef["z", 4]
}
## the call that produced this output was omitted from the post;
## presumably something like mean(replicate(nsim, once()) < 0.05)
[1] 0.072

which is significantly higher than you would expect for an honest level 
0.05 test.

 	-thomas
#
Yves Magliulo said the following on 2005-09-28 17:05:
This is not correct. The df are estimated in both cases (i.e. UBRE and 
GCV), but the scale parameter is fixed in the UBRE case. Hence, by 
default UBRE is used for family = binomial or poisson since the scale 
parameter is assumed to be 1. Similarly, GCV is the default for family = 
gaussian since we most often want the scale (usually denoted sigma^2) to 
be estimated.
The warning applies in both cases. The p-values are conditional on the 
smoothing parameters, and the uncertainty of the smooths is not taken 
into account when computing the p-values.
No...
Which version of mgcv do you use? The confusion probably stems from 
earlier versions of mgcv (< 1.3-5): the summary and anova methods used 
to have a column denoted Chi.sq even when the displayed statistic was 
computed as an F statistic. Recent versions of mgcv have

 > summary(b)

Family: gaussian
Link function: identity

Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.9150     0.1049   75.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
         edf Est.rank      F  p-value
s(x0) 5.173    9.000  3.785 0.000137 ***
s(x1) 2.357    9.000 34.631  < 2e-16 ***
s(x2) 8.517    9.000 84.694  < 2e-16 ***
s(x3) 1.000    1.000  0.444 0.505797
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.726   Deviance explained = 73.7%
GCV score =  4.611   Scale est. = 4.4029    n = 400


If we assume that the scale is known and fixed at 4.4029, we get

 > summary(b, dispersion = 4.4029)

Family: gaussian
Link function: identity

Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)

Parametric coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)   7.9150     0.1049   75.44   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
         edf Est.rank  Chi.sq p-value
s(x0) 5.173    9.000  34.067 8.7e-05 ***
s(x1) 2.357    9.000 311.679 < 2e-16 ***
s(x2) 8.517    9.000 762.255 < 2e-16 ***
s(x3) 1.000    1.000   0.444   0.505
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.726   Deviance explained = 73.7%
GCV score =  4.611   Scale est. = 4.4029    n = 400

Note that t/F changed into z/Chi.sq.
The underlying algorithms are very different.


HTH,
Henric