Hi all! I'm getting a model fit from glm() (a binary logistic regression fit, but I don't think that's important) for a formula that contains powers of the explanatory variable up to fourth. So the fit looks something like this (typing into mail; the actual fit code is complicated because it involves step-down and so forth):
x_sq <- x * x
x_cb <- x * x * x
x_qt <- x * x * x * x
model <- glm(outcome ~ x + x_sq + x_cb + x_qt, binomial)
This works great, and I get a nice fit. My question regards the confidence intervals on that fit. I am calling:
cis <- confint.default(model, level=0.95)
to get the confidence intervals for each coefficient. This confint.default() call gives me a table like this (where "dispersal" is the real name of what I'm calling "x" above):
                    2.5 %       97.5 %
(Intercept)     8.7214297    8.9842533
dispersal     -37.5621412  -35.6629981
dispersal_sq   66.8077669   74.4490123
dispersal_cb  -74.5963861  -62.8347057
dispersal_qt   22.6590159   28.6506186
A side note: calling confint() instead of confint.default() is much, much slower -- my model has many other terms I'm not discussing here, and is fit to 300,000 data points -- but a check indicates that the intervals returned for my model by confint() are virtually identical, so this is not the source of my problem:
                    2.5 %       97.5 %
(Intercept)     8.7216693    8.9844958
dispersal     -37.5643922  -35.6652276
dispersal_sq   66.8121519   74.4534753
dispersal_cb  -74.5995820  -62.8377766
dispersal_qt   22.6592724   28.6509494
These tables show the problem: the 95% confidence limits for the quartic term are every bit as wide as the limits for the other terms. Since the quartic term coefficient gets multiplied by the fourth power of x, this means that the width of the confidence band starts out nice and narrow (when x is close to zero, where the width of the confidence band is pretty much just determined by the confidence limits on the intercept) but balloons out to be tremendously wide for larger values of x. The width of it looks quite unreasonable to me, given that my fit is constrained by 300,000 data points.
Intuitively, I would have expected the confidence limits for the quartic term to be much narrower than for, say, the linear term, because the effect of variation in the quartic term is so much larger. But the way confidence intervals are calculated in R, at least, does not seem to follow my instincts here. In other words, predicted values using the 2.5% value of the linear coefficient and the 97.5% value of the linear coefficient are really not very different, but predicted values using the 2.5% value of the quartic coefficient and the 97.5% value of the quartic coefficient are enormously, wildly divergent -- far beyond the range of variability in the original data, in fact. I feel quite confident, in a seat-of-the-pants sense, that the actual 95% limits of that quartic coefficient are much, much narrower than what R is giving me.
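To be concrete, here is roughly the comparison I mean, sketched on simulated stand-in data (not my real dispersal data; the coefficients and seed are made up for illustration):

```r
# Simulated stand-in for my fit, so the sketch is self-contained:
set.seed(1)
x <- runif(5000)
x_sq <- x^2; x_cb <- x^3; x_qt <- x^4
outcome <- rbinom(5000, 1, plogis(1 - 2*x + x_qt))
model <- glm(outcome ~ x + x_sq + x_cb + x_qt, binomial)
cis <- confint.default(model, level = 0.95)

xs <- seq(0, 1, length.out = 100)
X  <- cbind(1, xs, xs^2, xs^3, xs^4)

# Swap in just the 2.5% / 97.5% endpoints of the quartic coefficient:
lo <- hi <- coef(model)
lo["x_qt"] <- cis["x_qt", 1]
hi["x_qt"] <- cis["x_qt", 2]

# The two predicted curves (on the linear-predictor scale) diverge like
# xs^4 * (width of the quartic CI) -- largest at the top of the range:
range(X %*% hi - X %*% lo)
```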
Looking at it another way: if I just exclude the quartic term from the glm() fit, I get a dramatically narrower confidence band, even though the fit to the data is not nearly as good (I'm keeping the fourth power for good reasons). This again seems to me to indicate that the confidence band with the quartic term included is artificially, and incorrectly, wide. If a fit with reasonably narrow confidence limits can be produced for the model with terms only up to cubic (or, taking this reductio even further, for a quadratic or even a linear model), why does adding the quartic term, which improves the quality of the fit to the data, nevertheless cause the confidence limits to balloon outwards? That seems fundamentally illogical to me, but maybe my instincts are bad.
So what's going on here? Is this just a limitation of the way confint() computes confidence intervals? Is there a better function to use, that is compatible with binomial glm() fits? Or am I misinterpreting the meaning of the confidence limits in some way? Or what?
Thanks for any advice. I've had trouble finding examples of polynomial fits like this; people sometimes fit the square of the explanatory variable, of course, but going up to fourth powers seems to be quite uncommon, so I've had trouble finding out how others have dealt with these sorts of issues that crop up only with higher powers.
Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
Confidence intervals and polynomial fits
12 messages · Ben Haller, Joshua Wiley, David Winsemius +4 more
Hi Ben,
From what you have written, I am not exactly sure where your seat-of-the-pants sense is coming from. My pantseat typically does not tell me much; however, quartic trends tend to be less stable than linear ones, so I am not terribly surprised. As two side notes: x_qt <- x^4 is shorter code-wise, and you can certainly enter just the quartic term if the data are purely quartic and you are not really interested in the lower-order trends. Cheers, Josh
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Joshua Wiley Ph.D. Student, Health Psychology University of California, Los Angeles http://www.joshuawiley.com/
On May 6, 2011, at 11:35 AM, Ben Haller wrote:
Hi all! I'm getting a model fit from glm() (a binary logistic regression fit, but I don't think that's important) for a formula that contains powers of the explanatory variable up to fourth. So the fit looks something like this (typing into mail; the actual fit code is complicated because it involves step-down and so forth): x_sq <- x * x x_cb <- x * x * x x_qt <- x * x * x * x model <- glm(outcome ~ x + x_sq + x_cb + x_qt, binomial)
It might have been easier with:
model <- glm(outcome ~ poly(x, 4), binomial)
Since the authors of confint might have been expecting a poly()
formulation, the results might be more reliable. I'm just using lm(),
but I think the conclusions are more general:
> set.seed(123)
> x <- seq(0, 4, by=0.1)
> y <- seq(0, 4, by=0.1)^2 + rnorm(41)
> x2 <- x^2
> x3 <- x^3
> summary(lm(y~x+x2+x3 ) )
Call:
lm(formula = y ~ x + x2 + x3)
Residuals:
Min 1Q Median 3Q Max
-1.8528 -0.6146 -0.1164 0.6211 1.8923
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.45737 0.52499 0.871 0.3893
x -0.75989 1.15080 -0.660 0.5131
x2 1.30987 0.67330 1.945 0.0594 .
x3 -0.03559 0.11058 -0.322 0.7494
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9184 on 37 degrees of freedom
Multiple R-squared: 0.9691, Adjusted R-squared: 0.9666
F-statistic: 386.8 on 3 and 37 DF, p-value: < 2.2e-16
From this output I wouldn't have been able to understand why a highly
significant overall relationship (note the F-statistic) could not be
ascribed to any of the individual terms. See what happens with poly(x, 3):
> summary( lm(y~poly(x,3) ) )
Call:
lm(formula = y ~ poly(x, 3))
Residuals:
Min 1Q Median 3Q Max
-1.8528 -0.6146 -0.1164 0.6211 1.8923
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4271 0.1434 37.839 < 2e-16 ***
poly(x, 3)1 30.0235 0.9184 32.692 < 2e-16 ***
poly(x, 3)2 8.7823 0.9184 9.563 1.53e-11 ***
poly(x, 3)3 -0.2956 0.9184 -0.322 0.749
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9184 on 37 degrees of freedom
Multiple R-squared: 0.9691, Adjusted R-squared: 0.9666
F-statistic: 386.8 on 3 and 37 DF, p-value: < 2.2e-16
> confint( lm(y~poly(x,3) ) )
2.5 % 97.5 %
(Intercept) 5.136526 5.717749
poly(x, 3)1 28.162706 31.884351
poly(x, 3)2 6.921505 10.643149
poly(x, 3)3 -2.156431 1.565213
> confint(lm(y~x+x2+x3 ) )
2.5 % 97.5 %
(Intercept) -0.60636547 1.5211072
x -3.09164639 1.5718576
x2 -0.05437802 2.6741120
x3 -0.25964705 0.1884609
Neither version exactly captures the simulation input, which would
have shown only an effect of the quadratic, but I didn't do any
centering. At least the poly() version shows the lower-order terms up
to the quadratic as "significant", whereas it's difficult to extract
much that is meaningful from the separately calculated term version.
David Winsemius, MD
West Hartford, CT
From what you have written, I am not exactly sure where your seat-of-the-pants sense is coming from. My pantseat typically does not tell me much; however, quartic trends tend to be less stable than linear ones, so I am not terribly surprised.
My pantseat is not normally very informative either, but if you saw the width of the confidence limits I'm getting for the quartic coefficient, I think your pantseat would agree with mine. :-> The confidence band is staggeringly wide, many times the variation in the data itself; and with 300,000 data points to fit to, that just should not be the case. With that many data points, it seems quite obvious that one can be "95% confident" that the true relationship lies within a band somewhere reasonably close to the band that the sample data actually fall into, not a band many times wider.
As two side notes: x_qt <- x^4 is shorter code-wise, and you can certainly enter just the quartic term if the data are purely quartic and you are not really interested in the lower-order trends.
Yep, for sure. Thanks!

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
On May 6, 2011, at 12:31 PM, David Winsemius wrote:
It might have been easier with: model <- glm(outcome ~ poly(x, 4) , binomial)
Interesting. The actual model I'm fitting has lots more terms, and needs to be able to be stepped down; sometimes some of the terms of the polynomials will get dropped while others get retained, for example. But more to the point, poly() seems to be doing something tricky involving "orthogonal polynomials" that I don't understand. I don't think I want whatever that is; I want my original variable x, plus its powers. For example, if I do: x <- runif(10); poly(x, 3) the columns I get are not x, x^2, x^3; they are something else. So the poly() fit is not equivalent to my fit.
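To spell that out, a quick check (with a seed added for reproducibility) shows the columns poly() returns are centered and mutually orthogonal -- clearly not the raw powers:

```r
set.seed(1)
x <- runif(10)
P <- poly(x, 3)

round(colSums(P), 10)     # each column sums to zero (centered)
round(crossprod(P), 10)   # identity matrix: the columns are orthonormal
cor(P[, 1], x)            # the first column is just centered, rescaled x
```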
Since the authors of confint might have been expecting a poly() formulation the results might be more reliable. I'm just using lm() but I think the conclusions are more general:
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.45737 0.52499 0.871 0.3893
x -0.75989 1.15080 -0.660 0.5131
x2 1.30987 0.67330 1.945 0.0594 .
x3 -0.03559 0.11058 -0.322 0.7494
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4271 0.1434 37.839 < 2e-16 ***
poly(x, 3)1 30.0235 0.9184 32.692 < 2e-16 ***
poly(x, 3)2 8.7823 0.9184 9.563 1.53e-11 ***
poly(x, 3)3 -0.2956 0.9184 -0.322 0.749
Your illustration here underscores what I mean. Whatever these orthogonal polynomial terms are that you're using, they are clearly not the original x, x^2 and x^3, and they're giving you a different fit than those do. I probably ought to learn about this technique, since it looks interesting; but for my purposes I need the fit to actually be in terms of x, since x is my explanatory variable. And the fit I'm getting is highly significant (all terms < 2e-16), so the lack-of-fit problem you're illustrating does not seem to apply to my case. Or maybe I'm totally misunderstanding your point...? :-> Thanks!

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
FWIW: Fitting higher-order polynomials (say > 2) is almost always a bad idea. See e.g. the Hastie, Tibshirani, et al. book on "Statistical Learning" for a detailed explanation why. The Wikipedia entry on "smoothing splines" also contains a brief explanation, I believe. Your ~0 P values for the coefficients also suggest problems/confusion (!) -- perhaps you need to consider something along the lines of "functional data analysis" for your analysis. Having no knowledge of your issues, these remarks are entirely speculative and may well be wrong. So feel free to dismiss. Nevertheless, you may find it useful to consult your local statistician for help. Cheers, Bert P.S. The raw = TRUE option of poly() will fit raw, not orthogonal, polynomials. The fitted projections will be the same up to numerical error whichever basis is chosen, of course.
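That P.S. is easy to check with a small lm() sketch (made-up data, not Ben's): the two bases give different coefficients but the same fitted curve.

```r
set.seed(42)
x <- runif(50)
y <- 1 + x - 2 * x^2 + 0.5 * x^3 + rnorm(50, sd = 0.1)

fit_raw  <- lm(y ~ poly(x, 3, raw = TRUE))  # raw powers x, x^2, x^3
fit_orth <- lm(y ~ poly(x, 3))              # orthogonal basis

# Different coefficients, identical fitted values (up to rounding):
all.equal(fitted(fit_raw), fitted(fit_orth))  # TRUE
```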
"Men by nature long to get on to the ultimate truths, and will often be impatient with elementary studies or fight shy of them. If it were possible to reach the ultimate truths without the elementary studies usually prefixed to them, these would not be preparatory studies but superfluous diversions." -- Maimonides (1135-1204) Bert Gunter Genentech Nonclinical Biostatistics
On Fri, 6 May 2011, Bert Gunter wrote:
FWIW: Fitting higher-order polynomials (say > 2) is almost always a bad idea. See e.g. the Hastie, Tibshirani, et al. book on "Statistical Learning" for a detailed explanation why. The Wikipedia entry on "smoothing splines" also contains a brief explanation, I believe. Your ~0 P values for the coefficients also suggest problems/confusion (!) -- perhaps you need to consider something along the lines of "functional data analysis" for your analysis. Having no knowledge of your issues, these remarks are entirely speculative and may well be wrong. So feel free to dismiss. Nevertheless, you may find it useful to consult your local statistician for help.
That is the main piece of advice I would have given. But if you must DIY, consider the merits of orthogonal polynomials. Computing individual confidence intervals for highly correlated coefficients is a very dubious practice. Without the example the posting guide asked for, we can only guess whether that is what is happening.
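One quick way to see whether that is what is happening is to inspect the estimated correlation matrix of the coefficients; a small simulated sketch of the raw-powers formulation (made-up data, not the poster's):

```r
# With raw powers, the coefficient estimates are typically correlated
# near +/-1, which is what inflates their individual confidence intervals.
set.seed(1)
x <- runif(1000)
y <- rbinom(1000, 1, plogis(1 - 3*x + 2*x^2))
fit <- glm(y ~ x + I(x^2) + I(x^3) + I(x^4), binomial)
round(cov2cor(vcov(fit)), 2)
```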
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
On May 6, 2011, at 1:58 PM, Prof Brian Ripley wrote:
That is the main piece of advice I would have given. But if you must DIY, consider the merits of orthogonal polynomials. Computing individual confidence intervals for highly correlated coefficients is very dubious practice. Without the example the posting guide asked for, we can only guess if that is what is happening.
Thanks to both of you. Yes, I am admittedly out of my depth; the statistician I would normally ask is on sabbatical, and I'm a bit at sea. Of course McGill has a whole department of mathematics and statistics; I guess I ought to try to make a friend over there (I'm in the biology department). Anyhow, I've just downloaded the Hastie et al. book and will try to figure out whether my use of higher-order polynomials is incorrect in my situation. Eliminating those would certainly solve my problem with the confidence intervals. :->

I was figuring that the ~0 P-values for coefficients were just the result of my having 300,000 data points; I figured the regression procedure was thus able to pin down very accurate estimates of them. I'll look into "functional data analysis" as you recommend, though; I'm entirely unfamiliar with it.

As for correlated coefficients: x, x^2, x^3 etc. would obviously be highly correlated, for values close to zero. Is this what you mean, as a potential source of problems? Or if you mean that the various other terms in my model might be correlated with x, that is not the case; each independent variable is completely uncorrelated with the others (this data comes from simulations, so the independent variables for each data point were in fact chosen by random drawing).

It didn't seem easy to post an example, since my dataset is so large, but if either of you would be willing to look at this further, I could upload my dataset to a web server somewhere and post a link to it. In any case, thanks very much for your help; I'll look into the things you mentioned.

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
On May 6, 2011, at 4:16 PM, Ben Haller wrote:
As for correlated coefficients: x, x^2, x^3 etc. would obviously be highly correlated, for values close to zero.
Not just for x close to zero:

> cor( (10:20)^2, (10:20)^3 )
[1] 0.9961938
> cor( (100:200)^2, (100:200)^3 )
[1] 0.9966219
David Winsemius, MD
West Hartford, CT
On May 6, 2011, at 4:27 PM, David Winsemius wrote:
On May 6, 2011, at 4:16 PM, Ben Haller wrote:
As for correlated coefficients: x, x^2, x^3 etc. would obviously be highly correlated, for values close to zero.
Not just for x close to zero:
cor( (10:20)^2, (10:20)^3 )
[1] 0.9961938
cor( (100:200)^2, (100:200)^3 )
[1] 0.9966219
Wow, that's very interesting. Quite unexpected, for me. Food for thought. Thanks!

Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
On May 7, 2011, at 16:15, Ben Haller wrote:
Wow, that's very interesting. Quite unexpected, for me. Food for thought. Thanks!
Notice that because of the high correlations between the x^k, their parameter estimates will be correlated too. In practice, this means that the c.i. for the quartic term contains values for which you can compensate with the other coefficients and still have an acceptable fit to data. (Nothing strange about that; already in simple linear regression, you allow the intercept to change while varying the slope.)
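This point can be illustrated on simulated data (a stand-in model with hypothetical names, not the original 300,000-point fit). The correlations between the coefficient estimates are visible in `cov2cor(vcov(fit))`, and a pointwise confidence band computed from the full covariance matrix, via `predict(..., se.fit = TRUE)` on the link scale, is far narrower than combining the individual coefficient intervals would suggest:

```r
set.seed(42)
x <- runif(5000, 0, 2)
outcome <- rbinom(5000, 1, plogis(2 - 3 * x + x^2))
fit <- glm(outcome ~ x + I(x^2) + I(x^3) + I(x^4), family = binomial)

## Estimates for the raw powers are themselves strongly correlated:
round(cov2cor(vcov(fit)), 2)   # large off-diagonal entries

## Pointwise 95% band for the fitted curve, computed on the link scale
## and then transformed back to the probability scale:
grid <- data.frame(x = seq(0, 2, length.out = 50))
pr <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
band <- plogis(cbind(lower = pr$fit - 1.96 * pr$se.fit,
                     upper = pr$fit + 1.96 * pr$se.fit))
head(band)
```

The band stays narrow even where x^4 is large, precisely because the negative correlations among the coefficient estimates cancel most of the apparent uncertainty.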
Peter Dalgaard
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com
From: pdalgd at gmail.com
Date: Sun, 8 May 2011 09:33:23 +0200
To: rhelp at sticksoftware.com
CC: r-help at r-project.org
Subject: Re: [R] Confidence intervals and polynomial fits

Peter Dalgaard wrote:
Notice that because of the high correlations between the x^k, their parameter estimates will be correlated too. In practice, this means that the c.i. for the quartic term contains values for which you can compensate with the other coefficients and still have an acceptable fit to data. (Nothing strange about that; already in simple linear regression, you allow the intercept to change while varying the slope.)
I was trying to compose a longer message, but at least for even/odd powers it isn't hard to find a set of x values for which the correlation is zero, or to find a set of points that make sines of different frequencies have non-zero correlation. This highlights the fact that the computer isn't magic: it needs data that make the basis functions distinguishable from each other. For background, you probably want to look up "Taylor series" and "orthogonal basis."

I would also suggest using R to add noise to your input and see what that does to your predictions, or to take simple known data and add noise, although in principle you can do this analytically. You can always project a signal onto some subspace and get an estimate of how good your estimate is, but that is different from asking how well you can reconstruct your signal from a bunch of projections. If you want to know "what can I infer about the slope of my curve at x = a," that is a specific question about one coefficient, though at this point statisticians can elaborate on various issues with the other things you ignore.

Also, you said something about the powers being correlated near x = 0, but you can change your origin, writing (x - a)^n and expanding it as a finite series in powers of x, to see what happens there. (And if you are composing mail in Hotmail, beware: it may treat a less-than sign as HTML even in plain-text email...)
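A small sketch of the even/odd and change-of-origin points above, on toy vectors:

```r
## For x values symmetric about 0, even and odd powers are exactly
## uncorrelated (the odd-power cross terms cancel by symmetry):
x <- -10:10
cor(x^2, x^3)          # essentially 0

## Centering (changing the origin) achieves the same thing for an
## asymmetric range:
y <- 10:20
cor(y^2, y^3)          # ~0.996, nearly collinear
yc <- y - mean(y)      # expand around a = mean(y) instead of 0
cor(yc^2, yc^3)        # essentially 0 after centering
```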
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.