Orthogonal polynomials used by R

Dear Peter and John,

Many thanks for your prompt replies.

Here is what I was trying to do. I was trying to build a statistical model
of a given time series using the Box-Jenkins methodology. The series has 93
data points. Before I analyse the ACF and PACF, I need to de-trend the
series, which seems to have an upward trend. I wanted to find out what
order of polynomial I should fit to the series without overfitting. For
this I want to use orthogonal polynomials (I think someone on the internet
was talking about preventing overfitting by using orthogonal polynomials).
This seems to me like a poor man's cross-validation.

So my plan is to keep increasing the degree of the orthogonal polynomial
until the coefficient of the highest-degree term becomes statistically
insignificant.
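
A minimal sketch of this stopping rule, assuming simulated stand-in data (93 points with a quadratic trend) rather than the actual series. The names `max_degree`, `alpha`, `p_last` and `chosen` are illustrative only, and the 0.05 cutoff is an arbitrary choice:

```r
# Sketch: raise the polynomial degree until the highest-order term
# is no longer significant. Simulated stand-in data, not the GDP series.
set.seed(1)
x <- 1:93
y <- 5 + 0.8 * x + 0.05 * x^2 + rnorm(93, sd = 10)

max_degree <- 6   # arbitrary upper bound for the search
alpha <- 0.05     # arbitrary significance cutoff
p_last <- numeric(max_degree)
for (d in 1:max_degree) {
  fit <- lm(y ~ poly(x, d))
  # p-value of the highest-degree orthogonal term
  # (last row of the coefficient table)
  p_last[d] <- summary(fit)$coefficients[d + 1, "Pr(>|t|)"]
}

# Candidate degree: the one just before the first insignificant top term
first_insig <- which(p_last > alpha)[1]
chosen <- if (is.na(first_insig)) max_degree else first_insig - 1
print(signif(p_last, 3))
print(chosen)
```

Because the columns produced by poly() are orthogonal to each other and to the intercept, the lower-order coefficient estimates do not change as the degree is raised, which is what makes testing only the last term convenient.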

Note : If I do NOT use orthogonal polynomials, I will overfit the data set
and I don't think that is a good way to detect the true order of the
polynomial.

Also, now that I have detrended the series and built an ARIMA model of the
residuals, I want to forecast. For this I need to use the original
polynomials and their coefficients.
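
One way to extend the trend for forecasting, sketched under the assumption of simulated stand-in data, a degree of 2 and a 7-step horizon (all arbitrary): predict() on a model fitted with poly() rebuilds the orthogonal basis for new x values, and poly(..., raw = TRUE) gives the coefficients of the ordinary powers of x.

```r
# Sketch: the same trend in orthogonal and raw polynomial bases,
# plus extrapolation. Simulated stand-in data, not the GDP series.
set.seed(2)
dat <- data.frame(x = 1:93)
dat$y <- 5 + 0.8 * dat$x + 0.05 * dat$x^2 + rnorm(93, sd = 10)

fit_orth <- lm(y ~ poly(x, 2), data = dat)              # orthogonal basis
fit_raw  <- lm(y ~ poly(x, 2, raw = TRUE), data = dat)  # 1, x, x^2

# Both bases span the same column space, so the fitted trends agree
print(all.equal(unname(fitted(fit_orth)), unname(fitted(fit_raw))))

# predict() reconstructs the basis for new x from the stored poly() attributes
new_x <- data.frame(x = 94:100)
trend_orth <- predict(fit_orth, newdata = new_x)
trend_raw  <- predict(fit_raw,  newdata = new_x)
print(all.equal(unname(trend_orth), unname(trend_raw)))

# Coefficients on the raw scale (intercept, x, x^2)
print(coef(fit_raw))
```

The extrapolated trend can then be added to the ARIMA forecast of the residuals.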

I hope I was clear and that my methodology is ok.

I have another query:

Note: If I use cross-validation to determine the order of the polynomial,
I don't get a clear answer.

See here:
library(boot)  # provides cv.glm()
mydf = data.frame(cbind(gdp,x))
# 10-fold CV prediction error (delta[1]) for poly() fits of degree 1 to 6
d<-(c(
cv.glm(data = mydf,glm(gdp~x),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,2)),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,3)),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,4)),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,5)),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~poly(x,6)),K=10)$delta[1]))
print(d)
## [1] 2.178574e+13 7.303031e+11 5.994783e+11 4.943586e+11 4.596648e+11
## [6] 4.980159e+11

# Here it chooses 5. (but 4 and 5 are kind of similar).
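
How similar degrees 4 and 5 really are depends partly on the random fold assignment. A minimal sketch, assuming simulated stand-in data rather than the GDP series, that repeats the 10-fold CV to show the spread of delta[1] across splits (the 20 repeats and the names `sim`, `n_rep` and `cv_runs` are illustrative):

```r
library(boot)  # provides cv.glm()

# Sketch: repeat 10-fold CV with different random splits to see how
# much delta[1] moves. Simulated stand-in data, not the GDP series.
set.seed(3)
sim <- data.frame(x = 1:93)
sim$gdp <- 5 + 0.8 * sim$x + 0.05 * sim$x^2 + rnorm(93, sd = 10)

n_rep <- 20  # arbitrary number of repeated CV runs
cv_runs <- matrix(NA_real_, nrow = n_rep, ncol = 6,
                  dimnames = list(NULL, paste0("deg", 1:6)))
for (r in seq_len(n_rep)) {
  for (d in 1:6) {
    fit <- glm(gdp ~ poly(x, d), data = sim)
    # each call to cv.glm() draws a fresh random fold assignment
    cv_runs[r, d] <- cv.glm(data = sim, glmfit = fit, K = 10)$delta[1]
  }
}

# Mean and standard deviation of the CV error per degree across repeats
print(round(colMeans(cv_runs), 1))
print(round(apply(cv_runs, 2, sd), 1))
```

If the difference between two degrees is small relative to the standard deviation across repeats, a single CV run cannot separate them.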


# Same comparison with the terms written as x, x^2, ... instead of poly()
d1 <- (c(
cv.glm(data = mydf,glm(gdp~1+x),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2+x^3),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2+x^3+x^4),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2+x^3+x^4+x^5),K=10)$delta[1],
cv.glm(data = mydf,glm(gdp~1+x+x^2+x^3+x^4+x^5+x^6),K=10)$delta[1]))

print(d1)
## [1] 2.149647e+13 2.253999e+13 2.182175e+13 2.177170e+13 2.198675e+13
## [6] 2.145754e+13

# here it chooses 1 or 6

Query: Why does it choose 1? Is this just round-off noise, or noise due to
sampling error introduced by cross-validation when it creates the K folds?
Or is it due to an ill-conditioned model matrix?
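
One way to probe the model-matrix question is to inspect the design matrices behind the two formula styles directly. A sketch, assuming x is a plain time index 1 to 93 (the post does not show how x was built); `X_caret`, `X_raw` and `X_orth` are illustrative names:

```r
# Sketch: compare the design matrices produced by the two formula styles.
# x is an assumed stand-in time index; the actual x is not shown.
x <- 1:93

# Inside a formula, ^ is the crossing operator, not arithmetic power,
# so x^2 is the same term as x
X_caret <- model.matrix(~ 1 + x + x^2)
print(colnames(X_caret))

# I() protects the arithmetic and gives true raw power columns
X_raw <- model.matrix(~ 1 + x + I(x^2) + I(x^3))
print(colnames(X_raw))

# Orthogonal basis of the same degree
X_orth <- model.matrix(~ poly(x, 3))

# 2-norm condition numbers: large values indicate ill-conditioning
print(kappa(X_raw, exact = TRUE))
print(kappa(X_orth, exact = TRUE))
```

Since `x^2` without `I()` adds no new column, formulas such as `gdp~1+x+x^2+x^3` would describe the same straight-line model as `gdp~1+x`. That is consistent with every value in d1 being close to the first value in d, with the remaining differences coming from the random folds.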

Best Regards,
Ashim.
On Wed, Nov 27, 2019 at 10:41 PM Fox, John <jfox at mcmaster.ca> wrote: