Dear R-help
This is a follow-up to my previous post here:
http://groups.google.com/group/r-help-archive/browse_thread/thread/d9b6f87ce06a9fb7/e9be30a4688f239c?lnk=gst&q=dobomode#e9be30a4688f239c
I am working on developing an open-source automated system for running
batch-regressions on very large datasets. In my previous post, I posed
the question of obtaining VIF's from the output of BIGLM. With a lot
of help from Assoc. Professor, Biostatistics Thomas Lumley at
University of Washington, I was able to make significant progress, but
ultimately got stuck. The following post describes the steps and
reasoning I undertook in trying to accomplish this task. Please note
that I am not a statistician so ignore any commentary that seems naive
to you.
A quick intro. The goal is to obtain VIF's (variance inflation
factors) from the regression output of BIGLM. Traditionally, this has
been possible with the regular lm() function. Follows a quick
illustration (the model below is pretty silly, only for illustration
purposes).
Example dataset:
Call:
lm(formula = model, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.4506 -1.6044 -0.1196 1.2193 4.6271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337 18.71788 0.657 0.5181
cyl -0.11144 1.04502 -0.107 0.9161
disp 0.01334 0.01786 0.747 0.4635
hp -0.02148 0.02177 -0.987 0.3350
drat 0.78711 1.63537 0.481 0.6353
wt -3.71530 1.89441 -1.961 0.0633 .
qsec 0.82104 0.73084 1.123 0.2739
vs 0.31776 2.10451 0.151 0.8814
am 2.52023 2.05665 1.225 0.2340
gear 0.65541 1.49326 0.439 0.6652
carb -0.19942 0.82875 -0.241 0.8122
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
VIF's:
vif(reg_lm)
cyl disp hp drat wt qsec
vs am gear carb
15.373833 21.620241 9.832037 3.374620 15.164887 7.527958 4.965873
4.648487 5.357452 7.908747
Here is the definition of vif() (courtesy of
http://www.stat.sc.edu/~hitchcock/bodyfatRexample.txt):
vif.lm
function(object, ...) {
V <- summary(object)$cov.unscaled
Vi <- crossprod(model.matrix(object))
nam <- names(coef(object))
if(k <- match("(Intercept)", nam, nomatch = F)) {
v1 <- diag(V)[-k]
v2 <- (diag(Vi)[-k] - Vi[k, -k]^2/Vi[k,k])
nam <- nam[-k]
} else {
v1 <- diag(V)
v2 <- diag(Vi)
warning("No intercept term detected. Results may
surprise.")
}
structure(v1*v2, names = nam)
}
My understanding is that the function uses the unscaled variance-
covariance matrix V and the model matrix Vi to compute the VIF's based
on the equations above.
Now, attempting to do the same using biglm().
Large data regression model: biglm(model, mtcars)
Sample size = 32
Coef (95% CI) SE p
(Intercept) 12.303 -25.132 49.739 18.718 0.511
cyl -0.111 -2.201 1.979 1.045 0.915
disp 0.013 -0.022 0.049 0.018 0.455
hp -0.021 -0.065 0.022 0.022 0.324
drat 0.787 -2.484 4.058 1.635 0.630
wt -3.715 -7.504 0.074 1.894 0.050
qsec 0.821 -0.641 2.283 0.731 0.261
vs 0.318 -3.891 4.527 2.105 0.880
am 2.520 -1.593 6.634 2.057 0.220
gear 0.655 -2.331 3.642 1.493 0.661
carb -0.199 -1.857 1.458 0.829 0.810
The challenge is to adapt the vif() function to work with biglm. I was
able to obtain the model matrix Vi using the following method.
biglm supports the calculation of the Huber/White sandwich covariance
matrix:
Out of this, I can extract the model-based scaled variance-covariance
matrix:
v.scaled <- attr(vcov(reg_biglm), "model-based")
Based on my research (big thanks to Assoc. Professor, Biostatistics
Thomas Lumley at University of Washington):
sandwich = v.scaled ? model_matrix ? v.scaled
It is the model_matrix I need to obtain to compute the VIF's. I do so
by inverting v.scaled and moving them to the other side of the
equation.
At this point, I have both the variance-covariance matrix and the
model matrix, so I should be able to compute the VIF's using the
function above. However, on examining the contents of the variance-
covariance matrix, I see that the one from the lm() function is
scaled, while the one I obtained from the sandwich calculations is
not:
(Intercept) cyl disp hp drat
wt qsec vs am gear carb
(Intercept) 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
cyl 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
disp 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
hp 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
drat 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
wt 7.023544 7.023544 7.023544 7.023544 7.023544 7.023544
7.023544 7.023544 7.023544 7.023544 7.023544
This factor turns out to be the squared value of the regression's
sigma:
summary(reg_lm)$sigma ^ 2
[1] 7.023544
Sigma is defined as:
the square root of the estimated variance of the random error
sigma^2 = 1/(n-p) Sum(w[i] R[i]^2),
where R[i] is the i-th residual, residuals[i].
The ultimate challenge is that sigma is returned by lm(), but not by
biglm(). To calcualte sigma, it looks like I would need to compute
the residuals for the entire dataset which defeats the purpose of biglm
(). Does anybody know if there is an easier way of obtaining sigma?
Any guidance on this subject would be greatly appreciated!