dear R experts:
I am contemplating the logic in R's lm() and summary(lm()) statements.
the reason is that I want to extend the functionality of lm to give
me both standardized coefficients and newey-west standard errors and
Ts. I have the code and can stick it at the end of the lm() function
(and for others, I have included my amateurish coding below)---it
works. the big advantage of having the code directly in lm() is that
it frees me from having to reconstruct the X matrix, which is not
trivial given aliases, cross-terms, etc. but I now have some more
philosophical questions. what is the "R way" of extending linear
models? and why is not everything from the summary.lm object in the
lm object to begin with?
I first thought that lm() is trying to store only the basic
information, summary.lm() adds to it, and print.summary.lm() formats
it for the screen. but this is not really true.
x=rnorm(1000); y=rnorm(1000); z=rnorm(1000)
object.size( lm( y ~ x*z ))
271552 bytes
object.size( summary(lm( y ~ x*z )))
73912 bytes
the summary.lm object carries less information than the lm object.
looking at the code in summary.lm(), there does not seem to be
anything that would not easily fit into lm() itself. moreover,
summary.lm must work without x=TRUE in the lm() invocation. so, doing
(my) calculation inside summary.lm() seems harder than doing it inside
lm().
in this case, why not have the contents of summary.lm() inside lm() to
begin with? it would eliminate one layer of complexity. if the user
does not want standard errors because calculation could take too much
time (why? where?), then the lm() function could have a
standard.errors=TRUE argument.
so, primarily I don't understand the logic why R has both an lm and a
summary.lm object. secondarily, I don't understand why my own
additions should not go into lm(), given that I think the the other
coefficients (standard errors, etc.) should, too. third, should I
worry that R's built-in lm() function could change so I should not
replace it? (without a hook in the end, I don't think lm() is easy to
wrap. I need the X matrix. I could write a wrapper function to
invoke lm() with x=TRUE, then do what I need to do, and then remove
'$x' from the lm object. I would just carry more bytes along during
my calculations.)
I hope I am not imposing too much here...
/iaw
--------------
## standardized coefficients, newey-west standard errors, and a hook
for further enhancements
if (stdcoefs) z$stdcoefs <- z$coefficients*apply(x,2,sd)/sd(mf$y)
if (newey.west>=0) {
x.na.omitted <- x
r.na.omitted <- residuals(z)
diagband.matrix <- function(m, ar.terms) {
nomit <- m - ar.terms - 1
mm <- matrix(TRUE, nrow = m, ncol = m)
mm[1:nomit, (ncol(mm) - nomit + 1):ncol(mm)] <-
(lower.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm[(ncol(mm) - nomit + 1):ncol(mm), 1:nomit] <-
(upper.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm
}
invx <- chol2inv(chol(crossprod(x.na.omitted)))
invx.x <- invx %*% t(x.na.omitted)
if (newey.west==0)
resid.matrix <- diag(r.na.omitted^2)
else {
full <- r.na.omitted %*% t(r.na.omitted)
maskmatrix <- diagband.matrix(length(r.na.omitted), newey.west)
resid.matrix <- full * maskmatrix
}
vmat <- invx.x %*% resid.matrix %*% t(invx.x)
z$nw <- newey.west ## the number of AR terms
z$nw.se <- sqrt(diag(vmat)) ## the standard errors
}
if (!is.null(lm.hook)) lm.hook() ## has access to everything that
lm() has already computed
----
Ivo Welch (ivo.welch at gmail.com)
I ended up wrapping my own new "ols()" function in the end. it is my
replacement for lm() and summary.lm. this way, I don't need to alter
internals. in case someone else needs it, it is included. of course,
feel free to ignore.
docs[["ols"]] <- c(Rd= '
@TITLE ols.R
@AUTHOR ivo.welch at gmail.com
@DATE 2013
@DESCRIPTION
adds newey-west and stdandardized coefficients to the lm function,
and adds the summary.lm information at the same time.
@USAGE ols(..., newey.west=0, stdcoefs=TRUE)
@ARGUMENTS
@DETAILS
@SEEALSO
@EXAMPLES
', test= '
x <- rnorm(12); y <- rnorm(12); z <- rnorm(12); x[2] <-NA;
ols( y ~ x + z )
', changes= '
')
ols <- function (..., x = FALSE, newey.west=(0), stdcoefs=TRUE) {
## R is painfully error-tolerant. I prefer reasonable and immediate
error warnings.
stopifnot( (is.vector(newey.west))&(length(newey.west)==1)|(is.numeric(newey.west))
)
stopifnot( (is.vector(stdcoefs))&(length(stdcoefs)==1)|(is.logical(stdcoefs))
)
stopifnot( (is.vector(x))&(length(x)==1)|(is.logical(x)) )
## I wish I could check lm()'s argument, but I cannot.
lmo <- lm(..., x=TRUE)
## note that both the x matrix and the residuals from the model have
their NA's omitted by default
if (newey.west>=0) {
resids <- residuals(lmo)
diagband.matrix <- function(m, ar.terms) {
nomit <- m - ar.terms - 1
mm <- matrix(TRUE, nrow = m, ncol = m)
mm[1:nomit, (ncol(mm) - nomit + 1):ncol(mm)] <-
(lower.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm[(ncol(mm) - nomit + 1):ncol(mm), 1:nomit] <-
(upper.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm
}
invx <- chol2inv(chol(crossprod(lmo$x)))
invx.x <- invx %*% t(lmo$x)
if (newey.west==0)
resid.matrix <- diag(resids^2)
else {
full <- resids %*% t(resids)
maskmatrix <- diagband.matrix(length(resids), newey.west)
resid.matrix <- full * maskmatrix
}
vmat <- invx.x %*% resid.matrix %*% t(invx.x)
nw <- newey.west ## the number of AR terms
nw.se <- sqrt(diag(vmat)) ## the standard errors
}
if (stdcoefs) stdcoefs.v <- lmo$coefficients*apply(lmo$x,2,sd)/sd(lmo$model$y)
full.x.matrix <- if (x) lmo$x else NULL
lmo <- summary(lmo) ## the summary.lm object
if (x) lmo$x <- full.x.matrix
if (stdcoefs) {
lmo$coefficients <- cbind(lmo$coefficients, stdcoefs.v )
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <- "stdcoefs"
}
if (newey.west>=0) {
lmo$coefficients <- cbind(lmo$coefficients, nw.se)
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <-
paste0("se.nw(",newey.west,")")
lmo$coefficients <- cbind(lmo$coefficients, lmo$coefficients[,1]/nw.se)
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <-
paste0("Tse.nw(",newey.west,")")
}
lmo
}
I ended up wrapping my own new "ols()" function in the end. it is my
replacement for lm() and summary.lm. this way, I don't need to alter
internals. in case someone else needs it, it is included. of course,
feel free to ignore.
If you use NeweyWest() from "sandwich", the code becomes much shorter:
ols2 <- function (..., newey.west= 0, stdcoefs = TRUE)
{
## model and summary
m <- lm(...)
rval <- summary(m)
## Newey-West standard errors and t-statistic
nw <- sqrt(diag(NeweyWest(m, lag = newey.west, prewhite = FALSE)))
rval$coefficients <- cbind(rval$coefficients,
"NW" = nw, "t (NW)" = coef(m)/nw)
## standardized coefficients
if(stdcoefs) rval$coefficients <- cbind(rval$coefficients, "Std. Coef" =
coef(m) * apply(model.matrix(m), 2, sd) /
sd(model.response(model.frame(m))))
return(rval)
}
The ols2(y ~ x + z) produces output equivalent to ols(y ~ x + z).
Personally, I always just use
m <- lm(....)
coeftest(m, vcov = NeweyWest)
to get the coefficient tests with Newey-West standard errors. By default,
this uses automatic lag selection and prewhitening but you can switch both
off if you want:
coeftest(m, vcov = NeweyWest(m, lag = 0, prewhite = FALSE))
Best,
Z
docs[["ols"]] <- c(Rd= '
@TITLE ols.R
@AUTHOR ivo.welch at gmail.com
@DATE 2013
@DESCRIPTION
adds newey-west and stdandardized coefficients to the lm function,
and adds the summary.lm information at the same time.
@USAGE ols(..., newey.west=0, stdcoefs=TRUE)
@ARGUMENTS
@DETAILS
@SEEALSO
@EXAMPLES
', test= '
x <- rnorm(12); y <- rnorm(12); z <- rnorm(12); x[2] <-NA;
ols( y ~ x + z )
', changes= '
')
ols <- function (..., x = FALSE, newey.west=(0), stdcoefs=TRUE) {
## R is painfully error-tolerant. I prefer reasonable and immediate
error warnings.
stopifnot( (is.vector(newey.west))&(length(newey.west)==1)|(is.numeric(newey.west))
)
stopifnot( (is.vector(stdcoefs))&(length(stdcoefs)==1)|(is.logical(stdcoefs))
)
stopifnot( (is.vector(x))&(length(x)==1)|(is.logical(x)) )
## I wish I could check lm()'s argument, but I cannot.
lmo <- lm(..., x=TRUE)
## note that both the x matrix and the residuals from the model have
their NA's omitted by default
if (newey.west>=0) {
resids <- residuals(lmo)
diagband.matrix <- function(m, ar.terms) {
nomit <- m - ar.terms - 1
mm <- matrix(TRUE, nrow = m, ncol = m)
mm[1:nomit, (ncol(mm) - nomit + 1):ncol(mm)] <-
(lower.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm[(ncol(mm) - nomit + 1):ncol(mm), 1:nomit] <-
(upper.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm
}
invx <- chol2inv(chol(crossprod(lmo$x)))
invx.x <- invx %*% t(lmo$x)
if (newey.west==0)
resid.matrix <- diag(resids^2)
else {
full <- resids %*% t(resids)
maskmatrix <- diagband.matrix(length(resids), newey.west)
resid.matrix <- full * maskmatrix
}
vmat <- invx.x %*% resid.matrix %*% t(invx.x)
nw <- newey.west ## the number of AR terms
nw.se <- sqrt(diag(vmat)) ## the standard errors
}
if (stdcoefs) stdcoefs.v <- lmo$coefficients*apply(lmo$x,2,sd)/sd(lmo$model$y)
full.x.matrix <- if (x) lmo$x else NULL
lmo <- summary(lmo) ## the summary.lm object
if (x) lmo$x <- full.x.matrix
if (stdcoefs) {
lmo$coefficients <- cbind(lmo$coefficients, stdcoefs.v )
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <- "stdcoefs"
}
if (newey.west>=0) {
lmo$coefficients <- cbind(lmo$coefficients, nw.se)
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <-
paste0("se.nw(",newey.west,")")
lmo$coefficients <- cbind(lmo$coefficients, lmo$coefficients[,1]/nw.se)
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <-
paste0("Tse.nw(",newey.west,")")
}
lmo
}
coeftest with sandwich is indeed powerful and probably faster. I see
one drawback: it requires at least three more packages: lmtest,
sandwich, and in turn zoo, which do not come with standard R. I also
wonder whether I committed a bug in my own code, or whether there is
another parameter I need to specify in sandwich. my standard error
estimates are very similar but not the same.
if anyone wants to use my code, I am dropping a revised version in
below. it also uses a better way to document the function inline.
eval(parse(text=(docs[["lm"]][["examples"]]))) is nice. you get what
you pay for.
more generally, I am still wondering why we have an lm and a
summary.lm function, rather than just one function and one resulting
object for parsimony, given that the summary.lm function is fast and
does not increase the object size.
----
Ivo Welch (ivo.welch at gmail.com)
docs[["lm"]] <- c(
author='ivo.welch at gmail.com',
date='2013',
description='add newey-west and standardized coefficients to lm(),
and return the summary.lm',
usage='lm(..., newey.west=0, stdcoefs=TRUE)',
arguments='',
details='
Note that when y or x have different observations, the
coef(lm(y~x))*sd(x)/sd(y)
calculations are different from a scale(y) ~ scale(x) regression.
Note that the NeweyWest statistics I get from the sandwich package
are different.
could be my bug, or could be my problem specifying an additional parameter.
Here is my code
library(sandwich)
library(lmtest)
print(coeftest(lmo, vcov = NeweyWest, lag=0, prewhite=FALSE))
',
seealso='stats:::lm, stats:::summary.lm',
examples='
x <- rnorm(12); y <- rnorm(12); z <- rnorm(12); x[2] <-NA;
lm( y ~ x + z )
',
test='eval(parse(text=(docs[["lm"]][["examples"]])))',
changes= '',
version='0.91'
)
################################################################
lm <- function (..., x = FALSE, newey.west=(0), stdcoefs=TRUE) {
## R is painfully error-tolerant. I prefer reasonable and immediate
error warnings.
stopifnot( (is.vector(newey.west))&(length(newey.west)==1)|(is.numeric(newey.west))
)
stopifnot( (is.vector(stdcoefs))&(length(stdcoefs)==1)|(is.logical(stdcoefs))
)
stopifnot( (is.vector(x))&(length(x)==1)|(is.logical(x)) )
## I wish I could check lm()'s argument, but I cannot.
lmo <- stats:::lm(..., x=TRUE)
## note that both the x matrix and the residuals from the model have
their NA's omitted by default
if (newey.west>=0) {
resids <- residuals(lmo)
diagband.matrix <- function(m, ar.terms) {
nomit <- m - ar.terms - 1
mm <- matrix(TRUE, nrow = m, ncol = m)
mm[1:nomit, (ncol(mm) - nomit + 1):ncol(mm)] <-
(lower.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm[(ncol(mm) - nomit + 1):ncol(mm), 1:nomit] <-
(upper.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm
}
invx <- chol2inv(chol(crossprod(lmo$x)))
invx.x <- invx %*% t(lmo$x)
if (newey.west==0)
resid.matrix <- diag(resids^2)
else {
full <- resids %*% t(resids)
maskmatrix <- diagband.matrix(length(resids), newey.west)
resid.matrix <- full * maskmatrix
}
vmat <- invx.x %*% resid.matrix %*% t(invx.x)
nw <- newey.west ## the number of AR terms
nw.se <- sqrt(diag(vmat)) ## the standard errors
}
if (stdcoefs) {
xsd <- apply( lmo$x, 2, sd)
ysd <- sd( lmo$model[,1] )
stdcoefs.v <- lmo$coefficients*xsd/ysd
}
full.x.matrix <- if (x) lmo$x else NULL
lmo <- stats:::summary.lm(lmo) ## add the summary.lm object
if (x) lmo$x <- full.x.matrix
if (stdcoefs) {
lmo$coefficients <- cbind(lmo$coefficients, stdcoefs.v )
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <- "stdcoefs"
}
if (newey.west>=0) {
lmo$coefficients <- cbind(lmo$coefficients, nw.se)
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <-
paste0("se.nw(",newey.west,")")
lmo$coefficients <- cbind(lmo$coefficients, lmo$coefficients[,1]/nw.se)
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <-
paste0("T.nw(",newey.west,")")
}
lmo
}
attr(lm, "original") <- stats:::lm
coeftest with sandwich is indeed powerful and probably faster. I see
one drawback: it requires at least three more packages: lmtest,
sandwich, and in turn zoo, which do not come with standard R.
But they should be easy enough to install...
I also wonder whether I committed a bug in my own code, or whether there
is another parameter I need to specify in sandwich. my standard error
estimates are very similar but not the same.
Hmmm, I get the same results (up to machine precision):
## data and model
set.seed(1)
x <- rnorm(12); y <- rnorm(12); z <- rnorm(12); x[2] <-NA
m <- lm(y ~ x + z)
## standard errors
nw1 <- ols(y ~ x + z)$coefficients[, 6]
nw2 <- sqrt(diag(NeweyWest(m, lag = 0, prewhite = FALSE)))
And then I get:
R> nw1 - nw2
(Intercept) x z
0.000000e+00 -1.110223e-16 -2.775558e-17
By the way: With lag=0, these are equivalent to the basic sandwich
standard errors and to the HC0 standard errors:
nw3 <- sqrt(diag(sandwich(m)))
nw4 <- sqrt(diag(vcovHC(m, type = "HC0")))
And then:
R> nw3 - nw2
(Intercept) x z
0 0 0
R> nw4 - nw2
(Intercept) x z
0 0 0
if anyone wants to use my code, I am dropping a revised version in
below. it also uses a better way to document the function inline.
eval(parse(text=(docs[["lm"]][["examples"]]))) is nice. you get what
you pay for.
more generally, I am still wondering why we have an lm and a
summary.lm function, rather than just one function and one resulting
object for parsimony, given that the summary.lm function is fast and
does not increase the object size.
When running many regressions, it may still be useful to not run the
summary every time, I guess.
Best,
Z
----
Ivo Welch (ivo.welch at gmail.com)
docs[["lm"]] <- c(
author='ivo.welch at gmail.com',
date='2013',
description='add newey-west and standardized coefficients to lm(),
and return the summary.lm',
usage='lm(..., newey.west=0, stdcoefs=TRUE)',
arguments='',
details='
Note that when y or x have different observations, the
coef(lm(y~x))*sd(x)/sd(y)
calculations are different from a scale(y) ~ scale(x) regression.
Note that the NeweyWest statistics I get from the sandwich package
are different.
could be my bug, or could be my problem specifying an additional parameter.
Here is my code
library(sandwich)
library(lmtest)
print(coeftest(lmo, vcov = NeweyWest, lag=0, prewhite=FALSE))
',
seealso='stats:::lm, stats:::summary.lm',
examples='
x <- rnorm(12); y <- rnorm(12); z <- rnorm(12); x[2] <-NA;
lm( y ~ x + z )
',
test='eval(parse(text=(docs[["lm"]][["examples"]])))',
changes= '',
version='0.91'
)
################################################################
lm <- function (..., x = FALSE, newey.west=(0), stdcoefs=TRUE) {
## R is painfully error-tolerant. I prefer reasonable and immediate
error warnings.
stopifnot( (is.vector(newey.west))&(length(newey.west)==1)|(is.numeric(newey.west))
)
stopifnot( (is.vector(stdcoefs))&(length(stdcoefs)==1)|(is.logical(stdcoefs))
)
stopifnot( (is.vector(x))&(length(x)==1)|(is.logical(x)) )
## I wish I could check lm()'s argument, but I cannot.
lmo <- stats:::lm(..., x=TRUE)
## note that both the x matrix and the residuals from the model have
their NA's omitted by default
if (newey.west>=0) {
resids <- residuals(lmo)
diagband.matrix <- function(m, ar.terms) {
nomit <- m - ar.terms - 1
mm <- matrix(TRUE, nrow = m, ncol = m)
mm[1:nomit, (ncol(mm) - nomit + 1):ncol(mm)] <-
(lower.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm[(ncol(mm) - nomit + 1):ncol(mm), 1:nomit] <-
(upper.tri(matrix(TRUE, nrow = nomit, ncol = nomit)))
mm
}
invx <- chol2inv(chol(crossprod(lmo$x)))
invx.x <- invx %*% t(lmo$x)
if (newey.west==0)
resid.matrix <- diag(resids^2)
else {
full <- resids %*% t(resids)
maskmatrix <- diagband.matrix(length(resids), newey.west)
resid.matrix <- full * maskmatrix
}
vmat <- invx.x %*% resid.matrix %*% t(invx.x)
nw <- newey.west ## the number of AR terms
nw.se <- sqrt(diag(vmat)) ## the standard errors
}
if (stdcoefs) {
xsd <- apply( lmo$x, 2, sd)
ysd <- sd( lmo$model[,1] )
stdcoefs.v <- lmo$coefficients*xsd/ysd
}
full.x.matrix <- if (x) lmo$x else NULL
lmo <- stats:::summary.lm(lmo) ## add the summary.lm object
if (x) lmo$x <- full.x.matrix
if (stdcoefs) {
lmo$coefficients <- cbind(lmo$coefficients, stdcoefs.v )
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <- "stdcoefs"
}
if (newey.west>=0) {
lmo$coefficients <- cbind(lmo$coefficients, nw.se)
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <-
paste0("se.nw(",newey.west,")")
lmo$coefficients <- cbind(lmo$coefficients, lmo$coefficients[,1]/nw.se)
colnames(lmo$coefficients)[ncol(lmo$coefficients)] <-
paste0("T.nw(",newey.west,")")
}
lmo
}
attr(lm, "original") <- stats:::lm