Skip to content
Prev 274096 / 398506 Next

SLOW split() function

I do not know if stripping down functions is generally recommended,
but it is not too difficult to do if you know that you can make
assumptions.  Here is an example (I also found a fast way to convert
the data table to a matrix, again if some assumptions can be made).
Using the stripped down function, you can get coefficients and
standard errors in less time than you can get just coefficients using
default lm.  It is hugely less flexible.

Cheers,

Josh

##########################################
library(data.table)

## stripped down lm and summary.lm (for standard errors)
minimal.lm <- function(y, x) {
  dims <- dim(x)
  x <- unlist(x, FALSE, FALSE)
  dim(x) <- dims
  obj <- lm.fit(x = x, y = y)
  resvar <- sum(obj$residuals^2)/obj$df.residual
  p <- obj$rank
  R <- .Call("La_chol2inv", x = obj$qr$qr[1L:p, 1L:p, drop = FALSE],
size = p, PACKAGE = "base")
  m <- min(dim(R))
  d <- c(R)[1L + 0L:(m - 1L) * (dim(R)[1L] + 1L)]
  se <- sqrt(d * resvar)
  cbind(coef = obj$coefficients, se)
}

N <- 1000*100
d <- data.table(data.frame( key= as.integer(runif(N, min=1,
max=N/10)), x=rnorm(N), y=rnorm(N) ))  # irregular
## add intercept column
d$int <- 1L
setkey(d, "key"); gc() ## sort and force a garbage collection

cat("N=", N, ".  Size of d=", object.size(d)/1024/1024, "MB\n")
print(system.time(si <- split(seq(nrow(d)), d$key)))

cat("\n\t(b) Regressions:\n")
## using lm
print(system.time(all.2b <- lapply(si, function(.indx) { coef(lm(y ~
x, data=d[.indx,])) })))
## using minimal.lm---faster and gives standard errors
print(system.time(all.2c <- lapply(si, function(.indx) { minimal.lm(y
= d[.indx, y], x = d[.indx, list(int, x)]) })))

#### Timings on my system ####
+ x, data=d[.indx,])) })))
   user  system elapsed
  67.87    0.01   68.46
user  system elapsed
  47.72    0.00   48.00

######################################################################
On Tue, Oct 11, 2011 at 8:56 AM, ivo welch <ivo.welch at gmail.com> wrote: