finding a faster way to run lm on rows of predictor matrix
Hi,

If you are doing repeated fits in this specialized case, you can benefit from skipping some of the fancy overhead and going straight to lm.fit.

## what you had
data.residuals <- t(apply(predictors, 1, function(x)
    lm(regress.y ~ -1 + as.vector(x))$residuals))

## passing the matrices directly to lm.fit (no need to augment the X
## matrix, because you do not want the intercept anyway)
data.residuals2 <- t(sapply(1:6000, function(i)
    lm.fit(x = t(predictors[i, , drop = FALSE]), y = regress.y)$residuals))

On my little laptop:
system.time(data.residuals <- t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals))))
   user  system elapsed
  42.84    0.11   45.01
system.time(data.residuals2 <- t(sapply(1:6000, function(i) lm.fit(x = t(predictors[i, , drop = FALSE]), y = regress.y)$residuals)))
   user  system elapsed
   3.76    0.00    3.82

If those timing ratios hold on your system, you should be looking at around 1.2 seconds per run, which suggests it will take around 2 hours to complete 5,000 runs. Note that this is the sort of task that can be readily parallelized, so if you are on a multicore machine, you could take advantage of that without too much trouble.

Cheers,

Josh
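For example, here is a minimal sketch of parallelizing the row-wise lm.fit calls with the base parallel package (the data are simulated to mirror the example in this thread; note that mclapply forks, so on Windows it only works with mc.cores = 1):

```r
library(parallel)

## toy data matching the dimensions in this thread
set.seed(1)
regress.y  <- rnorm(150)
predictors <- matrix(rnorm(6000 * 150), nrow = 6000, ncol = 150)

## mclapply forks on Unix-alikes; on Windows it must run with mc.cores = 1
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

## fit each row's regression (no intercept) in parallel and keep residuals
res.list <- mclapply(seq_len(nrow(predictors)), function(i)
    lm.fit(x = t(predictors[i, , drop = FALSE]), y = regress.y)$residuals,
    mc.cores = n.cores)

data.residuals.par <- do.call(rbind, res.list)
```

With more cores, mc.cores can simply be raised; the per-row fits are independent, so the speedup should be close to linear until memory bandwidth dominates.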
On Fri, Jul 29, 2011 at 8:30 AM, <cypark at princeton.edu> wrote:
Hi, everyone.
I need to run lm with the same response vector but with varying predictor vectors (i.e., one response vector regressed on each of 6,000 individual predictor vectors).
After looking through the R archives, I found three methods that have been suggested.
Unfortunately, I need to run this task many times (~5,000 times) and would like to find a faster way than the existing methods.
All three methods I have below take 13-20 seconds on my machine, timed with system.time.
The opposite direction, 6,000 response vectors on one predictor vector, is supported directly by lm and runs really fast (~0.5 seconds).
They are performing pretty much the same number of fits, so I was wondering if there is a faster way before I try to code this up in C++.
thanks!!
## sample data ###
regress.y = rnorm(150)
predictors = matrix(rnorm(6000*150), ncol=150, nrow=6000)
## method 1 ##
data.residuals = t(apply(predictors, 1, function(x)( lm(regress.y ~ -1 + as.vector(x))$residuals)))
   user  system elapsed
 15.076   0.048  15.128
## method 2 ##
data.residuals = matrix(0, nrow=nrow(predictors), ncol=ncol(predictors))
for( i in 1:nrow(predictors)){
    pred = as.vector(predictors[i,])
    data.residuals[i, ] = lm(regress.y ~ -1 + pred)$residuals
}
   user  system elapsed
 13.841   0.012  13.854
## method 3 ##
library(nlme)
all.data <- data.frame( y=rep(regress.y, nrow(predictors)), x=c(t(predictors)), g=gl(nrow(predictors), ncol(predictors)) )
all.fits <- lmList( y ~ x | g, data=all.data)
data.residuals = matrix( residuals(all.fits), nrow=nrow(predictors), ncol=ncol(predictors))
   user  system elapsed
 36.407   0.484  36.892
## the opposite direction, supported by lm ###
lm(t(predictors) ~ -1 + regress.y)$residuals
   user  system elapsed
  0.500   0.120   0.613
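If only the residuals are needed, the per-row fits here can be avoided entirely: each is a single-predictor regression through the origin, whose slope has the closed form beta_i = sum(x_i * y) / sum(x_i^2). A sketch of a fully vectorized version, using the same simulated data as above (data.residuals3 is an illustrative name, not from the thread):

```r
## same simulated data as in the examples above
set.seed(1)
regress.y  <- rnorm(150)
predictors <- matrix(rnorm(6000 * 150), nrow = 6000, ncol = 150)

## slope for each row: beta_i = sum(x_i * y) / sum(x_i^2)
beta <- as.vector(predictors %*% regress.y) / rowSums(predictors^2)

## residual matrix: row i is y - beta_i * x_i
## (beta has length nrow(predictors), so column-major recycling
## lines it up with the rows of predictors)
data.residuals3 <- matrix(regress.y, nrow = nrow(predictors),
                          ncol = ncol(predictors), byrow = TRUE) -
    beta * predictors
```

This replaces 6,000 model fits with two matrix operations, so it should run in a fraction of a second, though it only applies to this one-predictor, no-intercept case.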
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
https://joshuawiley.com/