Hi,
I any trying to develop an R function for running 1000's of
regressions very fast. I will omit the technical reasons for this, but
I would like to write code to perform the following:
for(j in 1:ncol(X) ){
fit = myRegression( y ~ age:X[,j] )
}
This uses R's convenient 'formula' functionality to evaluate the
interaction term in the regression.
The issue is that 'myRegression' is very complicated, high overhead, and
takes over arguments which I have omitted for simplicity. Therefore, I
would like to pass the formula "y ~ age:X[,j]" into a Rcpp function, and
construct the relevant matrices in C++ using Rcpp::Environment, and
Rcpp::Language, where I change the value of j each time. Because, this
would require only one entry into my C++ code, I would not have to incur
the overhead each time. I would like to run my analysis with a call like:
# return p-values from fitting ncol(X) regressions
myRegressionWrapper( y ~ age:X[,j], data=X)
or something like this.
Essentially I would to have the nice functionality of lm() in Rcpp to
evaluate:
mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action",
"offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- as.name("model.frame")
mf <- eval(mf, parent.frame())
y <- model.response(mf, "numeric")
mt <- attr(mf, "terms")
X <- model.matrix(mt, mf, contrasts)
so I can run my custom regression function on y and X quickly for each
value of j.
Do you know how to implement the this functionality in Rcpp or through
some other method?
Thanks,
Gabriel Hoffman, PhD
Biological Statistics and Computational Biology
Cornell University
[Rcpp-devel] Evaluating Formula's in Rcpp
3 messages · Hoffman, Gabriel, Dirk Eddelbuettel
Hi Gabriel,
On 3 May 2013 at 22:04, Gabriel Hoffman wrote:
| Hi,
| I any trying to develop an R function for running 1000's of
| regressions very fast. I will omit the technical reasons for this, but
| I would like to write code to perform the following:
|
| for(j in 1:ncol(X) ){
| fit = myRegression( y ~ age:X[,j] )
| }
|
| This uses R's convenient 'formula' functionality to evaluate the
| interaction term in the regression.
Interesting.
I am not sure you can. You probably have to look at the code for formula(),
model.matrix(), ... and redo it. Which will be a royal pain.
| The issue is that 'myRegression' is very complicated, high overhead, and
| takes over arguments which I have omitted for simplicity. Therefore, I
Formulae evaluation is _very_ expensive. With the various version of
fastLm() that we wrote over the years, I think I do have a "full" benchmark
somewhere--maybe in the RcppArmadillo package example. [ If you can't find it
it is easy to recreate, just calling benchmark() or microbenchmark(). ] The
gist of it is that a) fastLm() is fast when you use X matrix and y vector,
to call fastLm.default() and b) fastLm() is a lot slower when you use the
formula interface -- as R code parses the formula.
| would like to pass the formula "y ~ age:X[,j]" into a Rcpp function, and
| construct the relevant matrices in C++ using Rcpp::Environment, and
| Rcpp::Language, where I change the value of j each time. Because, this
| would require only one entry into my C++ code, I would not have to incur
| the overhead each time. I would like to run my analysis with a call like:
|
| # return p-values from fitting ncol(X) regressions
| myRegressionWrapper( y ~ age:X[,j], data=X)
|
| or something like this.
|
| Essentially I would to have the nice functionality of lm() in Rcpp to
| evaluate:
|
| mf <- match.call(expand.dots = FALSE)
| m <- match(c("formula", "data", "subset", "weights", "na.action",
| "offset"), names(mf), 0L)
| mf <- mf[c(1L, m)]
| mf$drop.unused.levels <- TRUE
| mf[[1L]] <- as.name("model.frame")
| mf <- eval(mf, parent.frame())
| y <- model.response(mf, "numeric")
| mt <- attr(mf, "terms")
| X <- model.matrix(mt, mf, contrasts)
|
| so I can run my custom regression function on y and X quickly for each
| value of j.
|
| Do you know how to implement the this functionality in Rcpp or through
| some other method?
I do not know of a method, which is why my implementation of fastLm, when
using a formula interface, it still slow as R code does the formula parsing
work.
So, but the "No Free Lunch Theorem" wins again.
Dirk
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com
Gabriel,
On 3 May 2013 at 21:56, Dirk Eddelbuettel wrote:
| On 3 May 2013 at 22:04, Gabriel Hoffman wrote:
| | Hi,
| | I any trying to develop an R function for running 1000's of
| | regressions very fast. I will omit the technical reasons for this, but
| | I would like to write code to perform the following:
| |
| | for(j in 1:ncol(X) ){
| | fit = myRegression( y ~ age:X[,j] )
| | }
| |
| | This uses R's convenient 'formula' functionality to evaluate the
| | interaction term in the regression.
|
| Interesting.
|
| I am not sure you can. You probably have to look at the code for formula(),
| model.matrix(), ... and redo it. Which will be a royal pain.
One alternative, in case you really just need the age:X[,j] term (and an
intercept), is to create it on the fly.
Create a new C++ matrix 'X' of a column of ones and the product of age and
X[,j], then submit it to fastLm as X along with a proper y. That will be
very fast as you bypass formula evaluation, plus you access the faster
regression code.
Just be careful and check your result to make sure lm() and the formula
interface agree.
Dirk
Dirk Eddelbuettel | edd at debian.org | http://dirk.eddelbuettel.com