Skip to content

[Rcpp-devel] Evaluating Formula's in Rcpp

3 messages · Hoffman, Gabriel, Dirk Eddelbuettel

#
Hi,
     I any trying to develop an R function for running 1000's of 
regressions very fast.  I will omit the technical reasons for this, but 
I would like to write code to perform the following:

for(j in 1:ncol(X) ){
     fit = myRegression( y ~ age:X[,j] )
}

This uses R's convenient 'formula' functionality to evaluate the 
interaction term in the regression.

The issue is that 'myRegression' is very complicated, high overhead, and 
takes over arguments which I have omitted for simplicity. Therefore, I 
would like to pass the formula "y ~ age:X[,j]" into a Rcpp function, and 
construct the relevant matrices in C++ using Rcpp::Environment, and 
Rcpp::Language, where I change the value of j each time.  Because, this 
would require only one entry into my C++ code, I would not have to incur 
the overhead each time.  I would like to run my analysis with a call like:

# return p-values from fitting ncol(X) regressions
myRegressionWrapper( y ~ age:X[,j], data=X)

or something like this.

Essentially I would to have the nice functionality of lm() in Rcpp to 
evaluate:

mf <- match.call(expand.dots = FALSE)
m <- match(c("formula", "data", "subset", "weights", "na.action", 
"offset"), names(mf), 0L)
mf <- mf[c(1L, m)]
mf$drop.unused.levels <- TRUE
mf[[1L]] <- as.name("model.frame")
mf <- eval(mf, parent.frame())
y <- model.response(mf, "numeric")
mt <- attr(mf, "terms")
X <- model.matrix(mt, mf, contrasts)

so I can run my custom regression function on y and X quickly for each 
value of j.

Do you know how to implement the this functionality in Rcpp or through 
some other method?

Thanks,

Gabriel Hoffman, PhD
Biological Statistics and Computational Biology
Cornell University
#
Hi Gabriel,
On 3 May 2013 at 22:04, Gabriel Hoffman wrote:
| Hi,
|      I any trying to develop an R function for running 1000's of 
| regressions very fast.  I will omit the technical reasons for this, but 
| I would like to write code to perform the following:
| 
| for(j in 1:ncol(X) ){
|      fit = myRegression( y ~ age:X[,j] )
| }
| 
| This uses R's convenient 'formula' functionality to evaluate the 
| interaction term in the regression.

Interesting. 

I am not sure you can.  You probably have to look at the code for formula(),
model.matrix(), ... and redo it. Which will be a royal pain.
 
| The issue is that 'myRegression' is very complicated, high overhead, and 
| takes over arguments which I have omitted for simplicity. Therefore, I 

Formulae evaluation is _very_ expensive.  With the various version of
fastLm() that we wrote over the years, I think I do have a "full" benchmark
somewhere--maybe in the RcppArmadillo package example. [ If you can't find it
it is easy to recreate, just calling benchmark() or microbenchmark(). ] The
gist of it is that a) fastLm() is fast when you use X matrix and y vector,
to call fastLm.default() and b) fastLm() is a lot slower when you use the
formula interface -- as R code parses the formula.

| would like to pass the formula "y ~ age:X[,j]" into a Rcpp function, and 
| construct the relevant matrices in C++ using Rcpp::Environment, and 
| Rcpp::Language, where I change the value of j each time.  Because, this 
| would require only one entry into my C++ code, I would not have to incur 
| the overhead each time.  I would like to run my analysis with a call like:
| 
| # return p-values from fitting ncol(X) regressions
| myRegressionWrapper( y ~ age:X[,j], data=X)
| 
| or something like this.
| 
| Essentially I would to have the nice functionality of lm() in Rcpp to 
| evaluate:
| 
| mf <- match.call(expand.dots = FALSE)
| m <- match(c("formula", "data", "subset", "weights", "na.action", 
| "offset"), names(mf), 0L)
| mf <- mf[c(1L, m)]
| mf$drop.unused.levels <- TRUE
| mf[[1L]] <- as.name("model.frame")
| mf <- eval(mf, parent.frame())
| y <- model.response(mf, "numeric")
| mt <- attr(mf, "terms")
| X <- model.matrix(mt, mf, contrasts)
| 
| so I can run my custom regression function on y and X quickly for each 
| value of j.
| 
| Do you know how to implement the this functionality in Rcpp or through 
| some other method?

I do not know of a method, which is why my implementation of fastLm, when
using a formula interface, it still slow as R code does the formula parsing
work. 

So, but the "No Free Lunch Theorem" wins again.

Dirk
#
Gabriel,
On 3 May 2013 at 21:56, Dirk Eddelbuettel wrote:
| On 3 May 2013 at 22:04, Gabriel Hoffman wrote:
| | Hi,
| |      I any trying to develop an R function for running 1000's of 
| | regressions very fast.  I will omit the technical reasons for this, but 
| | I would like to write code to perform the following:
| | 
| | for(j in 1:ncol(X) ){
| |      fit = myRegression( y ~ age:X[,j] )
| | }
| | 
| | This uses R's convenient 'formula' functionality to evaluate the 
| | interaction term in the regression.
| 
| Interesting. 
| 
| I am not sure you can.  You probably have to look at the code for formula(),
| model.matrix(), ... and redo it. Which will be a royal pain.

One alternative, in case you really just need the age:X[,j] term (and an
intercept), is to create it on the fly.

Create a new C++ matrix 'X' of a column of ones and the product of age and
X[,j], then submit it to fastLm as X along with a proper y.  That will be
very fast as you bypass formula evaluation, plus you access the faster
regression code.

Just be careful and check your result to make sure lm() and the formula
interface agree.

Dirk