
Most efficient way to check the length of a variable mentioned in a formula.

9 messages · Gabriel Becker, William Dunlap, Joris Meys +1 more

#
Dear R gurus,

I need to know the length of a variable (let's call that X) that is
mentioned in a formula. So obviously I look for the environment from which
the formula is called and then I have two options:

- using eval(parse(text='length(X)'),
                    envir=environment(formula) )

- using length(get('X',
            envir=environment(formula) ))

A bit of benchmarking showed that the first option is about 20 times
slower, but even when repeated 10,000 times that costs only a little more
than half a second. So speed is not really an issue here.
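
For readers who want to reproduce the comparison, here is a minimal sketch. The names X, form and fenv are assumptions made for illustration, and the timings will differ between machines and R versions, so none are quoted here.

```r
# Hypothetical setup: a formula created where X is visible
X <- 1:10
form <- ~ X
fenv <- environment(form)

# Option 1: parse a string and evaluate it in the formula's environment
n1 <- eval(parse(text = "length(X)"), envir = fenv)

# Option 2: look the variable up by name, then take its length here
n2 <- length(get("X", envir = fenv))

# Both options agree on the result
stopifnot(identical(n1, n2))

# Rough timing of 10,000 repetitions of each option
t1 <- system.time(for (i in 1:10000) eval(parse(text = "length(X)"), envir = fenv))
t2 <- system.time(for (i in 1:10000) length(get("X", envir = fenv)))
print(t1)
print(t2)
```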

Personally I'd go for option 2, as that one is easier to read and does the
job nicely, but with these functions I'm always a bit afraid that I'm
overlooking important details or side effects (possibly memory issues
when working with larger data).

Does anybody have an idea what the dangers of these methods are, and which
one is the most robust?

Thank you
Joris
#
Joris,

For me,

length(environment(form)[["x"]])

was about twice as fast as

length(get("x",environment(form)))

in the year-old version of R (3.0.2) that I have on the virtual machine I'm
currently using.

As for you, the eval method was much slower (though my factor was much
larger than 20). The timings below are for the [[ version, the get()
version and the eval() version, in that order:

   user  system elapsed
  0.018   0.000   0.018
... replicate(10000,length(get("x",environment(form))))})
   user  system elapsed
  0.031   0.000   0.033
... envir=environment(form)))})
   user  system elapsed
  4.528   0.003   4.656

I can't speak this second to whether this pattern will hold in the more
modern versions of R I typically use.

~G
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

On Fri, Oct 17, 2014 at 11:04 AM, Joris Meys <jorismeys at gmail.com> wrote:
#
I would use eval(), but I think that most formula-using functions do
it more like the following.

getRHSLength <-
function (formula, data = parent.frame())
{
    rhsExpr <- formula[[length(formula)]]
    rhsValue <- eval(rhsExpr, envir = data, enclos = environment(formula))
    length(rhsValue)
}

* use eval() instead of get() so you will find variables that are in
  ancestral environments of envir (if envir is an environment), not just
  in envir itself.
* evaluate only the expression from the formula in the non-standard
  evaluation frame and call length() in the current frame.  Otherwise, if
  envir inherits directly from emptyenv(), the 'length' function will not
  be found.
* use envir=data so it looks first in the data argument for variables.
* the enclos argument is used only if envir is not an environment; it is
  where variables that are not in envir are looked up.

Here are some examples:
  > X <- 1:10
  > getRHSLength(~X)
  [1] 10
  > getRHSLength(~X, data=data.frame(X=1:2))
  [1] 2
  > getRHSLength((function(){X <- 1:4; ~X})(), data=data.frame())
  [1] 4
  > getRHSLength((function(){X <- 1:4; ~X})(), data=data.frame(X=1:2))
  [1] 2
  > getRHSLength((function(){X <- 1:4; ~X})(), data=list2env(data.frame()))
  [1] 10
  > getRHSLength((function(){X <- 1:4; ~X})(), data=emptyenv())
  Error in eval(expr, envir, enclos) : object 'X' not found

I think you will see the same lookups if you try analogous things with lm().
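
The point about emptyenv() can be illustrated with a minimal sketch. The environment name bare and its contents are assumptions made for illustration only.

```r
# An environment that holds only X and has no access to base functions
bare <- new.env(parent = emptyenv())
assign("X", 1:4, envir = bare)

# Evaluating the whole call inside 'bare' fails, because the function
# 'length' cannot be found from there; the error message is captured
res <- tryCatch(eval(quote(length(X)), envir = bare),
                error = function(e) conditionMessage(e))
print(res)

# Evaluating only the variable in 'bare' and calling length() in the
# current frame works
n <- length(eval(quote(X), envir = bare))
print(n)
```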
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Fri, Oct 17, 2014 at 11:04 AM, Joris Meys <jorismeys at gmail.com> wrote:
#
I got the default value for getRHSLength's data argument wrong - it
should be NULL, not parent.frame().
   getRHSLength <- function (formula, data = NULL)
   {
       rhsExpr <- formula[[length(formula)]]
       rhsValue <- eval(rhsExpr, envir = data, enclos = environment(formula))
       length(rhsValue)
   }
so that the function firstHalf is found in the following:
   > X <- 1:10
   > getRHSLength((function(){firstHalf<-function(x)x[seq_len(floor(length(x)/2))]; ~firstHalf(X)})())
   [1] 5


Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Fri, Oct 17, 2014 at 11:57 AM, William Dunlap <wdunlap at tibco.com> wrote:
#
Thank you both, great ideas.  William, I see the point of using eval, but
the problem is that I can't evaluate the formula itself yet. I need to know
the length of these variables to create a function that is then used in the
evaluation. So if I try to evaluate the formula in some way before I have
created that function, it will just return an error.

Now I use the 'variables' attribute of the formula's terms to get the
variables that, after some more manipulation, will eventually become the
model matrix. Something like this:

afun <- function(formula, ...){

    varnames <- all.vars(formula)
    fenv <- environment(formula)

    txt <- paste('length(',varnames[1],')')
    n <- eval(parse(text=txt), envir=fenv)

    fun <- function(x) x/n

    myterms <- terms(formula)
    eval(attr(myterms, 'variables'))

}

And that should give:
[[1]]
 [1] 11 12 13 14 15 16 17 18 19 20

[[2]]
 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

[[3]]
 [1] 10  9  8  7  6  5  4  3  2  1

It might be I'm walking to Paris over Singapore, but I couldn't find a
better way to do it.

Cheers
Joris
On Fri, Oct 17, 2014 at 10:16 PM, William Dunlap <wdunlap at tibco.com> wrote:
#
In my example function I did not evaluate the formula either, just a part of it.

If you leave off the envir and enclos arguments to eval in your
function you can get surprising (wrong) results.  E.g.,
  > afun(y ~ varnames)
  [[1]]
   [1] 10  9  8  7  6  5  4  3  2  1

  [[2]]
  [1] "y"        "varnames"

If you want to use the variables in data or environment(formula) and
some functions defined in your function, then you could make a child
environment of environment(formula), put your locally defined
functions in it, and use the child environment in the call to eval.
E.g., your code would become:
afun2 <- function(formula, ...){

    varnames <- all.vars(formula)
    fenv <- environment(formula)

    n <- length(eval(as.name(varnames[1]), envir=fenv))
    childEnv <- new.env(parent=fenv)
    childEnv$fun <- function(x) x/n

    myterms <- terms(formula)
    eval(attr(myterms, 'variables'), envir=childEnv)
}
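
A usage sketch of the child-environment approach follows. The helper name evalTerms, the data Y, X and Z, and the formula Y ~ fun(X) + Z are assumptions; the data were chosen so that the result matches the output shown earlier in the thread, but the formula actually used there is not given.

```r
# Same approach as afun2 above, condensed; evalTerms is a hypothetical name
evalTerms <- function(formula) {
    fenv <- environment(formula)
    # Length of the first variable, looked up via the formula's environment
    n <- length(eval(as.name(all.vars(formula)[1]), envir = fenv))
    # Child environment: 'fun' lives here, everything else is found via fenv
    childEnv <- new.env(parent = fenv)
    childEnv$fun <- function(x) x / n
    eval(attr(terms(formula), "variables"), envir = childEnv)
}

# Assumed example data
Y <- 11:20; X <- 1:10; Z <- 10:1
res <- evalTerms(Y ~ fun(X) + Z)
str(res)

# A variable that shares its name with a local of the original afun
# is now taken from the formula's environment, not from the function frame
varnames <- c("a", "b", "c")
y <- 1:3
res2 <- evalTerms(y ~ varnames)
str(res2)
```

Because the evaluation happens in childEnv rather than in the function's own frame, locals such as varnames inside the function can no longer shadow variables named in the formula.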

Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Fri, Oct 17, 2014 at 1:50 PM, Joris Meys <jorismeys at gmail.com> wrote:
#
Thanks again William, I owe you one!
Cheers
Joris
On Fri, Oct 17, 2014 at 11:36 PM, William Dunlap <wdunlap at tibco.com> wrote:
2 days later
#
On 17/10/2014, 2:23 PM, Gabriel Becker wrote:
environment(form)[["x"]] and get("x", environment(form)) are different:
get() will look in parent environments, but indexing an environment won't.
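
A minimal sketch of that difference, assuming a formula created inside a function while X lives at top level (the names X, form and fenv are illustrative):

```r
X <- 1:10                      # defined at top level
form <- (function() ~ X)()     # formula whose environment is a function frame
fenv <- environment(form)

# X is not stored in fenv itself, only in an enclosing environment
inFrame <- exists("X", envir = fenv, inherits = FALSE)

# Indexing looks only in fenv, so it returns NULL, which has length 0
byIndex <- length(fenv[["X"]])

# get() walks up the enclosing environments and finds X at top level
byGet <- length(get("X", envir = fenv))

print(c(inFrame = inFrame, byIndex = byIndex, byGet = byGet))
```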

For the original question:  you really have no guarantee that the
length() function will do what you want if you evaluate it in an
environment set by the user, so the approach with get is more robust.
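
A minimal sketch of how this can go wrong, assuming a hypothetical environment userEnv that happens to define its own function called length:

```r
# Hypothetical user environment that masks 'length'
userEnv <- new.env()
userEnv$X <- 1:10
userEnv$length <- function(x) "not a length"

form <- ~ X
environment(form) <- userEnv

# The whole call is evaluated in userEnv, so the user's 'length' is used
viaEval <- eval(parse(text = "length(X)"), envir = environment(form))

# Only X is looked up in userEnv; length() is resolved in the calling code
viaGet <- length(get("X", envir = environment(form)))

print(viaEval)
print(viaGet)
```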

Duncan Murdoch
#
Hi Duncan,

thanks for your reaction. I don't completely follow, though, what you mean
by "no guarantee that the length() function will do what I want if I
evaluate it in an environment set by the user". I wasn't intending to give
the user the opportunity to set those environments, but is there something
I'm overlooking there?

Cheers
Joris

On Tue, Oct 21, 2014 at 10:17 AM, Duncan Murdoch <murdoch.duncan at gmail.com>
wrote: