model.frame() call from inside a function (PR#3671) - R-devel

Wed, Aug 6, 2003 5:41 PM

R version: 1.7.1
OS: Red Hat Linux 7.2

Hi all,

The formula object in model.frame() is not retrieved properly when 
model.frame() is called from within a function and the "subset" argument 
is supplied.

foo <- function(formula,data,subset=NULL)
{
  cat("\n*****Does formula[-3] == ~y ?**** TRUE *****\n")
  print(formula[-3] == ~y)

  cat("\n*****Result of model.frame() using formula[-3]**** FAIL *****\n")
  print(try(model.frame(formula[-3],data=data,subset=subset)))

  cat("\n*****Result of model.frame() using ~y**** WORKS *****\n")
  print(try(model.frame(~y,data=data,subset=subset)))
}
dat <- data.frame(y=c(5,25))
foo(y~1,dat)

Curiously, if the "subset" argument is removed from the call to 
model.frame(), then the execution is successful in both cases.

In ?model.frame, one can read:
     Variables in the formula, `subset' and in `...' are looked for
     first in `data' and then in the environment of `formula': see the
     help for `formula()' for further details.

However, replacing the line
    subset <- eval(substitute(subset), data, env)
by
    subset <- eval(substitute(subset), data, environment())
in model.frame.default() fixes this problem. I don't know if this 
correction would create more problems in other cases. Perhaps there is a 
better fix.
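
For readers following along, here is a minimal sketch of the two lookups being compared. It does not call model.frame.default() itself; `lookup_subset` and `caller` are hypothetical helpers written only to mimic the two eval() lines quoted above, on the assumption that no variable named `subset` exists at top level.

```r
# Hypothetical stand-in for the relevant lines of model.frame.default().
lookup_subset <- function(formula, data, subset = NULL) {
  env <- environment(formula)
  list(
    # Current behaviour: look in `data`, then in the formula's environment.
    via_formula_env = eval(substitute(subset), data, env),
    # Proposed change: look in `data`, then in this function's own frame.
    via_own_frame   = eval(substitute(subset), data, environment())
  )
}

# Hypothetical wrapper playing the role of foo() in the report.
caller <- function(formula, data, subset = NULL) {
  lookup_subset(formula[-3], data, subset = subset)
}

dat <- data.frame(y = c(5, 25))
res <- caller(y ~ 1, dat)

is.function(res$via_formula_env)  # TRUE: the symbol resolves to the function subset()
is.null(res$via_own_frame)        # TRUE: the caller's NULL is found
```

The first lookup never sees the wrapper's `subset` argument, because the formula was created at top level and carries that environment with it.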

Sincerely,
Jerome Asselin

Jerome Asselin (Jérôme), Statistical Analyst
British Columbia Centre for Excellence in HIV/AIDS
St. Paul's Hospital, 608 - 1081 Burrard Street
Vancouver, British Columbia, CANADA V6Z 1Y6
Email: jerome@hivnet.ubc.ca
Phone: 604 806-9112   Fax: 604 806-9044

Peter Dalgaard

Thu, Aug 7, 2003 3:07 AM

jerome@hivnet.ubc.ca writes:

There is really nothing to fix, at least if you go by the rule that it
is only a bug if it behaves contrary to documentation:

There is no "subset" in the environment of "formula", nor in the
"data". If you put one there, the error goes away:

*****Does formula[-3] == ~y ?**** TRUE *****
[1] TRUE

*****Result of model.frame() using formula[-3]**** FAIL *****
   y
1  5
2 25

*****Result of model.frame() using ~y**** WORKS *****
  y
1 5

However, notice that it is not the same subset. 

There's a whole area of similar nastiness grouped under the heading of
"nonstandard evaluation rules". The basic issue is that you will often
assume that the variables used for subsetting come from the same
place as those in the model, e.g. in lm(fat~age,subset=sex=="male").
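
A small sketch of that usual case, with made-up data (the data frame `d` and its values are hypothetical, not taken from this thread). The subsetting variable `sex` is found in `data`, alongside the model variables:

```r
# Hypothetical data, for illustration only.
d <- data.frame(
  fat = c(20, 25, 30, 35),
  age = c(30, 40, 50, 60),
  sex = c("male", "female", "male", "female")
)

# `sex` is not defined anywhere except as a column of `d`;
# model.frame() evaluates the subset expression inside the data frame.
mf <- model.frame(fat ~ age, data = d, subset = sex == "male")

nrow(mf)  # 2
mf$age    # 30 50
```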

The problem is that it gets really awkward when a function wants to
compute the subset variable and combine it with a formula passed as an
argument. And it only gets worse when arguments can be both scalar and
vector, e.g.

plot(fat~age, col=as.numeric(sex))
function(mycolor="green") plot(fat~age, col=mycolor)

We have discussed changing this on several occasions, e.g. by
requiring that arguments that need to be evaluated in the formula
environment or the data frame should be either model formulas
themselves or quoted expressions. However, that would break S-PLUS
compatibility and also a large body of existing analysis code.

[[ I did discover yesterday (or maybe I was just reminded...) that we
even have nonstandard nonstandard evaluation rules in some places
(nls() seems to evaluate its model formula in the global environment
even if it is given explicitly within a function):

  f <- function() {
    g <- function(a,x) exp(-a*x)
    nls(y~g(a,x),start=list(a=.1))
  }
  x <- 1:10
  y <- exp(-.12*x)+rnorm(10,sd=.001)
  f()
  Error in eval(expr, envir, enclos) : couldn't find function "g"

Argh...]]

O__  ---- Peter Dalgaard             Blegdamsvej 3  
  c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
 (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard@biostat.ku.dk)             FAX: (+45) 35327907

Saikat DebRoy

Thu, Aug 7, 2003 9:08 AM

On Thursday, Aug 7, 2003, at 04:13 US/Eastern, Peter Dalgaard BSA wrote:

Given that I am, for better or worse, responsible for a large portion 
of the code in nls, I should make it clear that I did not understand 
the nonstandard evaluation rules in those days, so any nonstandard 
nonstandard rule used there is a bug. Now that I understand these 
things a little, I can see that nls does a few things wrong. I think 
the following patch mostly fixes them.


-------------- next part --------------

Thomas Lumley

Thu, Aug 7, 2003 9:14 AM

On 7 Aug 2003, Peter Dalgaard BSA wrote:

This is the same phenomenon that is documented for lattice graphics and
for lme in my notes on nonstandard evaluation rules.  I think it *is* a
bug.

	-thomas

Saikat DebRoy

Thu, Aug 7, 2003 10:08 AM

On Thursday, Aug 7, 2003, at 10:14 US/Eastern, Thomas Lumley wrote:

I think this was fixed in lattice a few months ago - from the 
ChangeLog, on March 3rd.

Jerome Asselin

Thu, Aug 7, 2003 1:17 PM

Thanks for your reply and discussion on the issue. See below for another 
suggestion of a fix.

I have spent some time trying to find a fix which would still work as 
documented:

The problem is that the expression environment(formula) in 
model.frame.default() gives the value:
(1) <environment: R_GlobalEnv> for the call 
model.frame(formula[-3],data=data,subset=subset) ;
(2) <environment: 0x883d288> (or something similar) for the call
model.frame(~y,data=data,subset=subset) .

In case (1), eval(subset, data, env) in model.frame.default() gives the 
subset() function which leads to an error.
In case (2), it gives the correct value for subset (i.e., NULL in the 
example of my original message).

I wonder why the environment is not the same for both cases. Don't you? 
Perhaps this is where the real problem is, but my current understanding of 
environment() is too limited to make such a claim.
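
The difference can be reproduced outside model.frame(). A formula records the environment in which it was created, and `[` on a formula keeps that environment. The sketch below uses a hypothetical helper, `show_envs`, to compare the two cases:

```r
# Hypothetical helper: report the environments involved in cases (1) and (2).
show_envs <- function(formula) {
  list(
    passed = environment(formula[-3]),  # inherited from the caller's formula
    local  = environment(~y),           # created here, so it is this frame
    frame  = environment()              # the frame of show_envs() itself
  )
}

f <- y ~ 1          # created at top level
e <- show_envs(f)

identical(e$passed, environment(f))  # TRUE: formula[-3] keeps the original environment
identical(e$local, e$frame)          # TRUE: ~y belongs to the function's frame
identical(e$passed, e$local)         # FALSE: hence the different lookups
```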

I suggest here another fix which I hope respects the documentation. In 
model.frame.default(), add the line
    formula <- formula(deparse(formula))
just before the line
    env <- environment(formula)
This change will affect the value of environment(formula).

If you make the correction and run the code below, then it should work 
successfully. The question is whether this change still respects the 
documentation. Personally, I think this is safe, because the expression 
eval(subset, data, env) is still evaluated in the environment of 
`formula', despite the fact that this environment has changed.
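
A rough sketch of what the added line does, again without touching model.frame.default() itself. `mimic` and `caller` are hypothetical helpers; `mimic` only reproduces the proposed line and the subsequent lookup:

```r
# Hypothetical stand-in for model.frame.default() with the proposed line added.
mimic <- function(formula, data, subset = NULL) {
  formula <- formula(deparse(formula))  # the proposed line
  env <- environment(formula)           # now the frame of mimic()
  list(
    same_frame = identical(env, environment()),
    subset     = eval(substitute(subset), data, env)
  )
}

# Hypothetical wrapper playing the role of foo().
caller <- function(formula, data, subset = NULL) {
  mimic(formula[-3], data, subset = subset)
}

dat <- data.frame(y = c(5, 25))
res_default <- caller(y ~ 1, dat)
res_one     <- caller(y ~ 1, dat, subset = 1)

res_default$same_frame       # TRUE
is.null(res_default$subset)  # TRUE
res_one$subset               # 1
```

One consequence worth weighing: after the rebuild, the formula no longer carries the environment in which the user created it, so objects that exist only there can no longer be reached through environment(formula).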

Sincerely,
Jerome Asselin

foo <- function(formula,data,subset=NULL)
{
  cat("\n*****Does formula[-3] == ~y ?**** TRUE *****\n")
  print(formula[-3] == ~y)

  cat("\n*****Result of model.frame() using formula[-3]**** FAIL *****\n")
  print(try(model.frame(formula[-3],data=data,subset=subset)))

  cat("\n*****Result of model.frame() using ~y**** WORKS *****\n")
  print(try(model.frame(~y,data=data,subset=subset)))
}
dat <- data.frame(y=c(5,25))
foo(y~1,dat)
foo(y~1,dat,subset=1)

####Results after making the correction###

*****Does formula[-3] == ~y ?**** TRUE *****
[1] TRUE

*****Result of model.frame() using formula[-3]**** FAIL *****
   y
1  5
2 25

*****Result of model.frame() using ~y**** WORKS *****
   y
1  5
2 25

*****Does formula[-3] == ~y ?**** TRUE *****
[1] TRUE

*****Result of model.frame() using formula[-3]**** FAIL *****
  y
1 5

*****Result of model.frame() using ~y**** WORKS *****
  y
1 5