how to control the environment of a formula

8 messages · Duncan Murdoch, Thomas Alexander Gerds, Gabor Grothendieck

#
Dear List

I have noticed that objects generated with one of my packages take up
a lot of space when saved to disk (object.size did not show this!).

Some debugging revealed that formula and call objects carry the full
environment of the subroutine along, including objects not needed by the
formula or call. Here is a sketch of the problem:

,----
| test <- function(x){
|   x <- rnorm(1000000)
|   out <- list()
|   out$f <- a~b
|   out
| }
| v <- test(1)
| save(v,file="~/tmp/v.rda")
| system("ls -lah ~/tmp/v.rda")
| 
| -rw-rw-r-- 1 tag tag 7,4M Apr 18 06:41 /home/tag/tmp/v.rda
`----

I tried replacing the right-hand side of the formula assignment
(line 3 of the function body) by

,----
| as.formula(a~b,env=emptyenv())
| or
| as.formula(a~b,env=NULL)
`----

without the desired effect. Instead adding either

,----
| environment(out$f) <- emptyenv()
| or
| environment(out$f) <- NULL
`----

has the desired effect (i.e. the saved object is much smaller).
Unfortunately there is a new problem:

,----
| test <- function(x){
|   x <- rnorm(1000000)
|   out <- list()
|   out$f <- a~b
|   environment(out$f) <- emptyenv()
|   out
| }
| d <- data.frame(a=1,b=1)
| v <- test(1)
| model.frame(v$f,data=d)
| 
| Error in eval(expr, envir, enclos) : could not find function "list"
`----

Same with NULL in place of emptyenv()
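
A minimal sketch of why this fails, assuming only base R: the empty environment has no parent, so not even base functions such as list() can be looked up from it, whereas a lookup starting from the global environment continues along the search path.

```r
# emptyenv() has no parent, so nothing can be found starting from it
found_in_empty <- exists("list", envir = emptyenv())
# from globalenv() the lookup continues along the search path to base
found_in_global <- exists("list", envir = globalenv())

f <- a ~ b
environment(f) <- emptyenv()
# in the thread this call fails with: could not find function "list"
res <- tryCatch(model.frame(f, data = data.frame(a = 1, b = 1)),
                error = function(e) conditionMessage(e))
```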

Finally using .GlobalEnv in place of emptyenv() seems to remove both problems.
My questions:

1)  why does the argument env of as.formula have no effect?
2)  is there a better way to tell formula not to copy unrelated stuff
    into the associated environment?
3)  why does object.size not show the size of the environments that
    formulas can carry along?
    
Regards
Thomas    


--
Thomas A. Gerds -- Assoc. Prof. Department of Biostatistics
University of Copenhagen, Øster Farimagsgade 5, 1014 Copenhagen, Denmark
Office: CSS-15.2.07 (Gamle Kommunehospital)
tel: 35327914 (sec: 35327901)
#
On 13-04-18 1:09 AM, Thomas Alexander Gerds wrote:
> Finally using .GlobalEnv in place of emptyenv() seems to remove both problems.

But it will cause other, less obvious problems.  In a formula, the 
symbols mean something.  By setting the environment to .GlobalEnv you're 
changing the meaning.  You'll get nonsense in certain cases when 
functions look up the meaning of those symbols and find the wrong thing. 
(I don't have an example at hand, but I imagine it would be easy to 
put one together with update().)
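
A small sketch of the kind of mix-up meant here, using model.frame() rather than update(). The helper make_formula and the variable values are made up for illustration. Note that the example deliberately creates a variable b in the global environment.

```r
make_formula <- function() {
  b <- c(10, 20, 30)   # local variable that the formula refers to
  a ~ b
}
f <- make_formula()
d <- data.frame(a = 1:3)   # note: no column b in the data

# b is not in d, so it is looked up in the formula's environment
mf_local <- model.frame(f, data = d)

# a different b that happens to live in the global environment
assign("b", c(-1, -2, -3), envir = globalenv())
environment(f) <- globalenv()
mf_global <- model.frame(f, data = d)

mf_local$b    # the b from inside make_formula()
mf_global$b   # the unrelated global b
```

The formula prints the same in both cases, but after the environment is swapped the symbol b silently refers to something else.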

> 1)  why does the argument env of as.formula have no effect?

Because the first argument already had an associated environment.  You 
passed a ~ b, which is evaluated to a formula; calling as.formula on a 
formula does nothing. The env argument is only used when a new formula 
needs to be constructed.  (You can see this in the source code; 
as.formula is a very simple function.)
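
This behaviour can be checked with a short sketch (the environment e is only an illustrative placeholder): a formula passed to as.formula() keeps its environment, while a character string is turned into a new formula that does receive env.

```r
f <- a ~ b
e <- new.env()

# already a formula: returned unchanged, so env is ignored
g <- as.formula(f, env = e)
same_env <- identical(environment(g), environment(f))

# character input: a new formula is constructed, so env is used
h <- as.formula("a ~ b", env = e)
uses_env <- identical(environment(h), e)
```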
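
> 2)  is there a better way to tell formula not to copy unrelated stuff
>     into the associated environment?

Yes, delete the unrelated stuff.  For example, you could write your function as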

  test <- function(x){
    x <- rnorm(1000000)
    out <- list()
    out$f <- a~b
    rm(x)
    out
  }

> 3)  why does object.size not show the size of the environments that
>     formulas can carry along?

Because many objects can share the same environment.  See ?object.size 
for more details.
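
A rough way to see the difference without writing a file is to compare object.size() with the length of the serialized object. This is only a sketch: it assumes that serialize() follows environments in the same way save() does, and it uses a smaller vector than the original example.

```r
test <- function(x) {
  x <- rnorm(1e5)     # about 800 KB of doubles
  out <- list()
  out$f <- a ~ b
  out
}
v <- test(1)

vars <- ls(environment(v$f))            # local variables of test()
mem_size <- as.numeric(object.size(v))  # environment is not counted
ser_size <- length(serialize(v, NULL))  # environment is included
```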

Duncan Murdoch
#
Dear Duncan 

thank you for taking the time to answer my questions! It will be quite
some work to delete all the objects generated inside the function
... but if there is no other way to avoid a large environment then this
is what I will do.

Cheers
Thomas

#
On 13-04-18 11:39 AM, Thomas Alexander Gerds wrote:
It's not really that hard.  Use names <- ls() in the function to get a 
list of all of them; remove the names of variables that might be needed 
in the formula (and the name of the formula itself); then use 
rm(list=names) to delete everything else just before returning it.
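
A sketch of this recipe, with a hypothetical intermediate variable tmp and the assumption that only out has to survive:

```r
test <- function(x) {
  x <- rnorm(1e5)
  tmp <- cumsum(x)       # hypothetical intermediate result
  out <- list()
  out$f <- a ~ b
  vars <- ls()           # everything defined so far: out, tmp, x
  # drop everything except out, plus the helper variable itself
  rm(list = c(setdiff(vars, "out"), "vars"))
  out
}
v <- test(1)
left <- ls(environment(v$f))   # only "out" remains
```

Note that out itself stays in the environment because it is still needed for the return value.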

Duncan Murdoch
#
hmm. I have tested a bit more and found a situation that is perhaps
more difficult to solve. Even though I delete x, the saved object is
twice as large as it should be, because x is also part of the function's
output:

test <- function(x){
  x <- rnorm(1000000)
  out <- list(x=x)
  rm(x)
  out$f <- as.formula(a~b)
  out
}
v <- test(1)
x <- rnorm(1000000)
save(v,file="~/tmp/v.rda")
save(x,file="~/tmp/x.rda")
system("ls -lah ~/tmp/*.rda")

-rw-rw-r-- 1 tag tag  15M Apr 19 20:52 /home/tag/tmp/v.rda
-rw-rw-r-- 1 tag tag 7,4M Apr 19 20:52 /home/tag/tmp/x.rda

can you solve this as well?

thanks!
thomas

#
On 13-04-19 2:57 PM, Thomas Alexander Gerds wrote:
Yes, this is tricky.  The problem is that "out" is in the environment of 
out$f, so you get two copies when you save it.  (I think you won't have 
two copies in memory, because R only makes a copy when it needs to, but 
I haven't traced this.)

Here are two solutions, both have some problems.

1.  Don't put out in the environment:

test <- function(x) {
   x <- rnorm(1000000)
   out <- list(x=x)  # create out here; it does not exist before this line
   out$f <- a ~ b    # the as.formula() was never needed
   # temporarily create a new environment
   local({
     # get a copy of what you want to keep
     out <- out
     # remove everything that you don't need from the formula
     rm(list=c("x", "out"), envir=environment(out$f))
     # return the local copy
     out
   })
}

I don't like this because it is too tricky, but you could probably wrap 
the tricky bits into a little function (a variant on return() that 
cleans out the environment first), so it's probably what I would use if 
I was desperate to save space in saved copies.
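
One possible shape for such a helper, as a sketch only: return_clean is a made-up name, and the helper simply removes every variable from the calling function's frame after the return value has been evaluated.

```r
# hypothetical helper: evaluate the value, then empty the caller's frame
return_clean <- function(value, env = parent.frame()) {
  force(value)   # evaluate before anything is removed
  rm(list = ls(envir = env, all.names = TRUE), envir = env)
  value
}

test <- function(x) {
  x <- rnorm(1e5)
  out <- list(x = x)
  out$f <- a ~ b
  return_clean(out)   # must be the last expression in the function
}
v <- test(1)
left <- ls(environment(v$f), all.names = TRUE)   # nothing remains
ser_size <- length(serialize(v, NULL))           # roughly one copy of x
```

The formula keeps its original environment, which is merely empty afterwards, so any variable the formula really needs from that environment would be lost as well.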

2. Never evaluate the formula in the first place, so it doesn't pick up 
the environment:

test <- function(x) {
   x <- rnorm(1000000)
   out <- list(x=x)  # create out here; it does not exist before this line
   out$f <- quote(a ~ b)
   out
}

This is a lot simpler, but it might not work with some modelling 
functions, which would be confused by receiving the model formula 
unevaluated.  It also has the problems that you get with using 
.GlobalEnv as the environment of the formula, but maybe to a slightly 
lesser extent:  rather than having what is possibly the wrong 
environment, it doesn't have one at all.
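
A short sketch of what the unevaluated formula looks like, assuming model.frame() accepts the unevaluated call and converts it internally:

```r
f <- quote(a ~ b)
cls <- class(f)                     # a call, not a formula
no_env <- is.null(environment(f))   # no environment attached

d <- data.frame(a = 1, b = 1)
mf <- model.frame(f, data = d)      # converted to a formula internally

g <- eval(f)   # evaluating yields a formula in the current environment
```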

Duncan Murdoch
#
On Sat, Apr 20, 2013 at 1:44 PM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
An approach along the lines of Duncan's last solution that works with
lm but may or may not work with other regression-style functions is to
use a character string:

fit <- lm("demand ~ Time", BOD)

As long as you are only saving the input you should be OK but if you
are saving the output of lm then you are back to the same problem
since the "lm" object will contain a formula.
[1] "formula"
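
A sketch that illustrates the caveat, using the BOD data set that ships with R: the fitted object contains a formula again, and that formula has an environment attached.

```r
fit <- lm("demand ~ Time", BOD)   # character string instead of a formula
f <- formula(fit)                 # formula stored in the fitted object
is_formula <- inherits(f, "formula")
has_env <- is.environment(environment(f))
```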
#
thanks. yes, I was considering using as.character(f), but your solution
2 is much better -- did not know quote was an R function as well. just
checked: model.frame does not get confused, and this is what all
functions in my packages use to evaluate the formula.

however, there could be related problems with memory. I noticed that
some of my processes use unexpectedly large amounts of memory. how can
one trace this?

I am not desperate to save disk space: the problem is that file transfer
and sharing (like dropbox) suffer when each simulation result fills 8M
instead of 130K just because a large data set is invisibly sitting in
the saved file.
