
'methods' and environments.

5 messages · Laurent Gautier, Luke Tierney, Henrik Bengtsson +1 more

#
Hi,

I am having quite a bit of trouble with the methods package.
Environments in R are a convenient way to emulate
pointers (and to avoid copies of large objects, or of
large collections of objects). So far, so good,
but the methods package is becoming more and more
problematic to work with. Up to version R-1.7.0,
slots that were environments were still references
to an environment, but I discovered in a recent
R-patched that this is no longer the case:
environments used as slots are now copied (increasing
memory consumption more than threefold in my case).
The simple example below demonstrates this excessive
duplication, which is now enforced, since environments
are copied too!
## RSS of the R process is about 150MB
         used (Mb) gc trigger  (Mb)
Ncells 364813  9.8     667722  17.9
Vcells  85605  0.7   14858185 113.4
## RSS is now about 15 MB
[1] "A"
## The RSS will peak to 705 MB !!!!!!
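For reference, the pointer-like behaviour of environments that this
relies on can be sketched as follows (my own minimal example, not the
original code from the report):

```r
## Environments are passed around by reference, not copied on
## assignment, which is what makes them usable as pointers.
e <- new.env()
e$x <- 1
f <- e        # f and e now refer to the same environment
f$x <- 2      # modifying through f ...
e$x           # ... is visible through e: the value is 2
```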


Are there any plans to make "methods" usable with
large datasets?



L.
#
Laurent Gautier wrote:
The memory growth seems real, but its connection to "environments as
slots" is unclear.

The only recent change that sounds relevant is the modification to
ensure that methods are evaluated in an environment that reflects the
lexical scope of the method's definition.  That does create a new
environment for each call to a generic function, but has nothing to do
with slots being environments.
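That per-call environment is easy to observe; a small illustration of
my own, using only base R:

```r
## Each call to an R function creates a fresh evaluation
## environment, so two calls never share one.
f <- function() environment()
e1 <- f()
e2 <- f()
identical(e1, e2)   # FALSE: every call got its own environment
```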

It's possible there is some sort of "memory leak" or extra copying
there, but I'm not familiar enough with the details of that code to say
for sure.

Notice that the following workaround has no bad effects on memory
(suggesting that the extra environment in evaluating generics may in
fact be relevant):

R> setClass("A", representation(a="matrix"))
[1] "A"
R> aa <- matrix(600^2, 50)
R> a1 <- new("A")
R> a1@a <- aa
R> gc()
         used (Mb) gc trigger (Mb)
Ncells 370247  9.9     531268 14.2
Vcells  87522  0.7     786432  6.0



The general solution for dealing with large objects is likely to involve
some extensions to R to allow "reference" objects, for which the
programmer is responsible for any copying.

Environments themselves are not quite adequate for this purpose, since
different "references" to the same environment cannot have different
attributes.
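The limitation can be seen directly; a sketch of my own, assuming
that assigning an attribute through one reference modifies the shared
environment in place:

```r
## Attributes live on the environment object itself, so every
## reference to it necessarily sees the same attributes.
e <- new.env()
f <- e                      # second reference to the same environment
attr(f, "role") <- "ref1"   # set an attribute through f
attr(e, "role")             # "ref1": visible through e as well
```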

John

  
    
#
On Mon, 2 Jun 2003, John Chambers wrote:

            
That was (just) prior to 1.7.0.
You have managed to store Laurent's 140MB matrix in less than 1MB! :-)

If you use matrix(0, 600^2, 50) you get essentially the same pattern
as Laurent did.
Wrapping them in lists is the easiest way to deal with this.

luke
#
Yes, wrapping them up in a list is good. You cannot use environments
directly, for various reasons. Try to do it, then quit R and save the
workspace, then restart R to reload the workspace, and you will see
the problem (at least this was the case for R v1.6.2).
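The list-wrapping idiom under discussion can be sketched like this
(hypothetical names; note that the environment keeps its reference
semantics even when stored inside a list):

```r
## A list wrapping an environment: the list itself is an ordinary
## value, but the environment inside it is still shared by reference.
ref <- list(.env = new.env())
ref$.env$data <- 1:5

double_data <- function(r) {
  ## writes into the shared environment; only the small list
  ## wrapper is local to the call, not the data
  r$.env$data <- r$.env$data * 2
  invisible(NULL)
}

double_data(ref)
ref$.env$data   # updated in place through the reference
```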

Another comment: a while ago I compared storing environments in lists,
i.e. ref$.env or ref[[".env"]], with storing them as attributes, i.e.
attr(ref, ".env"), and found that it is faster to retrieve an
environment if it is stored as an attribute. This might be useful to
know if you are going to access your "referenced" data many times.
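A rough version of the comparison Henrik describes (my own sketch;
absolute and relative timings will vary across R versions and
machines):

```r
## Retrieve an environment stored as a list element vs. as an
## attribute, timing many repeated accesses of each.
e <- new.env()
as_list <- list(.env = e)
as_attr <- list()
attr(as_attr, ".env") <- e

n <- 100000
t_list <- system.time(for (i in 1:n) as_list$.env)[["elapsed"]]
t_attr <- system.time(for (i in 1:n) attr(as_attr, ".env"))[["elapsed"]]
c(list = t_list, attr = t_attr)
```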

Best wishes

Henrik Bengtsson

Dept. of Mathematical Statistics @ Centre for Mathematical Sciences
Lund Institute of Technology/Lund University, Sweden 
(Sweden +2h UTC, Melbourne +10 UTC, Calif. -7h UTC)
+46 708 909208 (cell), +46 46 320 820 (home), 
+1 (508) 464 6644 (global fax),
+46 46 2229611 (off), +46 46 2224623 (dept. fax)
h b @ m a t h s . l t h . s e, http://www.maths.lth.se/~hb/
#
Luke Tierney wrote:
Correct.  Oh, well.  Here's the less optimistic version:

R> setClass("A", representation(a="matrix"))
[1] "A"
R> aa = new("A")
R> aa@a <-  matrix(0, 600^2, 50)
R> gc()
           used  (Mb) gc trigger  (Mb)
Ncells   368189   9.9     667722  17.9
Vcells 36086939 275.4   54357610 414.8


A little exploration in gdb didn't show much that was surprising.  Yes,
the code copies the matrix to assign it as a slot, but nothing showed up
that was obviously much different from a similar computation that didn't
use classes & methods.

For example, a "stripped-down" analogue to assigning a slot is to assign
an attribute.  To compare with the above (both from a new R session):

R> tt = list(a=1,b=2)
R> attr(tt, "a") <- matrix(0, 600^2, 50)
R> gc()
           used  (Mb) gc trigger  (Mb)
Ncells   367539   9.9     667722  17.9
Vcells 36086917 275.4   54357588 414.8

The indication is that the two computations are roughly identical, as one
would hope.  In either case, the behavior seems to be somewhere in
between that of the minimalist assignment of the matrix and the
computations for new("A",...).  Which is what one would expect, if there
is some additional copying going on somewhere in dispatching or
evaluating the method for initialize().
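In later versions of R, suspected extra copies can be observed
directly with tracemem() (a facility added after this thread was
written; it requires a build with memory profiling enabled). A sketch:

```r
## tracemem() prints a message each time the traced object is
## duplicated, which helps locate unexpected copies.
m <- matrix(0, 100, 100)
if (isTRUE(capabilities("profmem")[[1]])) tracemem(m)
x <- m          # no copy yet: R copies on modification, not binding
x[1, 1] <- 1    # this write forces the duplication for x
```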

But this hardly seems to justify a diatribe, and it doesn't point to a
likely high-leverage fix.

Without some more specific guidance or ideas, a lot of time could be
spent on this without much chance of profit.
Yes, that's what the current OOP package in Omegahat does.  But it's not
a long-term solution, because now you have a list object, which is not
what you intended.

John