
memory management

16 messages · Florent D., Bert Gunter, Sam Steingold +3 more

#
  > zz <- data.frame(a = 1:3, b = 4:6)
  > zz
    a b
  1 1 4
  2 2 5
  3 3 6
  > a <- zz$a
  > a
  [1] 1 2 3
  > a[2] <- 100
  > a
  [1]   1 100   3
  > zz
    a b
  1 1 4
  2 2 5
  3 3 6
clearly a is a _copy_ of its namesake column in zz.

when was the copy made? when a was modified? at assignment?

is there a way to find out how much memory an object takes?

gc() appears not to reclaim all memory after rm() - can anyone confirm?

thanks!
#
This should help:
  > zz <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
  > memory.size()
  [1] 15.26
  > print(object.size(zz), units = "Mb")
  15.3 Mb
  > a <- zz$a
  > memory.size()
  [1] 15.26
  > print(object.size(a), units = "Mb")
  7.6 Mb
  > a[1] <- 0
  > memory.size()
  [1] 22.89
  > print(object.size(a), units = "Mb")
  7.6 Mb

You can see that a <- zz$a really has no impact on your memory usage.
It is when you start modifying it that R needs to store a whole new
object in memory.
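The copy-on-modify behaviour is visible directly with tracemem(), which prints a message the first time a traced object is duplicated (a minimal sketch; tracemem() needs an R build with memory profiling enabled, the default for CRAN binaries):

```r
# Watch copy-on-modify: tracemem() reports when an object is duplicated.
zz <- data.frame(a = 1:3, b = 4:6)
a <- zz$a         # no copy yet: a still shares memory with zz$a
tracemem(a)       # start tracing duplications of a
a[2] <- 100L      # the copy happens here, at the first modification
untracemem(a)
zz$a              # unchanged: zz keeps its original column
```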
On Thu, Feb 9, 2012 at 5:17 PM, Sam Steingold <sds at gnu.org> wrote:
#
indeed, these are very useful, thanks.

ls reports these objects larger than 100k:

behavior : 390.1 Mb
mydf : 115.3 Mb
nb : 0.2 Mb
pl : 1.2 Mb

however, top reports that R uses 1.7Gb of RAM (RSS) - even after gc().
what part of R is using the 1GB of RAM?
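A per-object report like the one above can be produced with a small helper along these lines (my own sketch; I do not know which function was actually used to generate the report):

```r
# Print objects in an environment that exceed a size threshold, largest first.
big_objects <- function(min_bytes = 100 * 1024, env = globalenv()) {
  nms <- ls(env, all.names = TRUE)
  sizes <- vapply(nms,
                  function(n) as.numeric(object.size(get(n, envir = env))),
                  numeric(1))
  sizes <- sort(sizes[sizes >= min_bytes], decreasing = TRUE)
  for (n in names(sizes))
    cat(n, ":", format(structure(sizes[[n]], class = "object_size"),
                       units = "Mb"), "\n")
  invisible(sizes)
}
```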
17 days later
#
It appears that the intermediate data in functions is never GCed even
after the return from the function call.
R's RSS is 4 Gb (after a gc()) and

sum(unlist(lapply(lapply(ls(),get),object.size)))
[1] 1009496520

(less than 1 GB)

how do I figure out where the 3GB of uncollected garbage is hiding?
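One cross-check before blaming the collector: gc() returns its own accounting of live cells, and on a 64-bit build each Vcell is 8 bytes, so R's idea of the live vector heap can be converted to MB and compared with the RSS that top shows (a sketch of the comparison, not an explanation of the gap):

```r
# gc() returns a matrix (rows Ncells/Vcells) with a "used" column: what R
# itself considers live after collection.
g <- gc()
live_vector_mb <- g["Vcells", "used"] * 8 / 2^20  # Vcells are 8 bytes each
live_vector_mb  # compare with the resident size the OS reports
```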
#
This appears to be the sort of query that (with apologies to other R
gurus) only Brian Ripley or Luke Tierney could figure out. R generally
passes by value into function calls (but not *always*), so often
multiple copies of objects are made during the course of calls. I
would speculate that this is what might be going on below -- maybe
even that's what you meant.

Just a guess on my part, of course, so treat accordingly.

-- Bert
On Mon, Feb 27, 2012 at 1:03 PM, Sam Steingold <sds at gnu.org> wrote:

#
My basic worry is that the GC does not work properly,
i.e., the unreachable data is never collected.

#
On Tue, Feb 28, 2012 at 11:57 AM, Sam Steingold <sds at gnu.org> wrote:
Highly unlikely. Such basic inner R code has been well tested over 20
years.  I believe that you merely don't understand the inner guts of
what R is doing here, which is the essence of my response. (Clearly, I
make no claim that I do either).

I suggest you move on.

-- Bert

#
Look into environments that may be stored
with your data.  object.size(obj) does not
report on the size of the environment(s)
associated with obj.  E.g.,

  > f <- function(n) {
  +    d <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n))
  +    terms(data=d, y~.)
  + }
  > z <- f(1e6)
  > object.size(z)
  1760 bytes
  > eapply(environment(z), object.size)
  $d
  24000520 bytes

  $n
  32 bytes
That happens because formula objects (like function
objects) contain a reference to the environment in
which they were created, and that environment will
not be destroyed until the last reference to it is
gone.  You might be able to write code using, e.g.,
the codetools package to walk through your objects
looking for all distinct environments that they
reference (directly and indirectly, via ancestors of
environments directly referenced).  Then you can add
up the sizes of things in those environments.

Another possible reason for your problem is that by using ls()
instead of ls(all=TRUE) you are not looking at datasets
whose names start with a dot.
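For example (with a hypothetical object named .hidden):

```r
# Objects whose names start with a dot are skipped by a plain ls().
.hidden <- rnorm(1e5)
".hidden" %in% ls()                   # FALSE
".hidden" %in% ls(all.names = TRUE)   # TRUE
```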

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
thanks, but I see nothing like that:

for (n in ls(all.names = TRUE)) {
  o <- get(n)
  print(object.size(o), units="Kb")
  e <- environment(o)
  if (!identical(e,NULL) && !identical(e,.GlobalEnv)) {
    print(e)
    print(eapply(e,object.size))
  }
}
25.8 Kb
0.5 Kb
49.1 Kb
0.1 Kb
30.8 Kb
13.6 Kb
17.4 Kb
59.4 Kb
52.2 Kb
0.1 Kb
3.9 Kb
49.1 Kb
21.2 Kb
0.1 Kb
0.1 Kb
51 Kb
13.2 Kb
53.5 Kb
18.1 Kb
64.3 Kb
25.8 Kb
33.5 Kb
0.1 Kb
0.1 Kb
8 Kb
10 Kb
15.7 Kb
15.6 Kb
9.9 Kb
401672.7 Kb
19.1 Kb
76 Kb
12 Kb
32.4 Kb
156.3 Kb
13.1 Kb
20.5 Kb
21.8 Kb
10.8 Kb

sum(unlist(lapply(lapply(ls(all.names = TRUE),get),object.size)))
[1] 412351928

i.e., the data totals about 400MB.
why does the process take in excess of 1GB?

top: 1235m 1.1g 4452 S    0 14.6   7:12.27 R
#
You need to walk through the objects, checking for
environments on each component or attribute of an
object.  You also have to look at the parent.env
of each environment found.  E.g.,
  > f <- function(n) {
  +   d <- data.frame(y = rnorm(n), x = rnorm(n))
  +   lm(y ~ poly(x, 4), data=d)
  + }
  > z <- f(1e5)
  > environment(z)
  NULL
  > object.size(z)
  21610708 bytes
  > sapply(z, object.size)
   coefficients     residuals       effects 
            384       4400104       1200336 
           rank fitted.values        assign 
             32       4400104            56 
             qr   df.residual       xlevels 
        7601232            32           104 
           call         terms         model 
            508          2804       4004276
  > environment(z$terms)
  <environment: 0x0abb86e4>
  > eapply(environment(z$terms), object.size)
  $d
  1600448 bytes

  $n
  32 bytes

Coding this is tedious; the codetools package may make it
easier.  Summing the sizes may well give an overestimate
of the memory actually used, since several objects may
share the same memory.
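A rough sketch of that walk (my own helper, not from base R or codetools; it stops at the global, base, and empty environments, guards against revisiting an environment, and makes no attempt at the sharing correction mentioned above):

```r
# Recursively collect the distinct environments reachable from an object via
# its components, attributes, any environment it carries (functions,
# formulas), and the parents of those environments.
find_envs <- function(x, seen = list()) {
  if (is.environment(x)) {
    for (s in seen) if (identical(s, x)) return(seen)   # already visited
    if (identical(x, globalenv()) || identical(x, baseenv()) ||
        identical(x, emptyenv()))
      return(seen)                                      # stop at well-known envs
    seen <- c(seen, list(x))
    seen <- find_envs(parent.env(x), seen)
    for (nm in ls(x, all.names = TRUE))
      seen <- find_envs(get(nm, envir = x), seen)
    return(seen)
  }
  e <- environment(x)      # NULL unless x carries one (function, formula, ...)
  if (!is.null(e)) seen <- find_envs(e, seen)
  if (is.list(x)) for (el in x) seen <- find_envs(el, seen)
  for (at in attributes(x)) seen <- find_envs(at, seen)
  seen
}
```

Applied to the lm fit z above, environment(z) is NULL, but find_envs(z) descends into z$terms, finds the environment holding d and n, and its contents can then be totalled with eapply(e, object.size).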

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
so why doesn't object.size do that?
I am not doing any modeling. No "~". No formulas.
The whole thing is just a bunch of data frames.
I do a lot of strsplit, unlist, & subsetting, so I can imagine the
RSS being triple the total size of my data if the intermediate
results are not released.
#
I can only give some generalities about that.  Using lots of
small chunks of memory (like short strings) may cause fragmentation
(wasted space between blocks of memory).  Depending on your operating
system, calling free(pointerToMemoryBlock) may or may not reduce the
virtual memory size of the process, so something like '/bin/ps -o vsize,size'
or Process Explorer may only show the high water mark of memory usage.

Another way to gauge the total size of the visible data and the
environments associated with it is to call save(list=objects(all=TRUE),
compress=FALSE,file="someFile") and look at the size of the file.
Headers probably have a different size in the file than in the process,
but it can give some hints about how much hidden environments are
adding to things.
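In code, the suggestion looks roughly like this (a sketch; file.info() rather than the newer file.size() to stay with the R of that era):

```r
# Serialize everything visible (dot-names included), uncompressed, and use
# the file size as a rough proxy for data plus captured environments.
tf <- tempfile()
save(list = ls(all.names = TRUE), file = tf, compress = FALSE)
file.info(tf)$size   # bytes; compare with the sum of object.size() values
unlink(tf)
```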

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
On Wednesday, 29 February 2012 at 11:42 -0500, Sam Steingold wrote:
I think you're simply hitting a (terrible) OS limitation. Linux is very
often not able to reclaim the memory R has used because it's fragmented.
The OS can only get the pages back if nothing is allocated above them,
and most of the time there is data after the object you remove. I'm not
able to give you a more precise explanation, but this is apparently a
known problem and it is hard to fix.

At least, I can confirm that after doing a lot of merges on big data
frames, R can keep using 3GB of shared memory on my box even if gc()
only reports 500MB currently used. Restarting R makes memory use go down
to the normal expectations.


Regards
#
compacting garbage collector is our best friend!
#
On Wed, 29 Feb 2012, Sam Steingold wrote:

Which R does not use because of the problems it would create for
external C/Fortran code on which R heavily relies.
#
Well, you know better, of course.

However, I cannot stop wondering if this really is absolutely necessary.
If you do not call GC while the external C/Fortran code is running, you
should be fine with a compacting garbage collector.
If you access the C/Fortran data (managed by the C/Fortran code), then
it should live in a separate universe from the one managed by R GC.