
memory management

16 messages · Florent D., Bert Gunter, Sam Steingold +3 more

#
  > zz <- data.frame(a = 1:3, b = 4:6)
  > zz
    a b
  1 1 4
  2 2 5
  3 3 6
  > a <- zz$a
  > a
  [1] 1 2 3
  > a[2] <- 100
  > a
  [1]   1 100   3
  > zz
    a b
  1 1 4
  2 2 5
  3 3 6
clearly a is a _copy_ of its namesake column in zz.

when was the copy made? when a was modified? at assignment?

is there a way to find out how much memory an object takes?

gc() appears not to reclaim all memory after rm() - can anyone confirm?

thanks!
#
This should help:
  > zz <- data.frame(a = rnorm(1e6), b = rnorm(1e6))
  > memory.size()
  [1] 15.26
  > print(object.size(zz), units = "Mb")
  15.3 Mb
  > a <- zz$a
  > memory.size()
  [1] 15.26
  > print(object.size(a), units = "Mb")
  7.6 Mb
  > a[1] <- 0
  > memory.size()
  [1] 22.89
  > print(object.size(a), units = "Mb")
  7.6 Mb

You can see that a <- zz$a really has no impact on your memory usage.
It is when you start modifying it that R needs to store a whole new
object in memory.
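The copy-on-modify behaviour is visible directly with tracemem(), which prints a message the first time a traced object is duplicated (a minimal sketch; tracemem() needs an R build with memory profiling enabled, the default for CRAN binaries):

```r
# Watch copy-on-modify: tracemem() reports when an object is duplicated.
zz <- data.frame(a = 1:3, b = 4:6)
a <- zz$a         # no copy yet: a still shares memory with zz$a
tracemem(a)       # start tracing duplications of a
a[2] <- 100L      # the copy happens here, at the first modification
untracemem(a)
zz$a              # unchanged: zz keeps its original column
```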
On Thu, Feb 9, 2012 at 5:17 PM, Sam Steingold <sds at gnu.org> wrote:
#
indeed, these are very useful, thanks.

ls reports these objects larger than 100k:

behavior : 390.1 Mb
mydf : 115.3 Mb
nb : 0.2 Mb
pl : 1.2 Mb

however, top reports that R uses 1.7Gb of RAM (RSS) - even after gc().
what part of R is using the 1GB of RAM?
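A per-object report like the one above can be produced with a small helper along these lines (my own sketch; I do not know which function was actually used to generate the report):

```r
# Print objects in an environment that exceed a size threshold, largest first.
big_objects <- function(min_bytes = 100 * 1024, env = globalenv()) {
  nms <- ls(env, all.names = TRUE)
  sizes <- vapply(nms,
                  function(n) as.numeric(object.size(get(n, envir = env))),
                  numeric(1))
  sizes <- sort(sizes[sizes >= min_bytes], decreasing = TRUE)
  for (n in names(sizes))
    cat(n, ":", format(structure(sizes[[n]], class = "object_size"),
                       units = "Mb"), "\n")
  invisible(sizes)
}
```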
17 days later
#
It appears that the intermediate data in functions is never GCed even
after the return from the function call.
R's RSS is 4 Gb (after a gc()) and

sum(unlist(lapply(lapply(ls(),get),object.size)))
[1] 1009496520

(less than 1 GB)

how do I figure out where the 3GB of uncollected garbage is hiding?
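One cross-check before blaming the collector: gc() returns its own accounting of live cells, and on a 64-bit build each Vcell is 8 bytes, so R's idea of the live vector heap can be converted to MB and compared with the RSS that top shows (a sketch of the comparison, not an explanation of the gap):

```r
# gc() returns a matrix (rows Ncells/Vcells) with a "used" column: what R
# itself considers live after collection.
g <- gc()
live_vector_mb <- g["Vcells", "used"] * 8 / 2^20  # Vcells are 8 bytes each
live_vector_mb  # compare with the resident size the OS reports
```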
#
This appears to be the sort of query that (with apologies to other R
gurus) only Brian Ripley or Luke Tierney could figure out. R generally
passes by value into function calls (but not *always*), so often
multiple copies of objects are made during the course of calls. I
would speculate that this is what might be going on below -- maybe
even that's what you meant.

Just a guess on my part, of course, so treat accordingly.

-- Bert
On Mon, Feb 27, 2012 at 1:03 PM, Sam Steingold <sds at gnu.org> wrote:

#
My basic worry is that the GC does not work properly,
i.e., the unreachable data is never collected.

#
On Tue, Feb 28, 2012 at 11:57 AM, Sam Steingold <sds at gnu.org> wrote:
Highly unlikely. Such basic inner R code has been well tested over 20
years.  I believe that you merely don't understand the inner guts of
what R is doing here, which is the essence of my response. (Clearly, I
make no claim that I do either).

I suggest you move on.

-- Bert

#
Look into environments that may be stored
with your data.  object.size(obj) does not
report on the size of the environment(s)
associated with obj.  E.g.,

  > f <- function(n) {
  +    d <- data.frame(y=rnorm(n), x1=rnorm(n), x2=rnorm(n))
  +    terms(data=d, y~.)
  + }
  > z <- f(1e6)
  > object.size(z)
  1760 bytes
  > eapply(environment(z), object.size)
  $d
  24000520 bytes

  $n
  32 bytes
That happens because formula objects (like function
objects) contain a reference to the environment in
which they were created, and that environment will
not be destroyed until the last reference to it is
gone.  You might be able to write code using, e.g.,
the codetools package to walk through your objects
looking for all distinct environments that they
reference (directly and indirectly, via ancestors of
environments directly referenced).  Then you can add
up the sizes of things in those environments.

Another possible reason for your problem is that by using ls()
instead of ls(all=TRUE) you are not looking at datasets
whose names start with a dot.
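For example (with a hypothetical object named .hidden):

```r
# Objects whose names start with a dot are skipped by a plain ls().
.hidden <- rnorm(1e5)
".hidden" %in% ls()                   # FALSE
".hidden" %in% ls(all.names = TRUE)   # TRUE
```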

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
thanks, but I see nothing like that:

for (n in ls(all.names = TRUE)) {
  o <- get(n)
  print(object.size(o), units="Kb")
  e <- environment(o)
  if (!identical(e,NULL) && !identical(e,.GlobalEnv)) {
    print(e)
    print(eapply(e,object.size))
  }
}
25.8 Kb
0.5 Kb
49.1 Kb
0.1 Kb
30.8 Kb
13.6 Kb
17.4 Kb
59.4 Kb
52.2 Kb
0.1 Kb
3.9 Kb
49.1 Kb
21.2 Kb
0.1 Kb
0.1 Kb
51 Kb
13.2 Kb
53.5 Kb
18.1 Kb
64.3 Kb
25.8 Kb
33.5 Kb
0.1 Kb
0.1 Kb
8 Kb
10 Kb
15.7 Kb
15.6 Kb
9.9 Kb
401672.7 Kb
19.1 Kb
76 Kb
12 Kb
32.4 Kb
156.3 Kb
13.1 Kb
20.5 Kb
21.8 Kb
10.8 Kb

sum(unlist(lapply(lapply(ls(all.names = TRUE),get),object.size)))
[1] 412351928

i.e., the data totals about 400MB.
why does the process take in excess of 1GB?

top: 1235m 1.1g 4452 S    0 14.6   7:12.27 R
#
You need to walk through the objects, checking for
environments on each component or attribute of an
object.  You also have to look at the parent.env
of each environment found.  E.g.,
  > f <- function(n) {
  +   d <- data.frame(y = rnorm(n), x = rnorm(n))
  +   lm(y ~ poly(x, 4), data=d)
  + }
  > z <- f(1e5)
  > environment(z)
  NULL
  > object.size(z)
  21610708 bytes
  > sapply(z, object.size)
   coefficients     residuals       effects 
            384       4400104       1200336 
           rank fitted.values        assign 
             32       4400104            56 
             qr   df.residual       xlevels 
        7601232            32           104 
           call         terms         model 
            508          2804       4004276
  > environment(z$terms)
  <environment: 0x0abb86e4>
  > eapply(environment(z$terms), object.size)
  $d
  1600448 bytes

  $n
  32 bytes

Coding this is tedious; the codetools package may make it
easier.  Summing the sizes may well give an overestimate
of the memory actually used, since several objects may
share the same memory.
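A rough sketch of that walk (my own helper, not from base R or codetools; it stops at the global, base, and empty environments, guards against revisiting an environment, and makes no attempt at the sharing correction mentioned above):

```r
# Recursively collect the distinct environments reachable from an object via
# its components, attributes, any environment it carries (functions,
# formulas), and the parents of those environments.
find_envs <- function(x, seen = list()) {
  if (is.environment(x)) {
    for (s in seen) if (identical(s, x)) return(seen)   # already visited
    if (identical(x, globalenv()) || identical(x, baseenv()) ||
        identical(x, emptyenv()))
      return(seen)                                      # stop at well-known envs
    seen <- c(seen, list(x))
    seen <- find_envs(parent.env(x), seen)
    for (nm in ls(x, all.names = TRUE))
      seen <- find_envs(get(nm, envir = x), seen)
    return(seen)
  }
  e <- environment(x)      # NULL unless x carries one (function, formula, ...)
  if (!is.null(e)) seen <- find_envs(e, seen)
  if (is.list(x)) for (el in x) seen <- find_envs(el, seen)
  for (at in attributes(x)) seen <- find_envs(at, seen)
  seen
}
```

Applied to the lm fit z above, environment(z) is NULL, but find_envs(z) descends into z$terms, finds the environment holding d and n, and its contents can then be totalled with eapply(e, object.size).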

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
so why doesn't object.size do that?
I am not doing any modeling. No "~". No formulas.
The whole thing is just a bunch of data frames.
I do a lot of strsplit, unlist, & subsetting, so I can imagine the
RSS being triple the total size of my data if the intermediate
results are not released.
#
I can only give some generalities about that.  Using lots of
small chunks of memory (like short strings) may cause fragmentation
(wasted space between blocks of memory).  Depending on your operating
system, calling free(pointerToMemoryBlock) may or may not reduce the
virtual memory size of the process, so something like '/bin/ps -o vsize,size'
or Process Explorer may only show the high water mark of memory usage.

Another way to gauge the total size of the visible data and the
environments associated with it is to call save(list=objects(all=TRUE),
compress=FALSE,file="someFile") and look at the size of the file.
Headers probably have a different size in the file than in the process,
but it can give some hints about how much hidden environments are
adding to things.
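In code, the suggestion looks roughly like this (a sketch; file.info() rather than the newer file.size() to stay with the R of that era):

```r
# Serialize everything visible (dot-names included), uncompressed, and use
# the file size as a rough proxy for data plus captured environments.
tf <- tempfile()
save(list = ls(all.names = TRUE), file = tf, compress = FALSE)
file.info(tf)$size   # bytes; compare with the sum of object.size() values
unlink(tf)
```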

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
On Wednesday, 29 February 2012 at 11:42 -0500, Sam Steingold wrote:
I think you're simply hitting a (terrible) OS limitation. Linux is very
often not able to reclaim the memory R has used because it's fragmented.
The OS can only get the pages back if nothing is allocated above them,
and most of the time there is data after the object you remove. I'm not
able to give you a more precise explanation, but this is apparently a
known problem and it is hard to fix.

At least, I can confirm that after doing a lot of merges on big data
frames, R can keep using 3GB of shared memory on my box even if gc()
only reports 500MB currently used. Restarting R makes memory use go down
to the normal expectations.


Regards
#
compacting garbage collector is our best friend!
#
On Wed, 29 Feb 2012, Sam Steingold wrote:

Which R does not use because of the problems it would create for
external C/Fortran code on which R heavily relies.
#
Well, you know better, of course.

However, I cannot stop wondering if this really is absolutely necessary.
If you do not call GC while the external C/Fortran code is running, you
should be fine with a compacting garbage collector.
If you access the C/Fortran data (managed by the C/Fortran code), then
it should live in a separate universe from the one managed by R GC.