Thanks for your reply, Duncan - you hit the nail on the head (as usual, the problem turned out to sit between the keyboard and the chair :)). My function does return regression models that contain the input formulae together with the associated (big) environment.

Peter
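PS for the archives: the fix amounts to dropping the captured environment before returning the fits. A rough sketch of the idea (all names here are illustrative, assuming lm()-style models; model = FALSE drops the stored model frame, which holds another reference to the environment - keep it and fix up its terms attribute instead if you need it downstream):

> myFnc <- function() {
+   big <- rnorm(1e7)                   # large intermediate object
+   d <- data.frame(x = 1:100, y = rnorm(100))
+   fit <- lm(y ~ x, data = d, model = FALSE)
+   # y ~ x captured myFnc's evaluation frame, which also holds 'big';
+   # repoint the environment so the frame can be garbage collected
+   environment(fit$terms) <- globalenv()
+   fit
+ }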
On Thu, Jan 3, 2013 at 4:41 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
On 13-01-03 7:01 PM, Peter Langfelder wrote:
Hello all,

I am running into a problem with garbage collection not being able to free up all memory. Unfortunately I am unable to provide a minimal self-contained example, although I can provide a self-contained example if anyone feels like wading through some 600 lines of code. I would love to isolate the relevant parts from the code, but whenever I try to run a simpler example, the problem does not appear.

I run an algorithm that repeats the same calculation (on sampled, i.e. different, data) in each iteration. Each iteration uses relatively large intermediate objects and calculations but returns a smaller result; these results are then collated and returned from the main function (call it myFnc).

The problem is that memory used by the intermediate calculations (it is difficult to say whether it's objects or memory needed for apply calls) does not seem to be freed up even after doing explicit garbage collection using gc() within the loop. Thus, a call of something like

result = myFnc(arguments)

results in some memory that does not seem allocated to any visible objects and yet is not freed up using gc(). After executing an actual call to the offending function, gc() tells me that Vcells use 538.6 Mb, but the sum of object.size() of all objects listed by ls(all.names = TRUE) is only 183.3 Mb.

The thing is that if I remove 'result' using rm(result) and do gc() again, the memory used decreases by a lot: gc() now reports 110.3 Mb used in Vcells; this roughly corresponds to the sum of the sizes of all objects returned by ls() (after removing 'result'), which is now 108.7 Mb. So used memory went down by something like 428 Mb, but the object.size of 'result' is only about 75 Mb.

Thus, it seems that the memory used by internal operations in myFnc that should be freed up upon the completion of the function call cannot be released by garbage collection until the result of the function call is also removed.

Like I said, I tried to replicate this behaviour in simple examples but could not. My question is: is this behaviour to be expected in complicated code, or is it a bug that should be reported? Is there any way around it?

Thanks in advance for any insights or pointers.
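For concreteness, the comparison I am making looks roughly like the following sketch (myFnc and arguments stand for the actual function and inputs; the numbers in the comments are the ones quoted above):

> result = myFnc(arguments)
> gc()                                  # Vcells used: ~538.6 Mb
> sum(sapply(ls(all.names = TRUE),
+            function(nm) object.size(get(nm))))   # only ~183.3 Mb
> rm(result)
> gc()                                  # Vcells used: ~110.3 Mb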
I doubt if it is a bug. Remember the warning from ?object.size: "Exactly which parts of the memory allocation should be attributed to which object is not clear-cut. This function merely provides a rough indication: it should be reasonably accurate for atomic vectors, but does not detect if elements of a list are shared, for example. (Sharing amongst elements of a character vector is taken into account, but not that between character vectors in a single object.)
If I understand correctly, sharing would inflate the sum of object.size()'s relative to the values returned by gc(), correct? The opposite is happening in my case.
The calculation is of the size of the object, and excludes the space needed to store its name in the symbol table. Associated space (e.g. the environment of a function and what the pointer in a EXTPTRSXP points to) is not included in the calculation." For a simple example:
> x <- 1:1000000
> object.size(x)
4000024 bytes
> e <- new.env()
> object.size(e)
28 bytes
> e$x <- x
> object.size(e)
28 bytes

At the end, e is an environment holding an object of 4 million bytes, but its size is just 28 bytes. You'll get environments whenever you return functions from other functions (e.g. what approxfun() does), or when you create formulas, e.g.

> f <- function() {
+   x <- 1:1000000
+   y <- rnorm(1000000)
+   y ~ x
+ }
> fla <- f()
> object.size(fla)
372 bytes

Now fla is the formula, but the data vectors x and y are part of its environment.
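You can see the retained data directly (continuing the sketch above):

> ls(environment(fla))
[1] "x" "y"
> object.size(environment(fla)$x)
4000024 bytes

Anything that keeps fla reachable, such as a returned model object that stores it, keeps x and y alive as well. If the environment is not actually needed, one way out is to replace it, e.g. environment(fla) <- globalenv().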