
Seeing some memory leak with foreach...

5 messages · Jonathan Greenberg, Simon Urbanek, Aaron A. King +1 more

#
r-sig-geo'ers:

I always hate doing this, but the test function/dataset is going to be
hard to pass along to the list.  Basically: I have a foreach call that
has no superassignments or strange environment manipulations, but it
resulted in the nodes showing a slow but steady memory creep over
time.  I was using a parallel backend for foreach via doParallel.  Has
anyone else seen this behavior (unexplained memory creep)?  Is there a
good way to "flush" a node?  I'm trying to embed gc() at the top of my
foreach function, but this process took about 24 hours to reach a
memory-overuse stage (multiple iterations would have passed, i.e. the
function would have been called more than once on a single node), so
I'm not sure whether this will work; I figured I'd ask the group about
it.  I've seen other people post about this on various boards with no
clear response/solution (gc() apparently didn't work for them).

One other note: there should be no data returned as output, because
the results are written to disk from within the foreach function (i.e.
the function that foreach executes returns NULL).

I'll see if I can work up a faster executing example later, but wanted
to see if there are some general pointers for dealing with memory
leaks using a parallel system.
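For concreteness, the setup described above might look something like this sketch (process_tile() is a hypothetical stand-in for the real worker, and a small doParallel cluster is assumed; this is not the actual code from the report):

```r
library(doParallel)

## stand-in for the real worker: writes its result to disk, returns NULL
process_tile <- function(i) {
  out <- file.path(tempdir(), sprintf("tile_%03d.rds", i))
  saveRDS(sum(rnorm(1e4)), out)
  invisible(NULL)
}

cl <- makeCluster(2)
registerDoParallel(cl)

res <- foreach(i = 1:10) %dopar% {
  gc()               # the attempted per-iteration "flush" on the worker
  process_tile(i)    # output goes to disk, not back to the master
  NULL
}

stopCluster(cl)
```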

--j
#
On Feb 26, 2013, at 9:49 AM, Jonathan Greenberg wrote:
Just some general technical notes on memory management:

a) R is pretty good at releasing all objects on gc() - that is typically not the problem (in my experience). If you use third-party packages with native code, especially ones accessing external libraries, memory leaks in those packages are a more likely culprit. The second thing to be aware of is environments holding objects you'd rather not have them hold. This can be the global workspace, or environments stashed away inside other objects (model objects typically contain an environment with the data etc.).
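As a small illustration of the "objects stashed away" case (a sketch, not from the thread): a model object can keep alive the entire evaluation frame of the function that built it, via its formula's environment:

```r
make_model <- function() {
  big <- rnorm(1e6)                       # large local object
  d <- data.frame(x = 1:10, y = (1:10) + rnorm(10))
  lm(y ~ x, data = d)                     # the formula captures this frame
}

m <- make_model()
env <- environment(formula(m))
exists("big", envir = env, inherits = FALSE)  # TRUE: 'big' is still reachable
```

As long as `m` is alive, gc() cannot free `big`; one common remedy is simply to avoid creating large temporaries in the frame where the formula is built.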

b) Note that gc() alone is of little use unless you make sure unused objects are out of scope. If you run gc() in the middle of a function (or at the end), it will only clear out temporary objects, not objects that were assigned locally and are unused later but still in scope. So to make sure you're doing the right thing, you may want to split the heavy lifting into chunks run in a local environment that goes out of scope, retaining only the intermediate result before running gc(). This matters even without an explicit gc(), because there will be implicit GC calls anyway.
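The chunking pattern in (b) might be sketched like this: the heavy intermediate lives only inside local(), so the subsequent gc() (or any implicit GC) can actually reclaim it:

```r
result <- local({
  tmp <- matrix(rnorm(1e6), ncol = 100)   # heavy intermediate, scoped to this block
  colSums(tmp)                            # only the reduced result escapes
})
gc()  # 'tmp' is out of scope here, so this can reclaim its memory
```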

c) Even if R releases memory, the OS is often unable to reclaim it. In theory this is no big problem, because the memory gets reused later, but it can grow into a problem if overall memory usage is high or if for some reason the memory becomes heavily fragmented (again, for a function call without side effects that should not be the case, but beware of the side effects).

Cheers,
Simon
#
Hi Jonathan,

I've run into a similar problem before.  It took more than 2 weeks to
track down, but when I did, it turned out to be associated with
resolving dynamically-linked symbols ('getNativeSymbolInfo').  My code
was doing this very frequently, which led to a memory leak.  Once I
found the source of the problem, I reworked my code to avoid it (a
good idea anyway, since the symbol resolution was largely redundant)
and never got to the very bottom of the problem, which may very well
have been in the Linux kernel rather than in R itself.  This
may have nothing to do with the problem you're experiencing: if it
does, I hope this note will save you some time.  It would be
interesting to hear about the source of the memory leaks, whatever it
turns out to be.
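The rework Aaron describes presumably amounts to hoisting the symbol resolution out of the hot path. A hypothetical sketch ('my_routine' and 'mypkg' are placeholder names, not real routines from the thread):

```r
## Resolve the native symbol once, up front, instead of on every iteration
sym <- getNativeSymbolInfo("my_routine", PACKAGE = "mypkg")

for (i in seq_len(1e6)) {
  .Call(sym, i)   # reuse the cached handle; no repeated lookups
}
```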

Have you tried another parallel backend or a parallelization approach
other than 'foreach'?

Aaron
On Tue, Feb 26, 2013 at 9:49 AM, Jonathan Greenberg <jgrn at illinois.edu> wrote:
#
On Tue, Feb 26, 2013 at 6:49 AM, Jonathan Greenberg <jgrn at illinois.edu> wrote:
Hi Jonathan,

have you tried replacing the foreach(...) with a simple for loop to
verify that the problem really is in the parallel execution, and not
simply in the R code?

I second Simon's suggestion to pay careful attention to possible
side effects and to objects not going out of scope when you think they
should (for example, if something somewhere references the environment
of a function that has already completed, that environment and all
objects within it remain in scope).
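A serial version of that check might look like this sketch (do_one() is a placeholder for the real per-iteration function):

```r
do_one <- function(i) invisible(NULL)  # stand-in for the real worker

for (i in 1:100) {
  do_one(i)
  if (i %% 25 == 0) print(gc())  # watch the "used" column across iterations
}
```

If memory still creeps here, the leak is in the R code itself rather than in the parallel machinery.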

Peter
#
Thanks all -- I'm going to try out these suggestions (running a for
loop, checking the function a bit more closely, trying some different
backends) and get back to you!

--j

On Wed, Feb 27, 2013 at 12:36 PM, Peter Langfelder
<peter.langfelder at gmail.com> wrote:
--
Jonathan A. Greenberg, PhD
Assistant Professor
Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
607 South Mathews Avenue, MC 150
Urbana, IL 61801
Phone: 217-300-1924
http://www.geog.illinois.edu/~jgrn/
AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007