Moderating consequences of garbage collection when in C

7 messages · Thomas Lumley, Martin Morgan, David Hinds

#
Martin Morgan <mtmorgan at fhcrc.org> wrote:
What a coincidence -- I was just going to post a question about why it
is so slow to create a STRSXP of ~10,000,000 unique elements, each ~10
characters long.  I had noticed that this seemed to show much worse
than linear scaling.  I had not thought of garbage collection as the
culprit -- but indeed it is.  By manipulating the GC trigger, I can
make this operation take as little as 3 seconds (with no GC) or as
long as 76 seconds (with 31 garbage collections).

-- Dave
4 days later
#
dhinds at sonic.net wrote:
I had done some Google searches on this issue, since it seemed like it
should not be too uncommon, but the only other hit I could come up
with was a thread from 2006:

https://stat.ethz.ch/pipermail/r-devel/2006-November/043446.html

In any case, one issue with your suggested workaround is that it
requires knowing how much additional storage is needed, which may be
an expensive operation to determine.  I've just tried implementing a
different approach, which is to define two new functions to either
disable or enable GC.  The function to disable GC first invokes
R_gc_full() to shrink the heap as much as possible, then sets a flag.
Then in R_gc_internal(), I first check that flag, and if it is set, I
call AdjustHeapSize(size_needed) and exit immediately.
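
To make the control flow concrete, here is a toy model in C of the
flag-and-grow idea described above.  It is only a sketch: ToyHeap,
toy_gc, toy_alloc and toy_run are hypothetical names, not anything in
memory.c; every cell is assumed to stay live; and the growth steps are
arbitrary.  It counts how much mark-phase work the flag skips, and does
not reproduce R's real heap policy or timings.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a heap; NOT R's real data structures. */
typedef struct {
    size_t heap_size;                /* cells available in total */
    size_t used;                     /* live cells; nothing ever dies here */
    int    gc_disabled;              /* the proposed flag */
    long   gc_count;                 /* collections actually run */
    unsigned long long cells_marked; /* total mark-phase work done */
} ToyHeap;

/* Called when the heap cannot satisfy a request of size_needed cells. */
static void toy_gc(ToyHeap *h, size_t size_needed)
{
    if (h->gc_disabled) {
        /* Fast path: skip marking entirely, just make room and return.
           The extra 50% headroom is an arbitrary choice for this model. */
        if (h->heap_size - h->used < size_needed)
            h->heap_size = h->used + size_needed + h->heap_size / 2;
        return;
    }
    h->gc_count++;
    h->cells_marked += h->used;      /* marking visits every live cell */
    /* Nothing was freed (all cells are live), so the heap must grow.
       The 20% increment is likewise arbitrary. */
    if (h->heap_size - h->used < size_needed)
        h->heap_size = h->used + size_needed + h->heap_size / 5;
}

/* Allocate n cells, triggering toy_gc when the heap is full. */
static void toy_alloc(ToyHeap *h, size_t n)
{
    if (h->heap_size - h->used < n)
        toy_gc(h, n);
    assert(h->heap_size - h->used >= n);
    h->used += n;
}

/* Allocate n_cells one at a time and return the final heap state. */
static ToyHeap toy_run(size_t n_cells, int gc_disabled)
{
    ToyHeap h = { 1000, 0, gc_disabled, 0, 0 };
    for (size_t i = 0; i < n_cells; i++)
        toy_alloc(&h, 1);
    return h;
}
```

Comparing gc_count and cells_marked from toy_run(1000000, 0) and
toy_run(1000000, 1) shows the collections, and the marking work, that
the flag avoids when everything allocated is still live.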

These calls could be used to bracket any code section that expects to
make lots of calls to R's memory allocator.  The downside is that
every path out of such a code section (including error handling) must
take care to unset the GC-disabled flag.  I think I would want to hear
from someone on the R team about whether they think this is a good
idea.

A final alternative might be to provide a vectorized version of mkChar
that would accept a char ** and use one of these methods internally,
rather than exporting the underlying methods as part of R's API.  I
don't know if there are other clear use cases where GC is a serious
bottleneck, besides constructing large vectors of mostly unique
strings.  Such a function would be less generally useful since it 
would require that the full vector of C strings be assembled at one
time.

-- Dave
#
On Tue, Nov 15, 2011 at 8:47 AM, <dhinds at sonic.net> wrote:
If .Call and .C re-enabled the GC on return from compiled code (and
threw some sort of error) that would help contain the potential
damage.

You might also want to re-enable GC if malloc() returned NULL,
rather than giving an out-of-memory error.

  -thomas
#
On 11/14/2011 11:47 AM, dhinds at sonic.net wrote:
I think this is a better approach; mine seriously understated the 
complexity of figuring out required size.
Another place where this comes up is during package load, especially for 
packages with many S4 instances.

   > gcinfo(TRUE)
   > library(Matrix)
   Garbage collection 2 = 1+0+1 (level 0) ...
   7.6 Mbytes of cons cells used (40%)
   1.1 Mbytes of vectors used (18%)
   ...
   Garbage collection 58 = 39+9+10 (level 2) ...
   39.4 Mbytes of cons cells used (75%)
   2.9 Mbytes of vectors used (47%)

and continuing

   > library(IRanges)
   ...
   Garbage collection 89 = 60+14+15 (level 1) ...
   63.1 Mbytes of cons cells used (80%)
   4.3 Mbytes of vectors used (53%)

Also, something like

   > system.time(as.character(1:10000000))
   ...
   Garbage collection 124 = 60+14+50 (level 2) ...
   596.1 Mbytes of cons cells used (95%)
   226.3 Mbytes of vectors used (69%)
      user  system elapsed
   61.908   0.297  62.303

might be an R-level manifestation of the same problem.

Being able to disable / enable the GC seems like a useful patch, and I 
hope this is interesting enough for the R-core team.

A more fundamental issue seems to be garbage collection when there are a 
lot of SEXP in play

   > system.time(gc())
      user  system elapsed
     0.236   0.000   0.236

There's a hierarchy of CHARSXP / STRSXP, so maybe that could be 
exploited in the mark phase?

Martin
#
Martin Morgan <mtmorgan at fhcrc.org> wrote:
Do you know if this is all happening inside a C function that could
handle disabling and enabling GC?  Or would it require doing this at
the R level?  For testing, I am turning GC on and off at the R level
but I am thinking about where we would need to check for failures to
re-enable GC.  I suppose one approach would be to provide an R wrapper
that would evaluate an expression with GC disabled using tryCatch to
guarantee that it would exit with GC enabled.
I get 6 seconds for the as.character(1:10000000) example with GC
disabled.
I haven't explored whether GC could be made smarter so that this isn't
as big a hit.  I don't really understand the GC process.

-- Dave
#
On 11/14/2011 01:12 PM, dhinds at sonic.net wrote:
Generally complicated operations across multiple function calls. 
Something like

   f = function() {
     state <- gcdisable(TRUE)
     on.exit(gcdisable(state))
     as.character(1:10000000)
   }

might be used.
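
For compiled code that wants the same guarantee as the on.exit() idiom,
one possible shape is a wrapper that saves the flag, runs the body, and
restores the flag whether the body returns or signals an error.  The
sketch below is a stand-alone toy: gc_disabled here is a local stand-in
for the proposed flag, and toy_error, with_gc_disabled, body_ok and
body_fails are hypothetical names.  It uses setjmp/longjmp to imitate a
non-local error exit and does not call R's actual error machinery.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

static int gc_disabled = 0;           /* stand-in for the proposed flag */
static jmp_buf *error_target = NULL;  /* innermost active handler */

/* Imitates an error: jump to the innermost handler, never return. */
static void toy_error(void)
{
    assert(error_target != NULL);
    longjmp(*error_target, 1);
}

typedef void (*toy_fn)(void *);

/* Run fn(arg) with the flag set; restore the previous flag on every
   exit path.  Returns 0 on normal return, 1 if fn called toy_error(). */
static int with_gc_disabled(toy_fn fn, void *arg)
{
    jmp_buf here;
    /* Both saved values are set before setjmp and never modified
       afterwards, so they are still valid after a longjmp. */
    jmp_buf *outer = error_target;
    const int old = gc_disabled;
    int failed;

    error_target = &here;
    if (setjmp(here) == 0) {
        gc_disabled = 1;
        fn(arg);
        failed = 0;
    } else {
        failed = 1;                   /* arrived here via toy_error() */
    }
    gc_disabled = old;                /* restored on both paths */
    error_target = outer;
    return failed;
}

/* Example body: records the flag value it observed into *arg. */
static void body_ok(void *arg)
{
    *(int *)arg = gc_disabled;
}

/* Example body: fails part-way through. */
static void body_fails(void *arg)
{
    (void)arg;
    toy_error();
}
```

Calling with_gc_disabled(body_fails, NULL) returns 1 and leaves
gc_disabled at its previous value, which is the property the on.exit()
version provides at the R level.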

Martin
#
Martin Morgan <mtmorgan at fhcrc.org> wrote:
Here is how I've implemented the core part of this (for discussion,
not a complete patch)

-- Dave

--- memory.c.orig       2011-04-04 15:05:04.000000000 -0700
+++ memory.c    2011-11-14 15:21:42.000000000 -0800
@@ -98,6 +98,7 @@
 */
 
 static int gc_reporting = 0;
+static int gc_disabled = 0;
 static int gc_count = 0;
 
 #ifdef TESTING_WRITE_BARRIER
@@ -2467,6 +2468,17 @@
     R_gc_internal(size_needed);
 }
 
+SEXP attribute_hidden do_gcdisable(SEXP call, SEXP op, SEXP args, SEXP rho)
+{
+    int i;
+    SEXP old = ScalarLogical(gc_disabled);
+    checkArity(op, args);
+    i = asLogical(CAR(args));
+    if (i != NA_LOGICAL)
+	gc_disabled = i;
+    return old;
+}
+
 #ifdef _R_HAVE_TIMING_
 double R_getClockIncrement(void);
 void R_getProcTime(double *data);
@@ -2541,6 +2553,14 @@
     SEXP first_bad_sexp_type_sexp = NULL;
     int first_bad_sexp_type_line = 0;
 
+    if (gc_disabled) {
+	AdjustHeapSize(size_needed);
+	if (NO_FREE_NODES() || VHEAP_FREE() < size_needed) {
+	    gc_disabled = 0;
+	    error("Heap adjustment failed -- enabling GC");
+	} else return;
+    }
+
  again:
 
     gc_count++;