
Issue with gc() on Ubuntu 20.04

3 messages · John Logsdon, Ivan Krylov

#
Folks

I have come across an issue with gc() hogging the processor according to 
Rprof.

Platform is Ubuntu 20.04 all up to date
R version 4.3.1
libraries: survival, MASS, gtools and openxlsx.

With default gc.auto options, the profiler notes the garbage collector 
as self.pct 99.39%.

So I have tried switching it off using options(gc.auto=Inf) in the R 
session before running my program using source().

This lowered self.pct to 99.36.  Not much there.

After some pondering, I added an options(gc.auto=Inf) at the beginning 
of each function, not resetting it at exit, but expecting the offending 
function(s) to plead guilty.

Not so, although it did lower the gc() time to 95.84%.

This was on a 16-core Threadripper 1950X box, so I was intending to use 
the parallel package, but I tried it on my lowly Windows box, which is 
years old, and got it down to 88.07%.

The only thing I can think of is that there are quite a lot of cases 
where a function is generated on the fly as in:

eval(parse(text = paste("dprob <- function(x,l,s){",
                        dist.functions[2,][dist.functions[1,]==distn],
                        "(x,l,s)}", sep = "")))

I haven't added the options to any of these.

The highest time used by any of my functions is 0.05% - the rest is 
dominated by gc().

There may not be much point in parallelising the code until I can reduce 
the garbage collection.

I am not short of memory and would like to disable it fully but despite 
adding to all routines, I haven't managed to do this yet.
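
Since options() accepts arbitrary names without complaint, one way to check whether a setting changed anything is to measure the collector's time directly rather than infer it from the profile. The following is a minimal sketch using gc.time(), which reports the cumulative time R has spent in garbage collection; the workload shown is a hypothetical stand-in for the real program.

```r
# Snapshot cumulative GC time before the workload.
gc_before <- gc.time()

# Hypothetical workload: replace with source("...") of the real program.
t_total <- system.time({
  x <- lapply(1:2000, function(i) rnorm(1e3))
})

# Time spent in the collector during the workload
# (user CPU, system CPU, elapsed are the first three entries).
gc_spent <- gc.time() - gc_before

print(gc_spent[1:3])
print(t_total[1:3])
```

Comparing the two printed vectors before and after a change gives a figure that does not depend on how the profiler attributes samples.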

Can anyone advise me?

And why is the Linux version so much worse than Windows?

TIA
#
On Sun, 27 Aug 2023 19:54:23 +0100
John Logsdon <j.logsdon at quantex-research.com> wrote:

            
Does the Windows box have the same version of R on it?

The eval(parse(...)) construct isn't very idiomatic. If you need dprob
to call the function named in
dist.functions[2,][dist.functions[1,]==distn], wouldn't it be easier
for R to assign that function straight to dprob?

dprob <- get(dist.functions[2,][dist.functions[1,]==distn])

This way, you avoid the need to parse the code, which is typically not
the fastest part of a programming language.

(Generally in R and other programming languages with recursive data
structures, storing variable names in other variables is not very
efficient. Why not put functions directly into a list?)
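
As a rough illustration of both suggestions, here is a small sketch. The contents of dist.functions are an assumption (the thread only shows how it is indexed), and dist.list is a hypothetical name for the list-based alternative.

```r
# Assumed shape: row 1 holds distribution names, row 2 the matching
# density function names. The actual table is not shown in the thread.
dist.functions <- rbind(c("normal", "logistic"),
                        c("dnorm",  "dlogis"))
distn <- "logistic"

# Look the function up by name once; no parsing involved.
dprob <- get(dist.functions[2, ][dist.functions[1, ] == distn],
             mode = "function")

# Alternative: keep the functions themselves in a named list
# (dist.list is a hypothetical name).
dist.list <- list(normal = dnorm, logistic = dlogis)
dprob2 <- dist.list[[distn]]

# Both are called positionally as (x, l, s), as in the original code.
print(dprob(0, 0, 1))
print(dprob2(0, 0, 1))
```

With the list, a misspelt distribution name yields NULL at lookup time instead of a parse or evaluation error later on.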

Rprof() samples the whole call stack. Can you find out which functions
result in a call to gc()? I haven't experimented with a wide sample of
R code, but I don't usually encounter gc() as a major entry in my
Rprof() outputs.
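
One possible way to answer that question is to read the raw Rprof() output, where each sample is a line of quoted function names with the innermost call first. The sketch below tabulates the immediate callers of a target function; callers_of is a hypothetical helper, and the sample lines are made up for illustration.

```r
# Tabulate the immediate callers of `target` in Rprof() stack lines.
# Each line lists quoted function names, innermost call first.
callers_of <- function(lines, target = "gc") {
  stacks <- regmatches(lines, gregexpr('"[^"]+"', lines))
  callers <- unlist(lapply(stacks, function(s) {
    s <- gsub('"', "", s, fixed = TRUE)
    i <- match(target, s)                  # innermost occurrence
    if (is.na(i) || i == length(s)) return(NULL)
    s[i + 1]                               # next entry is the caller
  }))
  sort(table(callers), decreasing = TRUE)
}

# Made-up sample lines in the Rprof() layout (header, then stacks).
example_lines <- c('sample.interval=20000',
                   '"gc" "f" "main" ',
                   '"gc" "f" "main" ',
                   '"gc" "g" "main" ',
                   '"h" "main" ')
res <- callers_of(example_lines)
print(res)

# Usage with a real profile (file name is hypothetical):
#   Rprof("prof.out"); source("program.R"); Rprof(NULL)
#   callers_of(readLines("prof.out"))
```

summaryRprof()$by.total gives a similar picture at a coarser level, since a function that triggers gc() accumulates its time in total.time.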
#
On 27-08-2023 21:02, Ivan Krylov wrote:
Yes, they are both 4.3.1.

Agreed, but this statement and other similar ones are only assigned once 
in an outer loop.

From the first table, removing all the system functions, it suggests 
that the function do.combx() is mainly guilty.  I have recoded that and 
gc() no longer appears - as it shouldn't with it switched off!  One 
difference was that the new code used the built-in combn function while 
the old code used gtools::combinations.  I need gtools::permutations 
elsewhere but that is not time critical.
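
The body of do.combx() is not shown in the thread, so the following is only a sketch of the kind of substitution described. As far as I know, gtools::combinations(n, r) returns one combination per row, whereas combn(n, r) returns one per column, so a transpose gives the same layout.

```r
n <- 5
r <- 3

# combn() returns an r x choose(n, r) matrix, one combination per column;
# transposing gives one combination per row.
combs <- t(combn(n, r))
print(dim(combs))

# combn() can also apply a function to each combination directly,
# which avoids building the full matrix first.
sums <- combn(n, r, FUN = sum)
print(sums)
```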

Thanks Ivan for making me think!