Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?

On Sat, May 25, 2013 at 4:38 PM, Simon Urbanek wrote:
On May 25, 2013, at 3:48 PM, Henrik Bengtsson wrote:

Hi,

in my packages/functions/code I tend to remove large temporary
variables as soon as possible, e.g. large intermediate vectors used in
iterations.  I sometimes also have the habit of doing this to make it
explicit in the source code when a temporary object is no longer
needed.  However, I did notice that this can add a noticeable overhead
when the rest of the iteration step does not take that much time.

Trying to speed this up, I first noticed that rm(list="a") is much
faster than rm(a).  While at it, I realized that for the purpose of
keeping the memory footprint small, I can equally well reassign the
variable the value of a small object (e.g. a <- NULL), which is
significantly faster than using rm().

Yes, as you probably noticed, rm() is quite a complex function because it has to deal with different ways to specify input, etc.
When you remove that overhead (by calling .Internal(remove("a", parent.frame(), FALSE))), you get the same performance as the assignment.
If you really want to go overboard, you can define your own function:

SEXP rm_C(SEXP x, SEXP rho) { setVar(x, R_UnboundValue, rho); return R_NilValue; }
poof <- function(x) .Call(rm_C, substitute(x), parent.frame())

That will be faster than anything else (mainly because it avoids the trip through strings as it can use the symbol directly).
Thanks for this one.  This is useful - I did try to follow where
.Internal(remove, ...) leads, but got lost in the internal structures.

Of course, I'd love to see such a function in 'base' itself.  Having
such a well defined and narrow function for removing a variable in the
current environment may also be useful for 'codetools'/'R CMD check'
such that code inspection can detect undefined variables in the case
they used to be defined but later have been removed.  Technically rm()
allows for that too, but I can see how such a task quickly gets
complicated when arguments 'list', 'envir' and 'inherits' are
involved.
But as Bill noted - in practice I'd recommend using either local() or functions to control the scope - using rm() or assignments seems too error-prone to me.
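
A minimal sketch of the scoping approach, assuming a small stand-in matrix and a hypothetical helper col_and_row_sum() (neither is from the thread). Temporaries created inside a function body or a local() block go out of scope when the block exits, so no rm() call or NULL assignment is needed:

```r
x <- matrix(seq_len(16), ncol = 4)   # small stand-in for the large matrix

# Hypothetical helper: 'tmp_col' and 'tmp_row' exist only while it runs.
col_and_row_sum <- function(x, k) {
  tmp_col <- x[, k]
  tmp_row <- x[k, ]
  c(colSum = sum(tmp_col), rowSum = sum(tmp_row))
}
sums <- col_and_row_sum(x, 2)

# Same idea with local(): only the value of the last expression escapes.
col_sum_2 <- local({
  tmp_col <- x[, 2]
  sum(tmp_col)
})

# The temporaries never reached the calling scope.
leaked <- exists("tmp_col", inherits = FALSE)
```

Here `leaked` should be FALSE when the code runs in a scope that has no `tmp_col` of its own.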
I didn't mention it, but another reason I use rm() a lot is actually
so R can catch my programming mistakes (I'm maintaining 100,000+ lines
of code), i.e. the opposite to being error prone.  For instance, by
doing rm(tmp) as soon as possible, R will give me the run-time error
"Error: object 'tmp' not found" in case I use it by mistake later on.
As said above, codetools/'R CMD check' could potentially detect this
already at check time [above].  With tmp <- NULL I'll
lose a bit of this protection, although another run-time error is
likely to occur a bit later.
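
A small sketch of the difference described above, using two hypothetical functions f_rm() and f_null() (the names are illustrative only). It assumes no variable called 'tmp' exists in an enclosing environment, since rm() only removes the local binding and lookup would otherwise fall through to the outer one:

```r
# After rm(), accidental reuse of 'tmp' fails right away.
f_rm <- function() {
  tmp <- 1:10
  total <- sum(tmp)
  rm(tmp)                          # 'tmp' is now unbound in this frame
  tryCatch(sum(tmp),               # accidental reuse
           error = function(e) conditionMessage(e))
}

# After tmp <- NULL, the same mistake goes unnoticed at this point.
f_null <- function() {
  tmp <- 1:10
  total <- sum(tmp)
  tmp <- NULL                      # 'tmp' still exists, holding NULL
  sum(tmp)                         # accidental reuse silently returns 0
}

res_rm   <- f_rm()     # the error message as a character string
res_null <- f_null()   # a number, with no error raised
```

Whether a later error surfaces in the NULL case depends on what the subsequent code does with the value.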

local() and local functions are obviously alternatives for the above.

Thanks both (and sorry about the game - though it was an entertaining one)

/Henrik
Cheers,
Simon

SOME BENCHMARKS:
A toy example imitating an iterative algorithm with "large" temporary objects.

x <- matrix(rnorm(100e6), ncol=10e3)

t1 <- system.time(for (k in 1:ncol(x)) {
 a <- x[,k]
 colSum <- sum(a)
 rm(a) # Not needed anymore
 b <- x[k,]
 rowSum <- sum(b)
 rm(b) # Not needed anymore
})

t2 <- system.time(for (k in 1:ncol(x)) {
 a <- x[,k]
 colSum <- sum(a)
 rm(list="a") # Not needed anymore
 b <- x[k,]
 rowSum <- sum(b)
 rm(list="b") # Not needed anymore
})

t3 <- system.time(for (k in 1:ncol(x)) {
 a <- x[,k]
 colSum <- sum(a)
 a <- NULL # Not needed anymore
 b <- x[k,]
 rowSum <- sum(b)
 b <- NULL # Not needed anymore
})

t1
  user  system elapsed
  8.03    0.00    8.08
t1/t2
   user   system  elapsed
1.322900 0.000000 1.320261
t1/t3
   user   system  elapsed
1.715812 0.000000 1.662551

Is there a reason why I shouldn't assign NULL instead of using rm()?
As far as I understand it, the garbage collector will be equally
efficient cleaning out the previous object when using rm(a) or a <-
NULL.  Is there anything else I'm overlooking?  Am I adding overhead
somewhere else?
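
One way to probe the garbage-collection question is with finalizers. The sketch below is an illustration under stated assumptions, not a measurement of efficiency: it uses environments as stand-ins for large objects and a hypothetical helper make_finalizer(), and it relies on an explicit gc() call running pending finalizers, which R does not guarantee in general:

```r
collected <- character(0)

# Hypothetical helper: returns a finalizer that records its label when run.
make_finalizer <- function(label) {
  force(label)
  function(env) collected <<- c(collected, label)
}

a <- new.env()
reg.finalizer(a, make_finalizer("rm"))
b <- new.env()
reg.finalizer(b, make_finalizer("NULL"))

rm(a)        # drop the binding entirely
b <- NULL    # rebind the name to a small object

invisible(gc())   # both environments are now unreachable

sort(collected)   # labels of the objects that were finalized
```

If both labels show up, both idioms left the old object unreachable, which is all the collector needs.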

/Henrik

PS. With the above toy example one can obviously be a bit smarter by using:

t4 <- system.time({for (k in 1:ncol(x)) {
 a <- x[,k]
 colSum <- sum(a)
 a <- x[k,]
 rowSum <- sum(a)
}
rm(list="a")
})

but that's not my point.

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
