Assigning NULL to large variables is much faster than rm() - any reason why I should still use rm()?
On Sat, May 25, 2013 at 4:38 PM, Simon Urbanek
<simon.urbanek at r-project.org> wrote:
On May 25, 2013, at 3:48 PM, Henrik Bengtsson wrote:
Hi, in my packages/functions/code I tend to remove large temporary variables as soon as possible, e.g. large intermediate vectors used in iterations. I also sometimes do this to make it explicit in the source code when a temporary object is no longer needed. However, I did notice that this can add a noticeable overhead when the rest of the iteration step does not take that much time. Trying to speed this up, I first noticed that rm(list="a") is much faster than rm(a). While at it, I realized that for the purpose of keeping the memory footprint small, I can equally well reassign the variable to a small object (e.g. a <- NULL), which is significantly faster than using rm().
Yes, as you probably noticed, rm() is quite a complex function because it has to deal with different ways to specify its input etc.
When you remove that overhead (by calling .Internal(remove("a", parent.frame(), FALSE))), you get the same performance as the assignment.
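For reference, a small sketch of the three removal variants discussed so far. Calling .Internal() directly is not supported outside base R (R CMD check flags it), so this is only meant to show where the overhead sits. It assumes the call is made directly in the current environment, which is why environment() stands in for the parent.frame() that applies inside rm() itself.

```r
# Each variant removes the binding 'a' from the current environment.
a <- rnorm(10)
rm(a)                                         # symbol is captured and converted to a string
gone_symbol <- !exists("a", inherits = FALSE)

a <- rnorm(10)
rm(list = "a")                                # name is passed directly as a string
gone_string <- !exists("a", inherits = FALSE)

a <- rnorm(10)
.Internal(remove("a", environment(), FALSE))  # the internal call that rm() ends with
gone_internal <- !exists("a", inherits = FALSE)

c(gone_symbol, gone_string, gone_internal)
```

All three leave 'a' unbound; they differ only in how much argument processing happens before the internal call.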
If you really want to go overboard, you can define your own function:
#include <Rinternals.h>
SEXP rm_C(SEXP x, SEXP rho) { setVar(x, R_UnboundValue, rho); return R_NilValue; }
poof <- function(x) .Call(rm_C, substitute(x), parent.frame())
That will be faster than anything else (mainly because it avoids the trip through strings as it can use the symbol directly).
Thanks for this one. This is useful - I did try to follow where .Internal(remove(...)) leads, but got lost in the internal structures. Of course, I'd love to see such a function in 'base' itself. Having such a well-defined and narrow function for removing a variable in the current environment may also be useful for 'codetools'/'R CMD check', such that code inspection can detect undefined variables in the case where they used to be defined but have later been removed. Technically rm() allows for that too, but I can see how such a task quickly gets complicated when the arguments 'list', 'envir' and 'inherits' are involved.
But as Bill noted - in practice I'd recommend using either local() or functions to control the scope - using rm() or assignments seems too error-prone to me.
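A minimal sketch of the scoping approach, using a smaller matrix than the benchmark in this thread. The helper name row_sum is made up for this example.

```r
x <- matrix(rnorm(1e4), ncol = 100)

# Temporaries created inside local() are dropped when the block finishes.
colSum <- local({
  a <- x[, 1]        # temporary, visible only inside this block
  sum(a)
})

# The same idea with a helper function (hypothetical name).
row_sum <- function(m, k) {
  b <- m[k, ]        # temporary, dropped when the function returns
  sum(b)
}
rowSum <- row_sum(x, 1)

# Neither temporary was ever bound in the calling environment.
leaked <- c(a = exists("a", inherits = FALSE), b = exists("b", inherits = FALSE))
leaked
```

With this layout there is nothing to rm() or reassign, and any later use of 'a' or 'b' in the calling code fails just as it would after rm().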
I didn't mention it, but another reason I use rm() a lot is actually so that R can catch my programming mistakes (I'm maintaining 100,000+ lines of code), i.e. the opposite of being error-prone. For instance, by doing rm(tmp) as soon as possible, R will give me the run-time error "Error: object 'tmp' not found" in case I use it by mistake later on. As said above, codetools/'R CMD check' could potentially detect this already at check time. With tmp <- NULL I'll lose a bit of this protection, although another run-time error is likely to occur a bit later. Using local()/local functions is obviously an alternative for the above. Thanks both (and sorry about the game - though it was an entertaining one) /Henrik
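The difference in protection can be seen in a small sketch. The variable names other than tmp are made up for this example, and sum() is just one case where NULL passes through silently.

```r
tmp <- c(1, 2, 3)
total <- sum(tmp)
rm(tmp)                                   # 'tmp' is no longer needed

# Accidental reuse after rm(): R stops with an error, captured here as text.
msg_rm <- tryCatch(sum(tmp), error = function(e) conditionMessage(e))

tmp <- c(1, 2, 3)
tmp <- NULL                               # cleared by assignment instead

# Accidental reuse after 'tmp <- NULL': no error, sum(NULL) is simply 0.
val_null <- sum(tmp)

msg_rm
val_null
```

Whether the NULL case fails later depends on what the downstream code does with the value, which is why the error may show up far from the actual mistake or not at all.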
Cheers, Simon
SOME BENCHMARKS:
A toy example imitating an iterative algorithm with "large" temporary objects.
x <- matrix(rnorm(100e6), ncol=10e3)
t1 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  rm(a) # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  rm(b) # Not needed anymore
})
t2 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  rm(list="a") # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  rm(list="b") # Not needed anymore
})
t3 <- system.time(for (k in 1:ncol(x)) {
  a <- x[,k]
  colSum <- sum(a)
  a <- NULL # Not needed anymore
  b <- x[k,]
  rowSum <- sum(b)
  b <- NULL # Not needed anymore
})
t1
   user  system elapsed
   8.03    0.00    8.08
t1/t2
    user   system  elapsed
1.322900 0.000000 1.320261
t1/t3
    user   system  elapsed
1.715812 0.000000 1.662551
Is there a reason why I shouldn't assign NULL instead of using rm()? As far as I understand it, the garbage collector will be equally efficient cleaning out the previous object when using rm(a) or a <- NULL. Is there anything else I'm overlooking? Am I adding overhead somewhere else?
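One way to check the garbage-collection assumption is to attach finalizers to two tracked objects and release one with rm() and the other by assigning NULL. This is only a sketch: environments are used as stand-ins for the large vectors because reg.finalizer() accepts only environments and external pointers, and gc() is called explicitly just to make the result visible immediately.

```r
finalized <- c(rm = FALSE, null = FALSE)

a <- new.env()
reg.finalizer(a, function(e) finalized["rm"] <<- TRUE)

b <- new.env()
reg.finalizer(b, function(e) finalized["null"] <<- TRUE)

rm(a)            # binding removed, object becomes unreachable
b <- NULL        # binding kept, but the old object becomes unreachable too

invisible(gc())  # full collection; finalizers of unreachable objects run
finalized
```

In both cases the old object has no remaining references, so the collector treats them alike; the only difference is that 'b' still exists as a binding to NULL.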
/Henrik
PS. With the above toy example one can obviously be a bit smarter by using:
t4 <- system.time({
  for (k in 1:ncol(x)) {
    a <- x[,k]
    colSum <- sum(a)
    a <- x[k,]
    rowSum <- sum(a)
  }
  rm(list="a")
})
but that's not my point.
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel