number of copies

4 messages · Terry Therneau, William Dunlap, Simon Urbanek

#
On Mon, 2011-10-03 at 12:31 -0400, Simon Urbanek wrote:
That is surprising.  This is not true of Splus.  Since Chambers mentions
it as a specific case as well (Programming with Data, p421) I assumed
that R would be at least as intelligent.  He also used the unset()
function to declare that something could be treated like double(n),
i.e., need no further copies. But that always looked like a dangerous
assertion to me and unset has disappeared, perhaps deservedly, from R.

Terry T.
#
Terry,
  In SV4 and the version of S+ based on it, unset(x) removed
the name "x" from the currently active evaluation frame and returned
the value of "x".  This decremented the reference count of the
data by 1.  The .C() and .Call() functions would copy an input only
if its reference count was positive.  If the data pointed to by "x"
was not also referenced by another name (in any frame) then
   .C("Cfunc", unset(x))
would not copy x, but otherwise it would.  Hence unset() was not dangerous
but it could be difficult to make sure it actually avoided a copy.
E.g., if you had
   f <- function(y) g(y)
   g <- function(x) .C("Cfunc", unset(x))
and called
   f(1:1e6)
then g's call to .C would copy the 1:1e6 because it was pointed to
by f's "y".  Changing f to
   f <- function(y) g(unset(y))
would let
   f(1:1e6)
avoid the copy, but if you did
   z <- 1:1e6
   f(z)
you would get the copy.
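
The same situation can be sketched in R, with the caveat that R has no unset() and this illustrates R's copy-on-modify behaviour, not the S+ reference counting described above. The names f, g, z and out are illustrative only. The tracemem() part runs only if this R build supports memory profiling, which is an assumption about the installation.

```r
# g() modifies its local x; f() just passes its argument along.
g <- function(x) { x[1] <- -1L; x }
f <- function(y) g(y)

z <- 1:10
out <- f(z)

# z is still referenced by the caller, so R duplicates the data
# before g() changes it; the caller's z is left intact.
stopifnot(identical(z, 1:10), out[1] == -1L)

# Optional: tracemem() reports duplications, but only on builds
# with memory profiling enabled.
if (capabilities("profmem")) {
  w <- c(1, 2, 3)
  invisible(tracemem(w))
  w2 <- w        # second reference to the same data, no copy yet
  w2[1] <- 0     # modifying w2 forces a duplication, which tracemem reports
  untracemem(w)
}
```

As in the z <- 1:1e6 case, the copy is made because another name still points at the data; how many copies a given call makes depends on the R version.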

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
#
On Oct 3, 2011, at 2:43 PM, Terry Therneau wrote:

Ok, let me clarify - there are two entirely separate issues: one issue is to avoid duplication of closure arguments in the duplicate() sense -- and that has to do with the reference count and closures in general. The other is duplication in the .C() call which is entirely separate.

If you call
.C("foo", double(n))
then the double(n) object doesn't get duplicated in the call, because it has no references [other than the frame], so there is no duplicate() call on it, but .C will still create a duplicate double* object to pass to the C function because of DUP=TRUE. Then a new R object is created for the result from that double* array.

For comparison, in .Call you would use allocVector(REALSXP, n) and thus avoid two copies.

You can try to avoid that with .C(..., DUP=FALSE), but you'll be treading on eggs: you can't check NAMED or call duplicate() yourself in C code for the cases that may need it, so you have to be very, very careful not to modify something inadvertently. This has been a known source of hard-to-track bugs, which is why DUP=FALSE is flagged as dangerous in bold face.

Cheers,
Simon
#
My thanks to Bill Dunlap and Simon Urbanek for clarifying many of the
details.  This gives me what I need to go forward.  

  Yes, I will likely convert more and more things to .Call over time.
This clearly gives the most control over excess memory copies. I am
getting more comments from people using survival on huge data sets so
memory usage is an issue I'll be spending more thought on.

  I'm not nearly as negative about .C as Simon.  Part of this is long
experience with C standalone code: one just gets used to the idea that
mismatched args to a subroutine are deadly. A test of all args to .C
(via insertion of a browser call) is part of initial code development.
Another is that constructing the return argument from .Call (a list with
names) is a bit of a pain.  So I will sometimes use dup=F. However, the
opinion of R core about .C is clear, so it behooves me to move away from
it.
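
A check of this kind can also be automated instead of done by hand in a browser() session. The sketch below is one possible way to do it, not the method used in survival; check_c_args is a hypothetical helper, and the expected modes are whatever the C routine's signature requires.

```r
# Hypothetical helper: confirm each argument has the storage mode the
# C routine expects before it is handed to .C().
check_c_args <- function(args, modes) {
  stopifnot(length(args) == length(modes))
  actual <- vapply(args, storage.mode, character(1))
  bad <- which(actual != modes)
  if (length(bad) > 0) {
    stop(sprintf("argument %d has storage mode '%s', expected '%s'",
                 bad[1], actual[bad[1]], modes[bad[1]]))
  }
  invisible(TRUE)
}

x <- c(1, 2, 3, 4, 5)

# 5L is an integer, matching a C parameter declared as int *
check_c_args(list(5L, x), c("integer", "double"))

# A bare 5 is a double; a routine expecting int * would misread it.
res <- tryCatch(check_c_args(list(5, x), c("integer", "double")),
                error = function(e) conditionMessage(e))
```

Wrapping each argument in as.integer() or as.double() at the call site is the other common safeguard. The two can be combined.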

Thanks again for the useful comments.

Terry Therneau