
internal copying in R (soon to be released R-3.1.0)

5 messages · Jens Oehlschlägel, Simon Urbanek, Thomas Lumley

#
Dear core group,

Which operation in R is guaranteed to produce a true copy of an atomic vector, 
not just a second symbol pointing to the same shared memory?

y <- x[]
#?

y <- x
y[1] <- y[1]
#?

Is there any function that returns its argument as a non-shared atomic 
but only copies if the argument was shared?

Given an atomic vector x, what is the best official way to find out 
whether other symbols share the vector RAM? Querying NAMED() < 2 doesn't 
work because .Call sets sxpinfo_struct.named to 2. It even sets it to 2 
if the argument to .Call was a never-named expression!?

 > named(1:3)
[1] 2

And it seems to set it permanently, so that pure read access can trigger 
copy-on-modify:

 > x <- integer(1e8)
 > system.time(x[1]<-1L)
        user      system     elapsed
           0           0           0
 > system.time(x[1]<-2L)
        user      system     elapsed
           0           0           0

Having called .Call now leads to an unnecessary copy on the next assignment:

 > named(x)
[1] 2
 > system.time(x[1]<-3L)
        user      system     elapsed
        0.14        0.07        0.20
 > system.time(x[1]<-4L)
        user      system     elapsed
           0           0           0

This happens not only with user-written functions doing read access:

 > is.unsorted(x)
[1] TRUE
 > system.time(x[1]<-5L)
        user      system     elapsed
        0.11        0.09        0.21
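
The timings above are what a copy-on-modify rule would produce: an assignment is done in place while the vector is known to have a single owner, and pays for a full copy once the vector is marked as possibly shared. A minimal sketch of that rule in plain C follows. It is a toy model, not R's implementation: ToyVec, toy_alloc, toy_free and toy_assign are hypothetical names, and the `named` field only mimics what NAMED() reports.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an R integer vector; NOT R's real SEXP layout. */
typedef struct {
    int named;    /* 0, 1, or 2 ("two or more"), mimicking NAMED() */
    size_t n;     /* number of elements */
    int *data;    /* element storage */
} ToyVec;

/* Allocate a zero-filled vector that nothing refers to yet (named == 0). */
ToyVec *toy_alloc(size_t n) {
    ToyVec *v = malloc(sizeof *v);
    if (v == NULL) return NULL;
    v->data = calloc(n ? n : 1, sizeof *v->data);
    if (v->data == NULL) { free(v); return NULL; }
    v->named = 0;
    v->n = n;
    return v;
}

void toy_free(ToyVec *v) {
    if (v != NULL) { free(v->data); free(v); }
}

/* Model of `x[i] <- value`. Returns the vector that x is bound to afterwards.
 * If x may be shared (named >= 2), the whole vector is copied first and the
 * original is left untouched; otherwise the write happens in place.
 * The caller keeps ownership of the original (R would leave it to the GC). */
ToyVec *toy_assign(ToyVec *x, size_t i, int value) {
    ToyVec *target = x;
    if (x->named >= 2) {
        target = toy_alloc(x->n);
        if (target == NULL) return NULL;
        memcpy(target->data, x->data, x->n * sizeof *x->data);
    }
    target->data[i] = value;
    target->named = 1;   /* the result is bound to exactly one symbol */
    return target;
}
```

In this model the fresh copy comes back with `named == 1`, which is why only the first assignment after the vector was marked as shared is slow and the following one is fast again.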

Why don't you simply give package authors read access to 
sxpinfo_struct.named in .Call (without setting it to 2)? That would give 
us more control and also save some unnecessary copying. I guess that once 
R switches to reference counting, the preventive increment in .Call could 
not be kept anyhow.

Kind regards


Jens Oehlschlägel

P.S. please cc me in answers as I am not a member of r-devel


P.P.S. function named() was tentatively defined as follows:

named <- function(x)
   .Call("R_bit_named", x, PACKAGE="bit")

SEXP R_bit_named(SEXP x){
   SEXP ret_;
   PROTECT( ret_ = allocVector(INTSXP,1) );
   INTEGER(ret_)[0] = NAMED(x);
   UNPROTECT(1);
   return ret_;
}


 > version
                _
platform       x86_64-w64-mingw32
arch           x86_64
os             mingw32
system         x86_64, mingw32
status         Under development (unstable)
major          3
minor          1.0
year           2014
month          02
day            28
svn rev        65091
language       R
version.string R Under development (unstable) (2014-02-28 r65091)
nickname       Unsuffered Consequences
1 day later
#
On Mar 2, 2014, at 12:37 PM, Jens Oehlschlägel <jens.oehlschlaegel at truecluster.com> wrote:

            
None, there is no concept of "shared" memory at R level. You seem to be mixing C level API specifics and the R language. In the former duplicate() creates a new copy.

Assuming that you are talking about the C API, please consider reading about the concepts involved. .Call() doesn't set named to 2 at all - it passes whatever object is passed so it is the C code's responsibility to handle incoming objects according to the desired semantics (see the previous post here).

Again, you're barking up the wrong tree - .Call() doesn't bump NAMED at all - it simply passes the object:

#include <Rinternals.h>
SEXP nam(SEXP x) { return ScalarInteger(NAMED(x)); }
[1] 0
[1] 1
[1] 2

Cheers,
Simon
#
Thanks for answering Simon,

 > None, there is no concept of "shared" memory at R level. You seem to 
be mixing C level API specifics and the R language. In the former 
duplicate() creates a new copy.

I take this as evidence that calling duplicate() is the only way to make 
sure I have a non-shared object.

 > Assuming that you are talking about the C API, please consider 
reading about the concepts involved. .Call() doesn't set named to 2 at 
all - it passes whatever object is passed so it is the C code's 
responsibility to handle incoming objects according to the desired 
semantics (see the previous post here).

Well, I did read, for example "Writing R Extensions" (Version 3.1.0 
Under development (2014-02-28)) chapter "5.9.10 Named objects and 
copying" which says "Currently all arguments to a .Call call will have 
NAMED set to 2, and so users must assume that they need to be duplicated 
before alteration." This is consistent with what my test code shows: 
NAMED() in .Call always returns 2. Moreover, a .Call doing pure read 
access causes a delay on the next assignment, most likely due to a full 
vector copy, which suggests that .Call not only sets NAMED to 2 but also 
does not reset it once .Call terminates.

So what is needed to find NAMED(SEXP argument) < 2 during .Call?

Kind regards

Jens
#
Jens,
On Mar 3, 2014, at 3:35 PM, Jens Oehlschlägel <jens.oehlschlaegel at truecluster.com> wrote:

            
If NAMED > 0 then calling duplicate() is necessary to make sure you have a non-shared copy.

Matthew pointed out that line and I cannot shed more light on it, since it's not true - at least not currently.

It is not - you're not testing .Call() - you're testing the assignments in frames which cause additional bumps of NAMED. If you actually test .Call() you'll see what I have reported - .Call() itself does NOT affect NAMED.

Again, as I said earlier, you're on the wrong track here - .Call() doesn't touch it - it is left to the C code. Note that NAMED cannot be decremented (unless you use a ref counting version of R) once it reaches 2, since 2 means "two or more". The only time you can decrement it is if you are the owner that set it from 0 to 1.

Cheers,
Simon
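
Two rules from this reply can be sketched in plain C: NAMED saturates at 2 and can only be taken back from 1 to 0, and a non-shared object is obtained by duplicating whenever NAMED > 0. The sketch below is a toy model under those assumptions, not R's implementation or API: ToyVec, toy_bump_named, toy_release, toy_duplicate and toy_unshared are hypothetical names, and the model cannot check who the owner is, so that is left to the caller.

```c
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for an R integer vector; NOT R's real SEXP layout. */
typedef struct {
    int named;    /* 0, 1, or 2 ("two or more"), mimicking NAMED() */
    size_t n;     /* number of elements */
    int *data;    /* element storage */
} ToyVec;

/* Saturating increment: once 2 is reached the exact count is lost. */
void toy_bump_named(ToyVec *v) {
    if (v->named < 2) v->named++;
}

/* Undo a 0 -> 1 bump. Only valid for the owner that made that bump.
 * Returns 1 on success, 0 if the value is not 1 (2 can never go back). */
int toy_release(ToyVec *v) {
    if (v->named == 1) { v->named = 0; return 1; }
    return 0;
}

/* Full copy with a fresh, unreferenced header (named == 0). */
ToyVec *toy_duplicate(const ToyVec *v) {
    ToyVec *copy = malloc(sizeof *copy);
    if (copy == NULL) return NULL;
    copy->data = malloc((v->n ? v->n : 1) * sizeof *copy->data);
    if (copy->data == NULL) { free(copy); return NULL; }
    memcpy(copy->data, v->data, v->n * sizeof *copy->data);
    copy->named = 0;
    copy->n = v->n;
    return copy;
}

/* Return v itself if nothing else can refer to it, otherwise a copy. */
ToyVec *toy_unshared(ToyVec *v) {
    return v->named > 0 ? toy_duplicate(v) : v;
}
```

In real C code against R, the corresponding check would use NAMED() and duplicate() from Rinternals.h, as mentioned in the thread.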
14 days later