
modifying large R objects in place

14 messages · Byron Ellis, Petr Savicky, Brian Ripley +3 more

#
I have a C function, which performs a transformation
of a large integer matrix. The matrix may be of size 1.6GB,
so I can have only one copy in RAM and have to modify it
in place. This is possible using .Call and works fine. For
debugging, I need two copies of a smaller matrix and modify only
one of them. This may also be done, for example, by
  A <- some integer matrix
  B <- A + as.integer(0)
  .Call("transform", A)
Then, B is still the original and A is the transformed one.
Up to now, I do not have any real problem with this, but there
are a few things that could help me, namely the following:

1. Is there a general way to duplicate an R object at the
   level of the R language? I would use it instead of
      B <- A + as.integer(0)
   Searching the help pages and R-exts for the keywords
   "duplicate" and "copy" did not turn up what I need.

2. Is there a way for a C function to verify that its
   argument does not share data with other objects?
   I included a test of NAMED(A) in my C function, but I do
   not know whether it may be used for the purpose I need.
   I did not find a way to generate an R object with
   NAMED equal to 0. Is this possible at the level of the
   R language? If yes, I will require that the argument
   passed to my function is generated in this way, and the
   function will refuse to modify it otherwise.
   The documentation suggests calling "duplicate" if the
   data are shared, but I cannot afford this due to memory.
   So, I can only stop if NAMED is not 0.

I would appreciate it if anybody could give me advice on the above.

Petr Savicky.
#
For the most part, doing anything to an R object results in its
duplication. You generally have to do a lot of work to NOT copy an R
object.
On 9/26/07, Petr Savicky <savicky at cs.cas.cz> wrote:

#
On Wed, Sep 26, 2007 at 10:52:28AM -0700, Byron Ellis wrote:
Thank you for your response. Unfortunately, you are right. For example,
the allocated memory reported by the top command on Linux may change
during a session as follows:
  a <- matrix(as.integer(1),nrow=14100,ncol=14100) # 774m
  a[1,1] <- 0 # 3.0g
  gc() # 1.5g

In the current application, I modify the matrix only using my own C code
and only read it at the R level. So, the above is not a big problem for me
(at least not now).

However, there is a related thing that could be a bug. The following
code determines the value of the NAMED field in the SEXP header of an
object:

  SEXP getnamed(SEXP a)
  {
      SEXP out;
      PROTECT(out = allocVector(INTSXP, 1));
      INTEGER(out)[0] = NAMED(a);
      UNPROTECT(1);
      return(out);
  }

Now, consider the following session

  u <- matrix(as.integer(1),nrow=5,ncol=3) + as.integer(0)
  .Call("getnamed",u) # 1 (OK)

  length(u)
  .Call("getnamed",u) # 1 (OK)

  dim(u)
  .Call("getnamed",u) # 1 (OK)

  nrow(u)
  .Call("getnamed",u) # 2 (why?)

  u <- matrix(as.integer(1),nrow=5,ncol=3) + as.integer(0)
  .Call("getnamed",u) # 1 (OK)
  ncol(u)
  .Call("getnamed",u) # 2 (so, ncol does the same)

Is this a bug?

Petr Savicky.
#
In my previous email, I sent the example:
This is misleading. The correct version is
   a <- matrix(as.integer(1),nrow=14100,ncol=14100) # 774m
   a[1,1] <- as.integer(0) # 1.5g
   gc() # 774m

So, the object duplicates, but nothing more.

The main part of my previous email (question concerning
a possible bug in the behavior of nrow(a) and ncol(a))
remains open.

Petr Savicky.
#
Petr Savicky wrote:
No. It is an infelicity.

The issues are that
1. length() and dim() call .Primitive directly, whereas nrow() and 
ncol() are "real" R functions
2. NAMED records whether an object has _ever_  had  0, 1, or 2+ names

During the evaluation of ncol(u), the argument x is evaluated, and at
that point the object "u" is also named "x" in the evaluation frame of
ncol(). A full(er) reference counting system might drop NAMED back to 1
when exiting ncol(), but currently R can only count up (and trying to
find the conditions under which it is safe to reduce NAMED will make
your head spin, believe me!)

#
1) You implicitly coerced 'a' to be numeric and thereby (almost) doubled 
its size: did you intend to?  Does that explain your confusion?


2) I expected NAMED on 'a' to be incremented by nrow(a): here is my 
understanding.

When you called nrow(a) you created another reference to 'a' in the 
evaluation frame of nrow.  (At a finer level you first created a promise 
to 'a', and then dim(x) evaluated that promise, which did 
SET_NAMED(<SEXP>, 2).)  So NAMED(a) was correctly bumped to 2, and it is 
never reduced.

More generally, any argument to a closure that actually gets used will 
get NAMED set to 2.

Having too high a value of NAMED could never be a 'bug'.  See the 
explanation in the R Internals manual:

   When an object is about to be altered, the named field is consulted. A
   value of 2 means that the object must be duplicated before being
   changed.  (Note that this does not say that it is necessary to
   duplicate, only that it should be duplicated whether necessary or not.)


3) Memory profiling can be helpful in telling you exactly what copies get 
made.
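As a minimal sketch of what memory profiling can show (tracemem() is an editor's illustration here, assuming an R build with memory profiling enabled; the calls are guarded so the example also runs without it), one can watch exactly when copy-on-modify duplicates an object:

```r
a <- matrix(1L, nrow = 4, ncol = 4)
# tracemem() needs a build with memory profiling; guard it so the
# sketch also runs on builds without that support
try(tracemem(a), silent = TRUE)
b <- a             # no copy yet: both names refer to the same data
b[1, 1] <- 2L      # copy-on-modify: with profiling, a duplication is reported here
try(untracemem(a), silent = TRUE)
a[1, 1]            # still 1L: 'a' was untouched by the modification of 'b'
```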
On Thu, 27 Sep 2007, Petr Savicky wrote:

#
Thank you very much for all the explanations. In particular for pointing
out that nrow is not a .Primitive, unlike dim, which is the
reason for the difference in their behavior. (I raised the question
of a possible bug due to this difference, not just because I was
unsatisfied with nrow.) Also, thanks for:
On Thu, Sep 27, 2007 at 05:59:05PM +0100, Prof Brian Ripley wrote:
[...]
[...]

This explains a lot.

I appreciate also the patch to matrix by Henrik Bengtsson, which saved
me the time of formulating a further question about just this.

I do not know whether there is a reason for keeping nrow and ncol
non-.Primitive, but if there is one, the problem may be solved by
rewriting them as follows:

nrow <- function(...) dim(...)[1]
ncol <- function(...) dim(...)[2]

At least in my environment, the new versions preserved NAMED == 1.

This has the side effect of unifying the error messages generated
by passing too many arguments to nrow(x) and dim(x). Currently
  a <- matrix(1:6,nrow=2)
  nrow(a,a) # Error in nrow(a, a) : unused argument(s) (1:6)
  dim(a,a) # Error: 2 arguments passed to 'dim' which requires 1

Maybe other solutions exist as well.

Petr Savicky.
#
Petr Savicky wrote:
Yes, but changing the formal arguments is a bit messy, is it not?

Presumably, nrow <- function(x) eval.parent(substitute(dim(x)[1])) works 
too, but if the gain is important enough to warrant that sort of 
programming, you might as well make nrow a .Primitive.

Longer-term, I still have some hope for better reference counting, but 
the semantics of environments make it really ugly -- an environment can 
contain an object that contains the environment, a simple example being 

f <- function()
    g <- function() 0
f()

At the end of f(), we should decide whether to destroy f's evaluation 
environment. In the present example, what we need to be able to see is 
that this would remove all references to g and that the reference from g 
to f can therefore be ignored.  Complete logic for sorting this out is 
basically equivalent to a new garbage collector, and one can suspect 
that applying the logic upon every function return is going to be 
terribly inefficient. However, partial heuristics might apply.

#
On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
[...]
Specifically for nrow, ncol, I think not much, since almost nobody needs
to know (or even knows) that the name of the formal argument is "x".

However, there is another argument against the ... solution: it solves
the problem only in the simplest cases like nrow and ncol, but is not
usable in others, like colSums and rowSums. These functions also increase
the NAMED of their argument, although their output contains no
reference to the original contents of the argument.

I think that a systematic solution to this problem may be helpful.
However, making these functions Internal or Primitive would
not be good in my opinion. It is advantageous that these functions
contain an R-level part, which makes the basic decisions before a call
to .Internal. If nothing else, this serves as a sort of documentation.

For my purposes, I replaced calls to "colSums" and "matrix" by the
corresponding calls to .Internal in my script. The result is that
now I can complete several runs of my calculation in a cycle instead
of restarting R after each of the runs.

This leads me to a question. Some of the tests I did suggest
that gc() may not free all the memory, even if I remove all data
objects with rm() before calling gc(). Is this possible, or must I have
missed something?

A possible solution to the unwanted increase of NAMED due to temporary
calculations could be to give the user the possibility
to store the NAMED field of an object before a call to a function
and restore it after the call. To use this, the user would have to be
confident that no new reference to the object persists after the
function completes.
You are right. This indeed works.
I have to say that I do not understand the example very well.
What are the input and output of f? Is g only defined inside, or
also used?

Let me ask the following question. I assume that gc() scans the whole
memory and determines for each piece of data whether a reference
to it still exists. In my understanding, this is equivalent to
determining whether its NAMED may be dropped to zero.
Structures for which this succeeds are then removed. Am I right?
If yes, is it possible during gc() to also determine the cases
where NAMED may be dropped from 2 to 1? How much would this increase
the complexity of gc()?

Thank you in advance for your kind reply.

Petr Savicky.
#
On Fri, 28 Sep 2007, Petr Savicky wrote:

I believe this is a bug in the evaluation of ... arguments.  The
intent in the code is, I believe, that all promise evaluations result in
NAMED==2 for safety.  That may be overly conservative, but I would not
want to change it without some very careful thought -- I prefer to
wait a little longer for the right answer than to get a wrong one
quickly.
Not impossible, but very unlikely given the use gc gets. There are a
few internal tables that are grown but not shrunk at the moment, but
that should not usually cause much total growth.  If you are looking at
system memory use, then that is a malloc issue -- there was a thread
about this a month or so ago.
This would be too dangerous for general use. Some more structured
approach may be possible. A related issue is that user-defined
assignment functions always see a NAMED of 2 and hence cannot modify
in place. We've been trying to come up with a reasonable solution to
this, so far without success but I'm moderately hopeful.
Probably not impossible, but it would be a fair bit of work with
probably not much gain: the NAMED values would still be high until the
next gc of the appropriate level, which will probably be a fair time
away, as an object being modified is likely to be older, so the interval
in which there would be a benefit is short.

The basic functional model that underlies having the illusion of
non-modifiable vector data does not fit all that well with an
imperative style of modifying things in loops. It might be useful to
bring in some constructs from functional programming that are designed
to allow in-place modification to coexist with functional semantics.
Probably a longer-term issue.

For now there are limits to what we can reasonably, and maintainably,
do in an interpreted R.  Having full reference counts might help, but
might not, because of other costs involved (significant increases in
cache misses in particular); in any case it would probably be
easier to rewrite R from scratch than to retro-fit full reference
counting to what we have, so I can't see it happening real soon. Also it
doesn't help with many things, like user-level assignment: there
really are two references at the key point in that case.  With
compilation it may be possible to do some memory use analysis and work
out when it is safe to do destructive modification, but that is a fair
way off as well.

Best,

luke

#
On 9/28/2007 7:45 AM, Petr Savicky wrote:
...
f has no input; its output is the function g, whose environment is the 
evaluation environment of f.  g is never used, but it is returned as the 
value of f.  Thus we have the loop:

g refers to the environment.
the environment contains g.

Even though the result of f() was never saved, two things (the 
environment and g) got created and each would have non-zero reference 
count.

In a more complicated situation you might want to save the result of the 
function and then modify it.  But because of the loop above, you would 
always think there's another reference to the object, so every in-place 
modification would require a copy first.
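The loop can be made visible from the R level with a small sketch (here f is written to return g explicitly so the two pieces can be inspected from outside; the names are illustrative only):

```r
f <- function() {
  g <- function() 0
  g   # return g so the cycle can be inspected from outside
}
h <- f()
env <- environment(h)               # f's evaluation environment
identical(env$g, h)                 # TRUE: the environment contains g
identical(environment(env$g), env)  # TRUE: g refers back to that environment
```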

Duncan Murdoch
#
On Fri, 28 Sep 2007, Luke Tierney wrote:

[...]
A likely explanation is lazy-loading.  Almost all the package code is 
stored externally until used: 2.6.0 is better at not bringing in unused 
code.  E.g. (2.6.0, 64-bit system)
            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 141320  7.6     350000 18.7   350000 18.7
   Vcells 130043  1.0     786432  6.0   561893  4.3

            used (Mb) gc trigger (Mb) max used (Mb)
   Ncells 424383 22.7     531268 28.4   437511 23.4
   Vcells 228005  1.8     786432  6.0   700955  5.4

'if I remove all data objects by rm()' presumably means clearing the 
user workspace: there are lots of other environments containing objects 
('data' or otherwise), many of which are needed to run R.

Otherwise the footer to every R-help message applies ....
I am not persuaded that the difference between NAMED=1/2 makes much 
difference in general use of R, and I recall Ross saying that he no longer 
believed that this was a worthwhile optimization.  It's not just 
'user-defined' replacement functions, but also all the system-defined 
closures (including all methods for the generic replacement functions 
which are primitive) that are unable to benefit from it.
#
Duncan Murdoch wrote:
Thanks Duncan. It was way past my bedtime when I wrote that...

I had actually missed the point about the return value, but the point 
remains even if you let f return something other than g: you get a 
situation where the two objects both have a refcount of 1, so by 
standard refcounting semantics neither can be removed, even though 
neither object is reachable.

Put differently, standard refcounting assumes that references between 
objects of the language form a directed acyclic graph, but when 
environments are involved, there can be cycles in R-like languages.

    -p

#
On Fri, Sep 28, 2007 at 08:14:45AM -0500, Luke Tierney wrote:
[...]
If a user-defined function evaluates its body in its parent environment
using Peter Dalgaard's suggestion eval.parent(substitute( .... )),
then the NAMED attribute is not increased and the function may do
in-place modifications.
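A minimal sketch of this pattern (the wrapper names my_nrow and my_ncol are illustrative, not from base R): the expression is evaluated in the caller's frame, so the matrix never acquires a second name inside the closure.

```r
# hypothetical wrappers using the eval.parent(substitute(...)) pattern:
# dim(x) is computed in the caller's frame, so the argument is not
# bound to a new name "x" in an evaluation frame of the wrapper
my_nrow <- function(x) eval.parent(substitute(dim(x)[1]))
my_ncol <- function(x) eval.parent(substitute(dim(x)[2]))

a <- matrix(1:6, nrow = 2)
my_nrow(a)   # 2
my_ncol(a)   # 3
```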
On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:

On Fri, Sep 28, 2007 at 09:46:39AM -0400, Duncan Murdoch wrote:
Thank you very much for the example and explanation. I would not have
guessed that something like this is possible, but now I see that
it may, in fact, be quite common. For example
  something <- function()
  {
      a <- 1:5
      b <- 6:10
      c <- c("a","a","b","b","b")
      mf <- model.frame(c ~ a + b)
      mf
  }
  mf1 <- something()
  e1 <- attr(attr(mf1,"terms"),".Environment")
  mf2 <- eval(expression(mf),envir=e1)
  e2 <- attr(attr(mf2,"terms"),".Environment")
  print(identical(e1,e2)) # TRUE
seems to be a similar situation. Here, the references go in the
sequence mf1 -> e1 -> mf2 -> e1. I think that mf2 is already
the same as mf1, but I do not know how to demonstrate this.
However, both mf1 and mf2 refer to the same environment, so
e1 -> mf2 -> e1 is a cycle for sure.
On Fri, Sep 28, 2007 at 08:14:45AM -0500, Luke Tierney wrote:

On Fri, Sep 28, 2007 at 04:36:40PM +0100, Prof Brian Ripley wrote:
[...]
[...]
I am thinking about the following situation. The user creates a large
matrix A and then performs a sequence of operations on it. Some of
the operations scan the matrix in a read-only manner (calculating e.g.
some summaries); some operations are top-level commands that modify the
matrix itself. I do not argue that such a sequence of operations should
be done in place by default. However, I think that R should provide
tools that allow doing this in place if the user does some extra
work. If the matrix is really large, then in-place operations are not
only more space efficient, but also more time efficient.

Using the information from the current thread, there are two
possible approaches to reach this.

1. The initial matrix should not be generated by the "matrix" function,
   due to the observation by Henrik Bengtsson (this is the issue
   with dimnames). The matrix may be initialized using e.g.
     .Internal(matrix(data, nrow, ncol, byrow))

   The matrix should not be scanned using an R function that evaluates
   its body in its own environment. This includes the functions nrow,
   ncol, colSums, rowSums and probably more. The matrix may be scanned
   by functions that use eval.parent(substitute( .... )) and avoid
   giving the matrix a new name. The user may prepare versions of nrow,
   ncol, colSums, rowSums, etc. with this property.

2. If the NAMED attribute of A may be decreased from 2 to 1 during an
   operation similar to garbage collection (if A is not in a reference
   cycle), then the above approach may also be combined with operations
   that themselves work in place and read-only, but increase NAMED(A) as
   a side effect. In this case, the user should explicitly invoke the
   "NAMED reduction" after such operations. If the user has only a small
   number of large objects, then gc() is faster than duplication of some
   of the large things. So, I expect that the "NAMED reduction" could
   also be more time efficient than some of the unwanted duplications.

During the previous discussion, exact counting of references was
sometimes mentioned. So, I want to state explicitly that I do not
think it is a good idea. In my opinion, it is definitely not reasonable
now. I am very satisfied with the stability of R sessions, and this
would be in danger during a transition to full counting. Moreover,
I can imagine (I am not an expert on this) that the efficiency and
simplicity benefits of the guaranteed approximate counting outweigh the
disadvantages (a few more duplications than necessary) in a typical R
session.

However, there are situations where the cost of duplication is
too high and the user knows about it in advance. In such situations,
having more tools for explicit control of duplication could help.
The tools may be, for example, a function that allows a simple
query of the NAMED status of a given object at the R level, and
modifications of some of the built-in functions to be more careful with
the NAMED attribute. A possible strengthening of gc() would, of course,
be very useful here. I am thinking of explicit use of it, not of
automatic runs. So, for safety reasons, the "NAMED reduction"
could be done by a different function, not the default gc() itself.

Petr Savicky.