I have a C function that performs a transformation
of a large integer matrix. The matrix may be of size 1.6GB,
so I can have only one copy in RAM and have to modify it
in place. This is possible using .Call and works fine. For
debugging, I need two copies of a smaller matrix and to modify only
one of them. This may be done, for example, by
A <- some integer matrix
B <- A + as.integer(0)
.Call("transform", A)
Then B is still the original and A is the transformed one.
Up to now I do not have any real problem with this, but there
are things which could help me, namely the following:
1. Is there a general way to duplicate an R object at the
level of the R language? I would use it instead of
B <- A + as.integer(0)
Searching the help pages and R-exts for the keywords
"duplicate" and "copy" did not find what I need.
2. Is there a way for a C function to verify that its
argument does not share data with other objects?
I included a test of NAMED(A) in my C function, but I do
not know whether it may be used for this purpose.
I did not find a way to generate an R object with
NAMED equal to 0. Is this possible at the level of the R language?
If yes, I will require that the argument passed to my
function be generated in this way, and the function
will refuse to modify it otherwise.
The documentation suggests calling "duplicate" if the
data are shared, but I cannot afford this due to memory.
So, I can only stop if NAMED is not 0.
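Such a guard might look like the following sketch (the entry point name "transform" and the negation loop are placeholders for the real transformation, not code from the thread; the point is only the refusal when NAMED reports possible sharing):

```c
#include <R.h>
#include <Rinternals.h>

/* Sketch only: refuse to modify the argument in place unless NAMED
   says no other binding can be sharing its data.  The transformation
   here (negating every element) is a placeholder. */
SEXP transform(SEXP a)
{
    if (NAMED(a) > 1)
        error("argument may be shared (NAMED = %d); not modifying in place",
              NAMED(a));
    int *x = INTEGER(a);
    int n = LENGTH(a);
    for (int i = 0; i < n; i++)
        x[i] = -x[i];
    return a;
}
```

(The test is NAMED > 1 rather than != 0, since per the R Internals manual only a value of 2 requires duplication before a change; whether 1 is safe enough is exactly the judgment the thread debates.)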
I would appreciate it if anybody could give me advice on the above.
Petr Savicky.
modifying large R objects in place
14 messages · Byron Ellis, Petr Savicky, Brian Ripley +3 more
For the most part, doing anything to an R object results in its duplication. You generally have to do a lot of work to NOT copy an R object.
On 9/26/07, Petr Savicky <savicky at cs.cas.cz> wrote:
[...]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Byron Ellis (byron.ellis at gmail.com) "Oook" -- The Librarian
On Wed, Sep 26, 2007 at 10:52:28AM -0700, Byron Ellis wrote:
For the most part, doing anything to an R object results in its duplication. You generally have to do a lot of work to NOT copy an R object.
Thank you for your response. Unfortunately, you are right. For example,
the allocated memory as reported by the top command on Linux may change
during a session as follows:
a <- matrix(as.integer(1),nrow=14100,ncol=14100) # 774m
a[1,1] <- 0 # 3.0g
gc() # 1.5g
In the current application, I modify the matrix only using my own C code
and only read it at the R level. So the above is not a big problem for me
(at least not now).
However, there is a related thing which could be a bug. The following
code determines the value of the NAMED field in the SEXP header of an object:
SEXP getnamed(SEXP a)
{
    SEXP out;
    PROTECT(out = allocVector(INTSXP, 1));
    INTEGER(out)[0] = NAMED(a);
    UNPROTECT(1);
    return out;
}
Now, consider the following session
u <- matrix(as.integer(1),nrow=5,ncol=3) + as.integer(0)
.Call("getnamed",u) # 1 (OK)
length(u)
.Call("getnamed",u) # 1 (OK)
dim(u)
.Call("getnamed",u) # 1 (OK)
nrow(u)
.Call("getnamed",u) # 2 (why?)
u <- matrix(as.integer(1),nrow=5,ncol=3) + as.integer(0)
.Call("getnamed",u) # 1 (OK)
ncol(u)
.Call("getnamed",u) # 2 (so, ncol does the same)
Is this a bug?
Petr Savicky.
In my previous email, I sent the example:
a <- matrix(as.integer(1),nrow=14100,ncol=14100) # 774m
a[1,1] <- 0 # 3.0g
gc() # 1.5g
This is misleading. The correct version is
a <- matrix(as.integer(1),nrow=14100,ncol=14100) # 774m
a[1,1] <- as.integer(0) # 1.5g
gc() # 774m
So the object duplicates, but nothing more. The main part of my previous email (the question concerning a possible bug in the behavior of nrow(a) and ncol(a)) remains open.
Petr Savicky.
Petr Savicky wrote:
[...]
Is this a bug?
No. It is an infelicity. The issues are that

1. length() and dim() call .Primitive directly, whereas nrow() and ncol() are "real" R functions;
2. NAMED records whether an object has _ever_ had 0, 1, or 2+ names.

During the evaluation of ncol(u), the argument x is evaluated, and at that point the object "u" is also named "x" in the evaluation frame of ncol(). A full(er) reference counting system might drop NAMED back to 1 when exiting ncol(), but currently R can only count up (and trying to find the conditions under which it is safe to reduce NAMED will make your head spin, believe me!)
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)          FAX: (+45) 35327907
1) You implicitly coerced 'a' to be numeric and thereby (almost) doubled its size: did you intend to? Does that explain your confusion?

2) I expected NAMED on 'a' to be incremented by nrow(a): here is my understanding. When you called nrow(a) you created another reference to 'a' in the evaluation frame of nrow. (At a finer level, you first created a promise to 'a', and then dim(x) evaluated that promise, which did SET_NAMED(<SEXP>, 2).) So NAMED(a) was correctly bumped to 2, and it is never reduced. More generally, any argument to a closure that actually gets used will get NAMED set to 2.

Having too high a value of NAMED could never be a 'bug'. See the explanation in the R Internals manual: when an object is about to be altered, the named field is consulted. A value of 2 means that the object must be duplicated before being changed. (Note that this does not say that it is necessary to duplicate, only that it should be duplicated whether necessary or not.)

3) Memory profiling can be helpful in telling you exactly what copies get made.
On Thu, 27 Sep 2007, Petr Savicky wrote:
[...]
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Thank you very much for all the explanations, in particular for pointing out that nrow is not a .Primitive, unlike dim, which is the reason for the difference in their behavior. (I raised the question of a possible bug due to this difference, not just because I was unsatisfied with nrow.) Also, thanks for:
On Thu, Sep 27, 2007 at 05:59:05PM +0100, Prof Brian Ripley wrote:
[...]
2) I expected NAMED on 'a' to be incremented by nrow(a): here is my understanding. When you called nrow(a) you created another reference to 'a' in the evaluation frame of nrow. (At a finer level you first created a promise to 'a' and then dim(x) evaluated that promise, which did SET_NAMED(<SEXP>) = 2.) So NAMED(a) was correctly bumped to 2, and it is never reduced. More generally, any argument to a closure that actually gets used will get NAMED set to 2.
[...] This explains a lot. I appreciate also the patch to matrix by Henrik Bengtsson, which saved me time formulating a further question just about this.

I do not know whether there is a reason to keep nrow, ncol not .Primitive, but if there is, the problem may be solved by rewriting them as follows:

nrow <- function(...) dim(...)[1]
ncol <- function(...) dim(...)[2]

At least in my environment, the new versions preserved NAMED == 1. A side effect is that this unifies the error messages generated by too many arguments to nrow(x) and dim(x). Currently:

a <- matrix(1:6,nrow=2)
nrow(a,a) # Error in nrow(a, a) : unused argument(s) (1:6)
dim(a,a) # Error: 2 arguments passed to 'dim' which requires 1

Maybe other solutions exist as well.

Petr Savicky.
Petr Savicky wrote:
[...] the problem may be solved by rewriting them as follows:
nrow <- function(...) dim(...)[1]
ncol <- function(...) dim(...)[2]
At least in my environment, the new versions preserved NAMED == 1.
Yes, but changing the formal arguments is a bit messy, is it not?
Presumably, nrow <- function(x) eval.parent(substitute(dim(x)[1])) works
too, but if the gain is important enough to warrant that sort of
programming, you might as well make nrow a .Primitive.

Longer-term, I still have some hope for better reference counting, but
the semantics of environments make it really ugly -- an environment can
contain an object that contains the environment, a simple example being

f <- function()
    g <- function() 0
f()

At the end of f(), we should decide whether to destroy f's evaluation
environment. In the present example, what we need to be able to see is
that this would remove all references to g and that the reference from g
to f can therefore be ignored. Complete logic for sorting this out is
basically equivalent to a new garbage collector, and one can suspect
that applying the logic upon every function return is going to be
terribly inefficient. However, partial heuristics might apply.
On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
[...]
nrow <- function(...) dim(...)[1]
ncol <- function(...) dim(...)[2]
At least in my environment, the new versions preserved NAMED == 1.
Yes, but changing the formal arguments is a bit messy, is it not?
Specifically for nrow, ncol, I think not much, since almost nobody needs to know (or even knows) that the name of the formal argument is "x".

However, there is another argument against the ... solution: it solves the problem only in the simplest cases like nrow, ncol, but is not usable in others, like colSums, rowSums. These functions also increase NAMED of their argument, although their output does not contain any reference to the original content of the argument. I think a systematic solution of this problem may be helpful. However, making these functions Internal or Primitive would not be good in my opinion. It is advantageous that these functions contain an R-level part which makes the basic decisions before a call to .Internal. If nothing else, this serves as a sort of documentation.

For my purposes, I replaced calls to "colSums" and "matrix" by the corresponding calls to .Internal in my script. The result is that now I can complete several runs of my calculation in a cycle instead of restarting R after each of the runs.

This leads me to a question. Some of the tests which I did suggest that gc() may not free all the memory, even if I remove all data objects by rm() before calling gc(). Is this possible, or must I have missed something?

A possible solution to the unwanted increase of NAMED due to temporary calculations could be to give the user the possibility to store the NAMED attribute of an object before a call to a function and restore it after the call. To use this, the user should be confident that no new reference to the object persists after the function is completed.
Presumably, nrow <- function(x) eval.parent(substitute(dim(x)[1])) works too, but if the gain is important enough to warrant that sort of programming, you might as well make nrow a .Primitive.
You are right. This indeed works.
Longer-term, I still have some hope for better reference counting, but the semantics of environments make it really ugly -- an environment can contain an object that contains the environment, a simple example being
f <- function()
    g <- function() 0
f()
[...]
I have to say that I do not understand the example very much. What is the input and output of f? Is g inside only defined, or also used?

Let me ask the following question. I assume that gc() scans the whole memory and determines, for each part of data, whether a reference to it still exists or not. In my understanding, this is equivalent to determining whether its NAMED may be dropped to zero or not. Structures for which this succeeds are then removed. Am I right? If yes, is it possible during gc() to also determine cases when NAMED may be dropped from 2 to 1? How much would this increase the complexity of gc()?

Thank you in advance for your kind reply.

Petr Savicky.
On Fri, 28 Sep 2007, Petr Savicky wrote:
On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote: [...]
nrow <- function(...) dim(...)[1]
ncol <- function(...) dim(...)[2]
At least in my environment, the new versions preserved NAMED == 1.
I believe this is a bug in the evaluation of ... arguments. The intent in the code is, I believe, that all promise evaluations result in NAMED==2 for safety. That may be overly conservative, but I would not want to change it without some very careful thought -- I prefer to wait a little longer for the right answer than to get a wrong one quickly.
Yes, but changing the formal arguments is a bit messy, is it not?
[...] This leads me to a question. Some of the tests which I did suggest that gc() may not free all the memory, even if I remove all data objects by rm() before calling gc(). Is this possible, or must I have missed something?
Not impossible, but very unlikely given the use gc gets. There are a few internal tables that are grown but not shrunk at the moment, but that should not usually cause much total growth. If you are looking at system memory use then that is a malloc issue -- there was a thread about this a month or so ago.
A possible solution to the unwanted increase of NAMED due to temporary calculations could be to give the user the possibility to store NAMED attribute of an object before a call to a function and restore it after the call. To use this, the user should be confident that no new reference to the object persists after the function is completed.
This would be too dangerous for general use. Some more structured approach may be possible. A related issue is that user-defined assignment functions always see a NAMED of 2 and hence cannot modify in place. We've been trying to come up with a reasonable solution to this, so far without success but I'm moderately hopeful.
[...] Is it possible during gc() to determine also cases when NAMED may be dropped from 2 to 1? How much would this increase the complexity of gc()?
Probably not impossible, but it would be a fair bit of work with probably not much gain: the NAMED values would still be high until the next gc of the appropriate level, which will probably be a fair time away, as an object being modified is likely to be older, but the interval in which there would be a benefit is short.

The basic functional model that underlies having the illusion of non-modifiable vector data does not fit all that well with an imperative style of modifying things in loops. It might be useful to bring in some constructs from functional programming that are designed to allow in-place modification to coexist with functional semantics. Probably a longer-term issue. For now there are limits to what we can reasonably, and maintainably, do in an interpreted R.

Having full reference counts might help, but might not, because of other costs involved (significant increases in cache misses in particular); in any case it would probably be easier to rewrite R from scratch than to retro-fit full reference counting to what we have, so I can't see it happening real soon. Also it doesn't help with many things, like user-level assignment: there really are two references at the key point in that case. With compilation it may be possible to do some memory use analysis and work out when it is safe to do destructive modification, but that is a fair way off as well.

Best,

luke
Luke Tierney
Chair, Statistics and Actuarial Science
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa Phone: 319-335-3386
Department of Statistics and Fax: 319-335-3017
Actuarial Science
241 Schaeffer Hall email: luke at stat.uiowa.edu
Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu
On 9/28/2007 7:45 AM, Petr Savicky wrote:
On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
...
[...] an environment can contain an object that contains the environment, a simple example being
f <- function()
    g <- function() 0
f()
[...]
I have to say that I do not understand the example very much. What is the input and output of f? Is g inside only defined or also used?
f has no input; its output is the function g, whose environment is the evaluation environment of f. g is never used, but it is returned as the value of f. Thus we have the loop: g refers to the environment; the environment contains g.

Even though the result of f() was never saved, two things (the environment and g) got created, and each would have a non-zero reference count. In a more complicated situation you might want to save the result of the function and then modify it. But because of the loop above, you would always think there's another reference to the object, so every in-place modification would require a copy first.

Duncan Murdoch
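The cycle described above is exactly what defeats naive reference counting. It can be mimicked in plain C with a toy counter (the struct and function names here are invented for illustration; this is not R's actual machinery):

```c
#include <stdlib.h>

/* Toy reference-counted object standing in for an R environment or
   closure (names invented for illustration). */
typedef struct obj {
    int refcount;
    struct obj *ref;    /* one outgoing reference is enough for a cycle */
} obj;

static obj *obj_new(void) {
    obj *o = malloc(sizeof *o);
    o->refcount = 1;    /* the creator holds one reference */
    o->ref = NULL;
    return o;
}

/* Record a reference from 'from' to 'to'. */
static void obj_set_ref(obj *from, obj *to) {
    from->ref = to;
    to->refcount++;
}

/* Drop one reference; free the object when the count reaches zero. */
static void obj_release(obj *o) {
    if (--o->refcount == 0) {
        if (o->ref) obj_release(o->ref);
        free(o);
    }
}

/* Model f(): an evaluation frame binding a closure g whose environment
   is that same frame.  Returns the refcount the frame is left with
   after the creator drops both of its own handles. */
static int leaked_refcount(void) {
    obj *env = obj_new();     /* f's evaluation frame */
    obj *g   = obj_new();     /* the closure g */
    obj_set_ref(env, g);      /* the frame binds g   -> g:   refcount 2 */
    obj_set_ref(g, env);      /* g's environment     -> env: refcount 2 */
    obj_release(g);           /* creator's handle to g gone */
    obj_release(env);         /* f() returns; the frame should be dead */
    /* Each object still holds the other's remaining reference, so
       neither count reached zero and neither was freed. */
    return env->refcount;
}
```

After both handles are dropped, each object still holds the other's last reference, so neither count reaches zero and neither is freed; a tracing collector like R's gc() has no such problem, which is why counts can only go up between collections.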
On Fri, 28 Sep 2007, Luke Tierney wrote:
On Fri, 28 Sep 2007, Petr Savicky wrote:
[...]
This leads me to a question. Some of the tests, which I did, suggest that gc() may not free all the memory, even if I remove all data objects by rm() before calling gc(). Is this possible or I must have missed something?
Not impossible, but very unlikely given the use gc gets. There are a few internal tables that are grown but not shrunk at the moment, but that should not usually cause much total growth. If you are looking at system memory use then that is a malloc issue -- there was a thread about this a month or so ago.
A likely explanation is lazy-loading. Almost all the package code is stored externally until used: 2.6.0 is better at not bringing in unused code. E.g. (2.6.0, 64-bit system)
gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 141320  7.6     350000 18.7   350000 18.7
Vcells 130043  1.0     786432  6.0   561893  4.3
for(s in search()) lsf.str(s)
gc()
         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 424383 22.7     531268 28.4   437511 23.4
Vcells 228005  1.8     786432  6.0   700955  5.4
'if I remove all data objects by rm()' presumably means clearing the
user workspace: there are lots of other environments containing objects
('data' or otherwise), many of which are needed to run R.
Otherwise the footer to every R-help message applies ....
A possible solution to the unwanted increase of NAMED due to temporary calculations could be to give the user the possibility to store NAMED attribute of an object before a call to a function and restore it after the call. To use this, the user should be confident that no new reference to the object persists after the function is completed.
This would be too dangerous for general use. Some more structured approach may be possible. A related issue is that user-defined assignment functions always see a NAMED of 2 and hence cannot modify in place. We've been trying to come up with a reasonable solution to this, so far without success but I'm moderately hopeful.
I am not persuaded that the difference between NAMED=1/2 makes much difference in general use of R, and I recall Ross saying that he no longer believed that this was a worthwhile optimization. It's not just 'user-defined' replacement functions, but also all the system-defined closures (including all methods for the generic replacement functions which are primitive) that are unable to benefit from it.
Duncan Murdoch wrote:
[...]
Thanks Duncan. It was way past my bedtime when I wrote that...
I had actually missed the point about the return value, but the point
remains even if you let f return something other than g: You get a
situation where the two objects both have a refcount of 1, so by
standard refcounting semantics neither can be removed even though
neither object is reachable.
Put differently, standard refcounting assumes that references between
objects of the language form a directed acyclic graph, but when
environments are involved, there can be cycles in R-like languages.
-p
On Fri, Sep 28, 2007 at 08:14:45AM -0500, Luke Tierney wrote:
[...]
[...] A related issue is that user-defined assignment functions always see a NAMED of 2 and hence cannot modify in place. We've been trying to come up with a reasonable solution to this, so far without success but I'm moderately hopeful.
If a user-defined function evaluates its body in its parent environment using the suggestion of Peter Dalgaard, eval.parent(substitute( .... )), then the NAMED attribute is not increased and the function may do in-place modifications.
On Fri, Sep 28, 2007 at 12:39:30AM +0200, Peter Dalgaard wrote:
Longer-term, I still have some hope for better reference counting, but the semantics of environments make it really ugly -- an environment can contain an object that contains the environment, a simple example being

f <- function()
    g <- function() 0
f()
On Fri, Sep 28, 2007 at 09:46:39AM -0400, Duncan Murdoch wrote:
[...]
Thank you very much for the example and explanation. I would
not have guessed that something like this is possible, but now I see
that it may, in fact, be quite common. For example,
something <- function()
{
a <- 1:5
b <- 6:10
c <- c("a","a","b","b","b")
mf <- model.frame(c ~ a + b)
mf
}
mf1 <- something()
e1 <- attr(attr(mf1,"terms"),".Environment")
mf2 <- eval(expression(mf),envir=e1)
e2 <- attr(attr(mf2,"terms"),".Environment")
print(identical(e1,e2)) # TRUE
seems to be a similar situation. Here, the references go in the
sequence mf1 -> e1 -> mf2 -> e1. I think that mf2 is, in fact, the
same object as mf1, but I do not know how to demonstrate this.
However, both mf1 and mf2 refer to the same environment, so
e1 -> mf2 -> e1 is a cycle for sure.
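One way to demonstrate it (not from the thread; it assumes an R build with memory profiling enabled, which is the default for the usual binary distributions): tracemem() returns the address of the object it marks, so mf1 and mf2 are the same object exactly when the two returned strings agree.

```r
# Assumes R was built with --enable-memory-profiling (the default for
# CRAN binaries). tracemem() returns the object's address as a string,
# so identical return values mean one and the same underlying object.
something <- function() {
  a <- 1:5
  b <- 6:10
  c <- c("a", "a", "b", "b", "b")
  mf <- model.frame(c ~ a + b)
  mf
}
mf1 <- something()
e1  <- attr(attr(mf1, "terms"), ".Environment")
mf2 <- eval(expression(mf), envir = e1)
identical(tracemem(mf1), tracemem(mf2))  # TRUE: same address
untracemem(mf1)
```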
On Fri, Sep 28, 2007 at 08:14:45AM -0500, Luke Tierney wrote:
If yes, is it possible during gc() to determine also cases, when NAMED may be dropped from 2 to 1? How much would this increase the complexity of gc()?
Probably not impossible, but it would be a fair bit of work for probably not much gain: the NAMED values would still be high until the next gc of the appropriate level, which is likely a fair time away, since an object being modified is likely to be older; so the interval in which there would be a benefit is short.
On Fri, Sep 28, 2007 at 04:36:40PM +0100, Prof Brian Ripley wrote:
[...]
On Fri, 28 Sep 2007, Luke Tierney wrote:
[...]
approach may be possible. A related issue is that user-defined assignment functions always see a NAMED of 2 and hence cannot modify in place. We've been trying to come up with a reasonable solution to this, so far without success but I'm moderately hopeful.
I am not persuaded that the difference between NAMED=1/2 makes much difference in general use of R, and I recall Ross saying that he no longer believed that this was a worthwhile optimization. It's not just 'user-defined' replacement functions, but also all the system-defined closures (including all methods for the generic replacement functions which are primitive) that are unable to benefit from it.
I am thinking about the following situation. The user creates a large
matrix A and then performs a sequence of operations on it. Some of
the operations scan the matrix in a read-only manner (calculating, e.g.,
some summaries); some are top-level commands which modify the
matrix itself. I do not argue that such a sequence of operations should
be done in place by default. However, I think that R should provide
tools which allow the user to do this in place, given some extra
work. If the matrix is really large, then in-place operations are not
only more space efficient, but also more time efficient.
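To make the cost concrete (an illustration added here, again assuming an R build with memory profiling): tracemem() reports the duplication that copy-on-modify triggers the moment a shared matrix is written to.

```r
# Illustration (assumes memory profiling is enabled): copy-on-modify
# duplicates the whole matrix as soon as a shared object is written to.
A <- matrix(0L, 1000, 1000)   # ~4 MB here; scale up to see the real cost
tracemem(A)                   # start reporting duplications of A
B <- A                        # no copy yet: two names, one block of data
B[1, 1] <- 1L                 # tracemem reports a duplication here
stopifnot(A[1, 1] == 0L)      # A is untouched by the write to B
untracemem(A)
```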
Using the information from the current thread, there are two
possible approaches to reach this.
1. The initial matrix should not be generated by the "matrix" function,
due to the observation by Henrik Bengtsson (this is the issue
with dimnames). The matrix may be initialized using e.g.
.Internal(matrix(data, nrow, ncol, byrow))
The matrix should not be scanned using an R function which evaluates
its body in its own environment. This includes the functions nrow, ncol,
colSums, rowSums and probably more. The matrix may be scanned by
functions which use eval.parent(substitute( .... )) and avoid giving
the matrix a new name. The user may prepare versions of nrow, ncol,
colSums, rowSums, etc. with this property.
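A hedged sketch of such wrappers (nrow2 and ncol2 are hypothetical names, not base R functions): the query is evaluated in the caller's frame, so the big matrix is never bound to a local formal argument, and dim() is a .Primitive, so no closure frame captures the matrix either. Whether a given wrapper actually avoids raising NAMED depends on the internals of the R version in use.

```r
# Hypothetical query wrappers in the spirit of point 1: evaluate the
# query in the caller's frame via eval.parent(substitute(...)), so the
# matrix is not matched to a formal argument of a closure. dim() is a
# .Primitive, which avoids an extra closure frame.
nrow2 <- function(x) eval.parent(substitute(dim(x)[1L]))
ncol2 <- function(x) eval.parent(substitute(dim(x)[2L]))

A <- matrix(1:6, nrow = 2)
nrow2(A)  # 2
ncol2(A)  # 3
```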
2. If the NAMED attribute of A may be decreased from 2 to 1 during an operation
similar to garbage collection (provided A is not in a reference cycle), then the
above approach may be combined also with operations which are themselves
in-place and read-only, but increase NAMED(A) as a side effect. In this
case, the user should explicitly invoke the "NAMED reduction" after such
operations. If the user has only a small number of large objects, then
gc() is faster than duplication of some of the large objects. So, I expect
that the "NAMED reduction" could also be more time efficient than some
of the unwanted duplications.
During the previous discussion, exact counting of references was
sometimes mentioned. So, I want to state explicitly that I do not
think it is a good idea. In my opinion, it is definitely not reasonable
now. I am very satisfied with the stability of R sessions, and this
would be in danger during a transition to full counting. Moreover,
I can imagine (I am not an expert on this) that the efficiency and
simplicity benefits of the guaranteed approximate counting outweigh the
disadvantages (a bit more duplication than necessary) in a typical R
session.
However, there are situations where the cost of duplication is
too high and the user knows about it in advance. In such situations,
having more tools for explicit control of duplication could help.
The tools may be, for example, a function which allows a simple
query of the NAMED status of a given object on the R level, and
modifications of some of the built-in functions to be more careful with
the NAMED attribute. A possible strengthening of gc() would, of course, be
very useful here. I am thinking about explicit use of it, not about
automatic runs. So, for safety reasons, the "NAMED reduction"
could be done by a different function, not the default gc() itself.
Petr Savicky.