Martin Morgan commented in email to me that a change to any slot of an object that has other, large slot(s) does substantial computation, presumably from copying the whole object. Is there anything to be done?

There are in fact two possible changes, one automatic but only partial, the other requiring some action on the programmer's part. Herewith the first; I'll discuss the second in a later email.

Some context: The notion is that our object has some big data and some additional smaller things. We need to change the small things but would rather not copy the big things all the time. (With long vectors, this becomes even more relevant.) There are three likely scenarios: slots, attributes and named list components. Suppose our object has "little" and "BIG" encoded in one of these. The three relevant computations are:

x@little <- other
attr(x, "little") <- other
x$little <- other

It turns out that these are all similar in behavior with one important exception--fixing that is the automatic change.

I need to review what R does here. All these are replacement functions, `@<-`, `attr<-`, `$<-`. The evaluator checks before calling any replacement whether the object needs to be duplicated (in a routine EnsureLocal()). It does that by examining a special field that holds the reference status of the object. Some languages, such as Python (and S), keep reference counts for each object, de-allocating the object when the reference count drops back to zero. R uses a different strategy. Its NAMED() field is 0, 1 or 2 according to whether the object has been assigned never, once or more than once. The field is not a reference count and is not decremented--relevant for this issue. Objects are de-allocated only when garbage collection occurs and the object does not appear in any current frame or other context. (I did not write any of this code, so apologies if I'm misrepresenting it.)
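The copying John describes can be observed directly with tracemem(), where R has memory-profiling support. A minimal sketch (the class and slot names here are mine, chosen only to match the "BIG"/"little" discussion above):

```r
library(methods)

# A class with one big slot and one little slot (names are illustrative)
setClass("bigLittle", representation(BIG = "numeric", little = "numeric"))

x <- new("bigLittle", BIG = numeric(1e6), little = 1)
if (capabilities("profmem"))
  tracemem(x)    # announce each duplication of x, where available

x@little <- 2    # pre-3.0.0, this duplicates all of x, BIG included
x@little         # now 2
```

With tracemem() active, each slot replacement that triggers EnsureLocal()'s duplicate() prints a message, so you can see whether the whole object was copied.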
When any of these replacement operations first occurs for a particular object in a particular function call, it's very likely that the reference status will be 2 and EnsureLocal will duplicate it--all of it, regardless of which of the three forms is used.

Here the non-level-playing-field aspect comes in. `@<-` is a normal R function (a "closure") but the other two are primitives in the main code for R. Primitives have no frame in which arguments are stored. As a result the new version of x is normally stored with status 1. If one does a second replacement in the same call (in a loop, e.g.) that should not normally copy again. But the result of `@<-` will be an object from its frame and will have status 2 when saved, forcing a copy each time.

So the change, naturally, is that R 3.0.0 will have a primitive implementation of `@<-`. This has been implemented in r-devel (rev. 61544). Please try it out _before_ we issue that version, especially if you own a package that does things related to this question.

John

PS: Some may have noticed that I didn't mention a fourth approach: fields in a reference class object. The assumption was that we wanted classical, functional behavior here. Reference classes don't have the copy problem but don't behave functionally either. But that is in fact the direction for the other approach. I'll discuss that later, when the corresponding code is available.
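The loop case described above can be sketched as follows (the class and function names are mine; the point is only that with the primitive `@<-` of R >= 3.0.0, iterations after the first should not need to re-duplicate the object, whereas the closure version forced a copy on every iteration):

```r
library(methods)

setClass("bigLittle", representation(BIG = "numeric", little = "numeric"))

update_little <- function(x, n) {
  for (i in seq_len(n))
    x@little <- as.numeric(i)  # with the primitive `@<-`, only the first
                               # iteration should need to duplicate x
  x
}

y <- update_little(new("bigLittle", BIG = numeric(1e6), little = 0), 5)
y@little   # 5
```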
Small changes to big objects (1)
10 messages · John Chambers, Douglas Bates, Simon Urbanek +3 more
One point that came up in the CRAN checks, that I should have made explicit: The new version of `@<-` has to move from the "methods" package to "base". Therefore you should not (and cannot) explicitly import it from "methods"--that will fail in the import phase of installation. John
On 1/3/13 11:08 AM, John Chambers wrote:
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
3 days later
On 1/7/13 9:59 AM, Douglas Bates wrote:
Is there a difference in the copying behavior of

x@little <- other

and

x@little[] <- other
Not in the direction you were hoping, as far as I can tell. Nested replacement expressions in R and S are unraveled and done as repeated simple replacements. So either way you end up with, in effect,

x@little <- something

If x has >1 reference, as it tends to, EnsureLocal() will call duplicate(). I think the only difference is that your second form gets you to duplicate the little vector twice. ;-) John
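The unraveling John describes means the nested form is evaluated, in effect, as the following sequence of simple steps (a sketch of the semantics, not the evaluator's actual code; class and slot names are mine):

```r
library(methods)

setClass("bigLittle", representation(BIG = "numeric", little = "numeric"))
x <- new("bigLittle", BIG = numeric(100), little = c(1, 2, 3))

# The nested replacement  x@little[2] <- 99  is unraveled into, in effect:
tmp <- x@little    # extract the small vector
tmp[2] <- 99       # modify it (this may duplicate the small vector)
x@little <- tmp    # simple slot replacement; EnsureLocal may duplicate all of x

x@little           # now c(1, 99, 3)
```

So the nested form cannot avoid the whole-object duplication; it just adds a possible copy of the small vector on top.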
I was using the second form in (yet another!) modification of the internal representation of mixed-effects models in the lme4 package in the hopes that it would not trigger copying of the entire object. The object representing the model is quite large but the changes during iterations are to small vectors representing parameters and coefficients. On Thu, Jan 3, 2013 at 1:08 PM, John Chambers <jmc at r-project.org> wrote:
Hi All, I'm currently trying to write an S4 class that mimics a data.frame, but stores data on disc in HDF5 format. The idea is that the dataset is likely to be too large to fit into a standard desktop machine's memory, and by using subscripts, the user may load bits of the dataset at a time, e.g.:
myLargeData <- LargeData("/path/to/file")
mySubSet <- myLargeData[1:10, seq(1,15,by=3)]
I've therefore defined my LargeData class thus:
LargeData <- setClass("LargeData", representation(filename="character"))
setMethod("initialize", "LargeData", function(.Object, filename) {
  .Object@filename <- filename
  .Object
})
I've then defined the "[" method to call a C++ function (Rcpp), opening the HDF5 file, and returning the required rows/cols as a data.frame. However, what if the user wants to load the entire dataset into memory? Which method do I overload to achieve the following?
fullData <- myLargeData
class(fullData)
[1] "data.frame"

or apply transformations:

myEigen <- eigen(myLargeData)

In C++ I would normally overload the "double" or "float" operator to achieve this -- can I do the same thing in R? Thanks, Chris
Chris,
On Jan 7, 2013, at 6:23 PM, Chris Jewell wrote:
fullData <- myLargeData
class(fullData)
[1] "data.frame"
That makes no sense, since a <- b is not a transformation: "a" will have the same value as "b" by definition, and thus the same class. If you really meant fullData <- as.data.frame(myLargeData), then you just need to implement the as.data.frame() method for your class. Note, however, that a more common way to convert between a big-data reference and the native format in its entirety is simply myLargeData[] -- you may want to have a look at the (many) existing big data packages (AFAIR bigmemory uses a C++ back-end as well). Also note that indexing is tricky in R and easy to get wrong (remember: negative indices, indexing by name, etc.)
or apply transformations:
myEigen <- eigen(myLargeData)
In C++ I would normally overload the "double" or "float" operator to achieve this -- can I do the same thing in R?
Again, there is no implicit coercion in R (you cannot declare a variable's type in advance), so the C++ idiom you have in mind doesn't carry over -- the R equivalent is simply implementing an as.double() method, but I suspect that's not what you had in mind. For generics you can simply implement a method for your class (one that does the coercion, for example, or uses a more efficient way). If you cannot define a generic or don't want to write your own methods then it's a problem, because the only theoretical way is to subclass the numeric vector class, but that is not possible in R if you want to change the representation: evaluation falls through to the more efficient internal code too quickly (without extra dispatch). Cheers, Simon
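Both conventions Simon mentions -- whole-object extraction via x[] and explicit coercion via as.data.frame() -- can be sketched together. Here "data" is a hypothetical in-memory stand-in; a real class would hold only the file name and read from HDF5 inside the "[" method:

```r
library(methods)

# Hypothetical stand-in: "data" holds the dataset in memory; a real class
# would keep only the file name and read from HDF5 inside "["
setClass("HD5Proxy", representation(data = "data.frame"))

setMethod("[", "HD5Proxy", function(x, i, j, ..., drop = TRUE) {
  if (missing(i) && missing(j)) return(x@data)   # x[] extracts the whole dataset
  x@data[i, j, drop = drop]                      # otherwise subset rows/cols
})

setMethod("as.data.frame", "HD5Proxy",
          function(x, row.names = NULL, optional = FALSE, ...) x[])

p <- new("HD5Proxy", data = data.frame(a = 1:3, b = 4:6))
identical(as.data.frame(p), p[])   # both return the full data.frame
```

Defining as.data.frame() in terms of x[] keeps the two conversion paths consistent, so users get the same result whichever idiom they reach for.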
Thanks, Chris
To respond a little more directly to what you seem to be asking for: You would like an "automatic" conversion from your class (you don't give us its name; let's call it frameHDF for now) to "data.frame". In R (and in OOP generally) this sounds like inheritance: you want a frameHDF to be valid wherever a data.frame is wanted. _IF_ that is really a good idea, it can be done by using setIs() to define the correspondence. Then methods for data.frame objects (formal methods at least) will convert the argument automatically. (As noted previously, the simple assignment operation doesn't check types.)

However, this doesn't sound like such a good idea. The point of your class is to handle objects too large for ordinary data frames. Converting automatically sounds like a recipe for unpleasant surprises. A more cautious approach would be for the user to explicitly state when a conversion is needed. The general tool for defining this is setAs(), very similar to setIs() but not making things automatic; the user then says as(x, "data.frame") to get the conversion. The online documentation for these two functions says some more; also section 9.3 of my 2008 book referenced in the documentation.

One more comment. It is likely that your HDF5 objects have reference semantics--any changes made are seen by all the functions using that object. This is different from R's functional semantics as in S4 classes, and the differences can cause incorrect results in some situations. The more recent reference classes (?ReferenceClasses) were designed to mimic C++, Java, etc. style behavior. (They are used in Rcpp to import C++ classes.)

John
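The setAs() route John recommends can be sketched like this, using his placeholder name frameHDF (the "data" slot is my stand-in for the real HDF5 connection):

```r
library(methods)

# Hypothetical class; "data" stands in for the real HDF5 connection
setClass("frameHDF", representation(data = "data.frame"))

# Explicit, user-requested coercion: as(x, "data.frame")
setAs("frameHDF", "data.frame", function(from) from@data)

h <- new("frameHDF", data = data.frame(x = 1:2, y = 3:4))
class(as(h, "data.frame"))   # "data.frame"
```

Unlike setIs(), nothing happens implicitly here: the (possibly huge) conversion runs only when the user explicitly writes as(x, "data.frame").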
Okay, thanks for the input, All. I'd settled on the explicit coercion as.data.frame as well as the myObject[] syntax, which makes a lot of sense. I'd also like to implement an as.double() method. However, I'm having trouble embedding this into my package. In the R file I have:
setMethod("as.double", "HD5Proxy", function(x) as.double(x[]))
and exportMethods(as.double) in my NAMESPACE file. However, on checking the package, I get an error saying that the method "as.double" cannot be found. I noticed that the setMethod command actually returns a character vector containing "as.numeric", so I'm assuming this is the problem. How do I explicitly export my as.double method?
Thanks,
Chris
1 day later
"CJ" == Chris Jewell <chris.jewell at warwick.ac.uk>
    on Wed, 9 Jan 2013 13:28:49 +1300 writes:
CJ> Okay, thanks for the input, All. I'd settled on the explicit coercion as.data.frame as well as the myObject[] syntax which makes a lot of sense. I'd also like to implement an as.double() method. However, I'm having trouble embedding this into my package. In the R file I have:
CJ> setMethod("as.double", "HD5Proxy", function(x) as.double(x[]))
CJ> and exportMethods(as.double) in my NAMESPACES file. However, on checking the package, I get an error saying that method "as.double" cannot be found. I noticed that the setMethod command actually returns a character vector with "as.numeric", so I'm assuming this is the problem. How do I explicitly export my as.double method?
As you've noticed above, "R prefers as.numeric", and that is what you should be dealing with instead. It's unsatisfactory, I agree, but that's what it currently is:

setMethod("as.numeric", ....)

and exportMethods(as.numeric)

Martin