Skip to content
Prev 4257 / 21312 Next

[Bioc-devel] pointer and big matrix in R

On 04/12/2013 12:08 PM, Nicolas Servant wrote:
Practically the answer is no, there is no easy way to create a pointer between A 
and B.

## Background

R as copy-on-change semantics so actually

     x <- seq(1, 10)
     y <- x

has associated the symbol `y` with the address pointed to by `x` without any 
memory copy (in this sense `y` is a pointer to `x`), but at the same time marked 
the memory location at `x` such that if either `x` or `y` is changed, a memory 
copy occurs. S4 is implemented in a way that makes it difficult to exploit this 
copy-on-change efficiency.

The 'classic' approach to creating reference semantics (where two symbols 
actually reference the same data) has been to assign a variable to an 
environment, and to pass the environment around. An R environment has 
reference-based semantics.

     e = new.env(parent=emptyenv())
     e[["x"]] = seq(1, 10)
     f = e ## changes to e[["x"]] are reflected in f[["x"]] and vice versa

This approach has been made more formal with so-called reference classes, 
`?ReferenceClasses`

    .R = setRefClass("R", fields=list(x="matrix"))
     r = .R$new(x=matrix(1:10, 5))
     s = r
     r$x[] = log(r$x[])  ## changes to r$x change s$x

The main draw-back to reference classes is that changing `r` surprises the R 
user with its 'action-at-a-distance' -- `s` also changes. The user really does 
not expect the value of one R variable to change when the value of another R 
variable is changed. Also while the `$` accessor and contained methods of 
reference classes are more familiar to non-R programmers, they are not 
necessarily the way R users are expecting to interact with objects (e.g., the S3 
object `m = matrix()` is queried with `dim(m)`, not `m$dim()`).

## Implementing copy-on-change reference classes

A solution is to implement a reference class that nonetheless has copy-on-change 
semantics, so that one gets the benefit of pass-by-reference without the 
surprise of action-at-a-distance. The implementation can be done so that the 
user interacts with R objects in a way that is familiar to R users.

The design pattern is actually quite simple; the following uses the convention 
that `.` prefixes functions that are not meant to be called directly by the 
user. We start with a function that makes a _shallow_ copy of a reference class, 
then updates an arbitrary number of reference class fields.

     .clone <-
         function(x, ...)
     {
         args <- list(...)
         x <- x$copy()
         for (nm in names(args))
             x$field(nm, args[[nm]])
         x
     }

The `x$copy()` creates a new instance of the reference class, with references to 
the original fields. The `for` loop replaces the fields that we've provided, but 
only in our new copy.

We use `.clone` whenever we want to modify a field of `x`; only the modified 
fields are copied.

## Example

Here's a reference class representation of your HTCexp class as simplified in 
your email, with three fields.

     .HTCexp <- setRefClass("HTCexp",
         fields = list(data = "matrix", x_intervals = "GenomicRanges",
           y_intervals = "GenomicRanges"))

A relatively recent addition to R is that `setRefClass` (and `setClass`) return 
a generator function that can be used to create instances of the class. We use 
the generator in a function that is exposed to the user, taking the opportunity 
to provide some hints about expected arguments and to simplify object construction

     HTCexp <-
         function(data = matrix(0, 0, 0), x_intervals = GRanges(),
                  y_intervals = x_intervals, ...)
     {
         .HTCexp$new(data=data, x_intervals = x_intervals,
                     y_intervals = y_intervals, ...)
     }

This is enough to create an object

     htc0 <- HTCexp()
     htc <- HTCexp(matrix(1:25, 5), GRanges("A", IRanges(1:5, width=10)))

We then provide a familiar interface to the class, e.g., with some getters:

     setGeneric("htc_data", function(x, ...) standardGeneric("htc_data"))
     setGeneric("x_intervals", function(x, ...) standardGeneric("x_intervals"))
     setGeneric("y_intervals", function(x, ...) standardGeneric("y_intervals"))

     setMethod("htc_data", "HTCexp", function(x, ...) x$data)
     setMethod("x_intervals", "HTCexp", function(x, ...) x$x_intervals)
     setMethod("y_intervals", "HTCexp", function(x, ...) x$y_intervals)

The user accesses components of `HTCexp` with these S4 methods as 
`x_intervals(htc)`, but internally we access the data using reference class 
techniques, e.g, `x$x_intervals`.

If we write a method or function that modifies our object, we need to maintain 
the illusion of copy-on-change by creating a clone updated with the data that we 
modify

     setMethod("[", "HTCexp", function(x, i, j, ..., drop=TRUE) {
         .clone(x, data=htc_data(x)[i,j, ...], x_intervals=x_intervals(x)[i],
               y_intervals=y_intervals(x)[j])
     })

Neat, `htc[1:2, 3:4]`. If you think about it, we aren't necessarily reducing the 
amount of copying involved -- we modified all fields of the object. In practice, 
I believe that we _have_ substantially reduced copying, for complicated reasons.

Let's implement the `Math` group generic (see `?GroupGenericFunctions`), so that 
we can call functions like `log` on our `HTCexp` object (I have no idea whether 
this is sensible in your example...).

     setMethod("Math", "HTCexp", function(x) {
         data <- callGeneric(htc_data(x))
         .clone(x, data=data)
     })

When the user says, e.g., `log(htc)`, we are updating the `data` field of the 
object, the GRanges in `x_intervals` and `y_intervals` are not copied.

## Design decisions

The design I offered above made the `HTCexp` class itself a copy-on-change 
reference class. In contrast, `SummarizedExperiment` creates a 
`ShallowSimpleListAssays` copy-on-change reference class, and uses this as a 
slot in the plain S4 `SummarizedExperiment`. There is no technical problem with 
this, and the implementation seemed to be effective -- the 'big' data is in the 
`Assays` slot, and we've gained significant performance without adding too much 
complexity. Your own use case was motivated by identical `GRanges` x- and 
y-intervals. Perhaps the solution for you is to construct a `GRangesRef` class 
and use these as slots in your HTCexp class?

As an additional note, there are differences between reference and S4 classes. 
One thing I've stumbled over is that validity method (`setValidity`) are not 
automatically invoked for reference classes; one approach is to write such 
methods, and invoke them at appropriate locations (e.g., after object 
construction in the `HTCexp` function).


Hope this helps; I'd like to add this as a How To page to the Bioconductor web 
site http://bioconductor.org/developers/ so if there are comments, corrections, 
etc., please don't hesitate to let me know.

Martin