Shallow copies

2 messages · Matthieu Gomez, Henrik Bengtsson

#
I have a question about shallow copies in R. Since R 3.1.0, subsetting
a data.frame by columns no longer results in a deep copy. This is an
amazing change in my opinion. However, subsetting a data.frame by rows
(or subsetting a matrix by rows or columns) still makes a deep copy. In
particular, it is my understanding that running a command on a very
large subset of rows (say "sum" or "biglm" on non-outlier observations)
first makes a deep copy of those rows, which can require up to twice
the memory of the original data.frame/matrix. If this is correct, I
would be very interested to know whether this behavior may change in
future versions of R.

Thanks a lot!
Matthew
#
On Tue, Sep 30, 2014 at 2:20 PM, Matthieu Gomez
<gomez.matthieu at gmail.com> wrote:
I'll let the experts comment on this, but subsetting/reshuffling columns
in data frames sounds easy compared with what you're asking for.
Columns of a data frame are basically just elements in a list and they
don't have to be contiguous in memory, whereas the elements of a matrix
(of a basic data type) need to be contiguous in memory.
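
The difference between the two kinds of subsetting can be checked
directly. The following is only a rough sketch: addr() is a hypothetical
helper (not part of R), and it assumes an R build with memory profiling
enabled, since it relies on tracemem() to report an object's address.

```r
# Hypothetical helper: report the memory address of an object as a string.
# Assumes R was built with memory profiling (tracemem() errors otherwise).
addr <- function(x) {
  a <- tracemem(x)   # marks the object and returns its address, e.g. "<0x...>"
  untracemem(x)      # remove the mark again so later copies are not reported
  a
}

df <- data.frame(a = runif(10), b = runif(10), c = runif(10))

by_cols <- df[c("a", "b")]   # subset by columns
by_rows <- df[1:5, ]         # subset by rows

# Does column 'a' of each subset live at the same address as df$a?
shares_cols <- identical(addr(df$a), addr(by_cols$a))
shares_rows <- identical(addr(df$a), addr(by_rows$a))

print(c(columns = shares_cols, rows = shares_rows))
```

If the column vector is shared, the column subset reports the same
address as the original; the row subset has to allocate new, shorter
vectors and so cannot.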

However, somewhat related: Having lots of these temporary copies can
be quite time consuming for the garbage collector, so it's not just
about the memory but also about the overall processing time.  One of
the next improvements in the 'matrixStats' package is to add support
for specifying subsets of rows/columns to operate over - for the
purpose of avoiding the auxiliary copies that you talk about, e.g.

  cols <- c(1:14, 87:103)
  rows <- seq(from=1, to=nrow(X), by=2)
  y <- rowMedians(X, rows=rows, columns=cols)

instead of

  y <- rowMedians(X[rows,cols])

It's a fairly simple task to implement, but I've got lots on my plate
so I don't know when this will happen. (I welcome contributions via
github.com/HenrikBengtsson/matrixStats.) Similar methods in R (e.g.
rowSums()) could gain from this too.
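
To illustrate the general idea in plain R (this is not how matrixStats
implements or would implement it, just a sketch), a row-sum function
can take the row and column indices itself and walk over the selected
columns one at a time. The name row_sums_subset() and its arguments are
made up for this example.

```r
# Hypothetical helper, not part of base R or matrixStats.
# Sums over the selected columns for the selected rows without ever
# building the full submatrix X[rows, cols].
row_sums_subset <- function(X, rows = seq_len(nrow(X)),
                            cols = seq_len(ncol(X))) {
  acc <- numeric(length(rows))
  for (j in cols) {
    # Each temporary has length(rows) elements only,
    # not length(rows) * length(cols).
    acc <- acc + X[rows, j]
  }
  acc
}

set.seed(1)
X <- matrix(rnorm(200 * 110), nrow = 200, ncol = 110)
cols <- c(1:14, 87:103)
rows <- seq(from = 1, to = nrow(X), by = 2)

y_sub  <- row_sums_subset(X, rows = rows, cols = cols)
y_copy <- rowSums(X[rows, cols])   # materializes the submatrix first

print(all.equal(y_sub, y_copy))
```

An R-level loop like this is slower than a native implementation; the
point is only that the temporary storage scales with the number of
selected rows rather than with the size of the whole submatrix.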

/Henrik
(matrixStats)

PS. Code compilation could translate rowMedians(X[rows,cols]) to
rowMedians(X, rows=rows, columns=cols) but that's far in the future (I
think).