
[Bioc-devel] Numeric Operation on DataFrame

3 messages · Dario Strbenac, Michael Lawrence, Hervé Pagès

#
Good day,

Would it be useful for a future release of S4Vectors to support the same numeric operations on a DataFrame that already work on a data.frame? For example,

library(S4Vectors)

# Base data.frame: colMeans() works.
dataTable <- data.frame(aFeature = 1:5, anotherFeature = 5:1)
colMeans(dataTable)
#  aFeature anotherFeature 
#         3              3

# S4Vectors DataFrame: the same call fails.
dataTableS4 <- DataFrame(aFeature = 1:5, anotherFeature = 5:1)
colMeans(dataTableS4)
# Error in colMeans(dataTableS4) : 
#   'x' must be an array of at least two dimensions

--------------------------------------
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
#
Please be more specific about the desired operations, or, better, submit a
pull request with them. colMeans() in particular was intentionally omitted
because it depends on having homogeneous data, which is better suited for a
matrix, not a data frame.
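
To illustrate the homogeneity point: when every column is numeric, one option is to coerce the DataFrame to a matrix explicitly before summarizing. This is a minimal sketch, assuming the S4Vectors package is attached and reusing the column names from the original post; the variable names `m` and `means` are only illustrative.

```r
library(S4Vectors)

dataTableS4 <- DataFrame(aFeature = 1:5, anotherFeature = 5:1)

# Coerce to an ordinary matrix first. This only makes sense when all
# columns share a type (here both are integer).
m <- as.matrix(dataTableS4)

# colMeans() now operates on a genuine two-dimensional array.
means <- colMeans(m)
means
```

The explicit coercion makes a copy of the data, which may matter for large tables.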

On Mon, Jan 15, 2018 at 10:00 PM, Dario Strbenac <dstr7320 at uni.sydney.edu.au

  
  
#
Hi,

I think I remember it was once suggested on this list that DataFrame
objects with numeric columns could support math/summarization
operations, like data.frame objects do (can't find the thread
to provide the link, sorry).

I'll mention that wrapping a DataFrame object (or any matrix-like or
array-like object) in a DelayedArray object is one way to enable
this:

   library(DelayedArray)
   M <- DelayedArray(dataTableS4)
   colMeans(M)
   #      aFeature anotherFeature
   #             3              3

This should not copy the DataFrame, so it should be more memory-efficient
than calling as.data.frame() on it. In addition, it transparently uses
the internal DelayedArray machinery, i.e. it delays some operations
(e.g. subsetting and log() in colMeans(log(M[-1, ])) are delayed) and
uses block processing for non-delayed operations (e.g. colMeans).
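
The delayed versus non-delayed distinction might look roughly like the sketch below. It assumes the S4Vectors and DelayedArray packages are installed and that wrapping a DataFrame as a seed still behaves as described in this message; the names `M2`, `delayedMeans` and `plainMeans` are illustrative only.

```r
library(S4Vectors)
library(DelayedArray)

dataTableS4 <- DataFrame(aFeature = 1:5, anotherFeature = 5:1)
M <- DelayedArray(dataTableS4)

# Subsetting and log() are recorded as delayed operations; nothing is
# computed yet and the result is still a DelayedMatrix.
M2 <- log(M[-1, ])
is(M2, "DelayedMatrix")

# colMeans() is not delayed: it realizes the data block by block.
delayedMeans <- colMeans(M2)

# For comparison, the same computation on an in-memory matrix.
plainMeans <- colMeans(log(as.matrix(dataTableS4)[-1, ]))
all.equal(unname(delayedMeans), unname(plainMeans))
```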

Note that wrapping a DataFrame with Rle columns in a DelayedArray
object also works.
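
As a sketch of the Rle case, assuming the same packages as above and that the DataFrame seed behaves as described in this message (the table `rleTable` and its values are made up for illustration):

```r
library(S4Vectors)
library(DelayedArray)

# A DataFrame whose columns are run-length encoded numeric vectors.
rleTable <- DataFrame(
    aFeature = Rle(c(1, 1, 2, 2, 2)),
    anotherFeature = Rle(c(4, 4, 4, 0, 0))
)

# Wrap it and summarize exactly as with ordinary columns.
rleMeans <- colMeans(DelayedArray(rleTable))
rleMeans
```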

Pete's DelayedMatrixStats package will extend DelayedArray capabilities
by giving you access to all the summarization functions defined in
the matrixStats package.
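
For instance, a column-wise median might be obtained as sketched below. This assumes the DelayedMatrixStats package is installed and that its colMedians() method accepts a DelayedMatrix wrapped around a DataFrame; treat it as an illustration, not a statement about which code path the package takes internally.

```r
library(S4Vectors)
library(DelayedArray)
library(DelayedMatrixStats)

dataTableS4 <- DataFrame(aFeature = 1:5, anotherFeature = 5:1)
M <- DelayedArray(dataTableS4)

# colMedians() is one of the matrixStats-style summaries that
# DelayedMatrixStats provides for DelayedMatrix objects.
medians <- colMedians(M)
medians
```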

That being said, it would be nice if math/summarization operations
worked directly on DataFrame objects like they do on ordinary
data frames. This could naturally be extended to DataFrame objects
with numeric Rle columns.
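
Until something like that exists, a column-by-column helper is one possible stopgap. The function `colMeansDF` below is hypothetical and not part of S4Vectors; it is a minimal sketch that assumes every column has a mean() method, which should hold for ordinary numeric vectors and numeric Rle columns.

```r
library(S4Vectors)

# Hypothetical helper, not part of S4Vectors: apply mean() to each
# column and return a named numeric vector.
colMeansDF <- function(df) {
    vapply(as.list(df), function(column) mean(column), numeric(1))
}

plainTable <- DataFrame(aFeature = 1:5, anotherFeature = 5:1)
plainResult <- colMeansDF(plainTable)

# The same helper applied to numeric Rle columns.
rleTable <- DataFrame(aFeature = Rle(c(1, 1, 2, 2, 2)))
rleResult <- colMeansDF(rleTable)
```

Working column by column avoids coercing the whole table to a matrix, at the cost of one mean() call per column.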

H.
On 01/16/2018 06:29 AM, Michael Lawrence wrote: