
Class that wraps Data Frame

4 messages · Ramiro Barrantes, David Winsemius, Bert Gunter +1 more

#
On Aug 31, 2012, at 5:57 AM, Ramiro Barrantes wrote:

            
You could argue that the entire Bioconductor project represents such an effort. It makes extensive use of S4 methods. I'm not a user, so I cannot readily point to examples of S4 functions that have 'set.' and 'get.' methods for particular sorts of data frames, but I suspect you can pose the same question on the BioC mailing list and get a more informed answer.
#
I guess there are two issues with data.frame. It comes with more than 
you probably want to support (e.g., both list-like and matrix-like 
subsetting with [, and users expecting to be able to modify any column 
independently). And it comes with less than you'd like (e.g., support 
for a 'column' of S4 objects). By making a class that contains ('is a') 
data.frame, you commit to both limitations.

You're probably using data.frame as a way to implement some basic 
restrictions -- equal-length columns, for instance. But there are 
additional restrictions, too: columns x, y, z must be present and of 
type integer, character, numeric respectively. For this scenario one is 
better off implementing an S4 class (whose typed slots provide the type 
checking and the required columns), a validity method (for enforcing the 
equal-length constraint), accessors, and subsetting following the 
semantics that you'd like to support, e.g., just along the length of the 
required slots.
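
As a rough illustration of the design described above, here is a minimal 
sketch using only the base 'methods' package. The class name Triple, the 
constructor, and the accessor xcol() are hypothetical names made up for 
this example; the slots x, y, z stand in for the required columns.

```r
library(methods)

# Hypothetical class: the required 'columns' x, y, z are typed slots,
# so the slot types are checked whenever an object is created.
setClass("Triple",
    representation(x = "integer", y = "character", z = "numeric"),
    validity = function(object) {
        # Enforce the equal-length constraint across the slots.
        n <- length(object@x)
        if (length(object@y) != n || length(object@z) != n)
            "slots 'x', 'y' and 'z' must have equal length"
        else
            TRUE
    })

# Constructor (hypothetical helper).
Triple <- function(x = integer(), y = character(), z = numeric())
    new("Triple", x = x, y = y, z = z)

# Accessor (hypothetical name) so users need not touch slots directly.
setGeneric("xcol", function(object) standardGeneric("xcol"))
setMethod("xcol", "Triple", function(object) object@x)

setMethod("length", "Triple", function(x) length(x@x))

# List-like (single-dimensional) subsetting along the length of the slots.
# initialize() re-runs the validity method on the result.
setMethod("[", "Triple", function(x, i, j, ..., drop = TRUE) {
    if (missing(i))
        return(x)
    initialize(x, x = x@x[i], y = x@y[i], z = x@z[i])
})

t1 <- Triple(x = 1:3, y = c("a", "b", "c"), z = c(0.5, 1.5, 2.5))
t2 <- t1[2:3]

# Unequal lengths are rejected by the validity method; the result here is
# the error message rather than an object.
bad <- tryCatch(Triple(x = 1:3, y = c("a", "b"), z = c(1, 2, 3)),
                error = function(e) conditionMessage(e))
```

Because only a single-index "[" is defined, matrix-style use such as 
t1[1, 2] is not given any special meaning, which is the point of not 
inheriting from data.frame.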

The richest place for this in Bioconductor is the IRanges package, 
though it can be a bit daunting from an architecture point of view. A 
couple of things to point to. One is the DataFrame class, which is like 
a data.frame but supporting a broader (in particular S4) set of columns 
and allowing 'metadata' (actually, DataFrame, so recursive) on each 
column. It is relevant if it is important to maintain S4 classes in a 
data.frame-like structure.

Another is the IRanges class, which in some ways fits your overall use 
case. It is basically a rectangular data structure, but with required 
'columns' (the start and width of the range) and then arbitrary columns 
the user can add. It's implemented with slots for start and width, and 
then 'has a' slot containing a DataFrame as 'metadata columns' (the 
actual implementation is more complicated than this). There are start 
and width accessors. Sub-setting is always list-like 
(single-dimensional, along the ranges). Users wanting to access one of 
'their' columns use $ or extract the metadata columns (via mcols()) as a 
DataFrame and then work on that. Maybe it's worth pointing out that the 
basic definitions are column-oriented: an IRanges instance contains 
start and width vectors; there is no 'IRange' class.

The GRanges class (in the GenomicRanges package) 'has an' IRanges, but 
adds additional required slots ('seqnames' to reference the names of the 
chromosome sequences to which the ranges refer, 'strand' to indicate the 
strand to which the range belongs, etc.). So the pattern here avoids the 
'is a' relationship that simple class extension would imply.
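
To make the 'has a' pattern concrete, here is a small sketch in base R. 
It is not the IRanges or GenomicRanges implementation; the class names 
SimpleRanges and SimpleGRanges and the slot 'meta' are hypothetical, and 
a plain data.frame stands in for the DataFrame of metadata columns.

```r
library(methods)

# Hypothetical ranges class: required 'columns' are slots, user columns
# live in a data.frame that the class 'has' rather than 'is'.
setClass("SimpleRanges",
    representation(start = "integer", width = "integer",
                   meta = "data.frame"),
    validity = function(object) {
        n <- length(object@start)
        if (length(object@width) != n)
            "'start' and 'width' must have equal length"
        else if (nrow(object@meta) != n)
            "'meta' must have one row per range"
        else
            TRUE
    })

# Constructor: extra named arguments become user columns.
SimpleRanges <- function(start, width, ...) {
    meta <- data.frame(..., row.names = seq_along(start))
    new("SimpleRanges", start = as.integer(start),
        width = as.integer(width), meta = meta)
}

# '$' reaches into the user columns only.
setMethod("$", "SimpleRanges", function(x, name) x@meta[[name]])

# Subsetting is single-dimensional, along the ranges.
setMethod("[", "SimpleRanges", function(x, i, j, ..., drop = TRUE) {
    initialize(x, start = x@start[i], width = x@width[i],
               meta = x@meta[i, , drop = FALSE])
})

# Hypothetical genomic class: it has a SimpleRanges slot and adds
# further required slots, instead of extending SimpleRanges.
setClass("SimpleGRanges",
    representation(ranges = "SimpleRanges", seqnames = "character",
                   strand = "character"),
    validity = function(object) {
        n <- length(object@ranges@start)
        if (length(object@seqnames) != n || length(object@strand) != n)
            "'seqnames' and 'strand' must have one element per range"
        else
            TRUE
    })

# Subsetting delegates to the contained ranges.
setMethod("[", "SimpleGRanges", function(x, i, j, ..., drop = TRUE) {
    initialize(x, ranges = x@ranges[i], seqnames = x@seqnames[i],
               strand = x@strand[i])
})

gr <- new("SimpleGRanges",
          ranges = SimpleRanges(c(1, 10, 20), c(5, 5, 5),
                                score = c(0.1, 0.2, 0.3)),
          seqnames = c("chr1", "chr1", "chr2"),
          strand = c("+", "-", "+"))
sub <- gr[2:3]
```

Since SimpleGRanges does not extend SimpleRanges, methods written for the 
latter do not apply to it automatically; each one is exposed only by 
explicit delegation, which keeps the supported interface small.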

The IRanges package is at

   http://bioconductor.org/packages/devel/bioc/html/IRanges.html

I've described the 'devel' version of Bioconductor; see

   http://bioconductor.org/developers/useDevel/

Martin
On 08/31/2012 08:39 AM, Bert Gunter wrote: