
[Bioc-devel] SummarizedExperiment vs ExpressionSet

10 messages · Wolfgang Huber, Laurent Gatto, Michael Lawrence +4 more

#
A colleague and I are designing a package for quantitative proteomics data, and we are debating whether to base it on the SummarizedExperiment or the ExpressionSet class. 

There is no immediate use for the ranges aspect of SummarizedExperiment, so that would have to be carried around with NAs, and this is a parsimony argument for using ExpressionSet instead. OTOH, the interface of SummarizedExperiment is cleaner, its code more modern and more likely to be updated, and users of the Bioconductor project are likely to benefit from having to deal with a single interface that works the same or similarly across packages, rather than a variety of formats; which argues that new packages should converge towards SummarizedExperiment('s interface).

Are there any pertinent insights from this group?

Thanks and best wishes
Wolfgang
#
On 26 November 2014 14:59, Wolfgang Huber wrote:

            
Instead of ExpressionSet, you could use MSnbase::MSnSet, which is
essentially an ExpressionSet for quantitative proteomics (i.e. it has a
MIAPE slot instead of a MIAME one, for example).

Ideally, a SummarizedExperiment for proteomics would use peptide/protein
ranges, which is in the pipeline, as far as I am concerned. When that
becomes available, there should be infrastructure to coerce an MSnSet
(and/or other relevant data) into a SummarizedExperiment.

Hope this helps.

Best wishes,

Laurent

  
    
#
Hi all,

I believe there is a strong need for an object that organizes a collection
of rectangular data (matrices, etc.) with metadata on the rows and
columns.  Can SummarizedExperiment inherit from something simpler that has
a DataFrame as rowData?  (I believe GenomicRanges should inherit from
DataTable, rather than Vector, and subset as x[i,j], but maybe that's
getting a bit off topic.)  I often see people stuffing arbitrary data into
an ExpressionSet and calling one of the assays "exprs" as a work-around.

Regards,

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
On Wed, Nov 26, 2014 at 7:19 AM, Laurent Gatto <lg390 at cam.ac.uk> wrote:

            

  
  
#
On Wed, Nov 26, 2014 at 9:07 AM, Peter Haverty <haverty.peter at gene.com>
wrote:
> (I believe GenomicRanges should inherit from DataTable, rather than Vector,
> and subset as x[i,j] [...])

Have to disagree on that. A GRanges is a vector of ranges; a table is a
list of vectors all of the same length. Different things. There was a lot
of thought invested in that. But it does subset as x[i,j], so in theory
SummarizedExperiment could be generalized to contain something with the
contract of 2D extraction.

  
  
#
So, as a simple experiment, I did the following:

library(GenomicRanges)
bar <- matrix(rnorm(100), ncol=10)
colnames(bar) <- as.character(1:10)
rownames(bar) <- letters[1:10]
foo <- SummarizedExperiment(assays=list(bar=bar))

rowData(foo)
## GRangesList object of length 10:
## $a
## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##
## $b
## GRanges object with 0 ranges and 0 metadata columns:
##      seqnames ranges strand
##
## $c
## GRanges object with 0 ranges and 0 metadata columns:
##      seqnames ranges strand
##
## ...
## <7 more elements>

colData(foo)
## DataFrame with 10 rows and 0 columns

This got me to thinking, why not have an emptyRanges class, or else the
ability to index a bunch of NULL ranges without a lot of hoohah?  The
defaults mostly do what they're supposed to; why not have a compact
representation of empty rowData as for empty colData (i.e., a DataFrame
with 0 columns)?  Or is a GRangesList of empty GRanges as compact as it is
practicable to get for this purpose?

Just pondering what the lowest-impact solution to the problem at hand might
be.


Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Wed, Nov 26, 2014 at 9:07 AM, Peter Haverty <haverty.peter at gene.com>
wrote:

  
  
#
GRangesList is very compact, so this would definitely get the job done. But
having an empty range is not the same as an NA, nor does it mean that ranges
are "irrelevant". There are definitely times, especially as we extend
beyond genomics, when we need something more general, as Pete suggests.

As an aside I think there is an interesting structural relationship between
something like an eSet and a pivot table in a spreadsheet, except an eSet
has multiple measurement tables and the column/row annotations are not just
for aggregation. If we start to think more broadly, we should consider such
specializations and try to unify them into a single framework.



On Wed, Nov 26, 2014 at 9:37 AM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:

  
  
#
One thing that's become apparent working on epivizr is that it may be useful to think about "rowData" in a SummarizedExperiment as having two distinct components: row coordinates and row metadata. In the current class rowData is a "GenomicRanges" which contains both coordinates (the ranges) and metadata (mcols(rowData)). In metagenomics (the other application my group works a lot with), we think of the taxonomy as providing coordinates. The distinction is worthwhile thinking about since there are certain operations we do on coordinates that we don't do with metadata (and conversely).




Thinking about it this way, the "ExpressionSet" object would be data without coordinates. So, I would avoid making "GenomicRanges" behave like "DataFrame", since the distinction between coordinates and metadata would then be lost. The "emptyRanges" proposal gets closer to this, since it corresponds to "no coordinates", but it may be worth thinking in the long term about making the coordinate/metadata distinction more general.




Hector

On Wed, Nov 26, 2014 at 12:38 PM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:
#
Hi guys,

I like the idea of separating the row data from the row ranges.
This could be formalized with 2 distinct accessors: rowData() and
rowRanges(). The former would return a DataFrame, and the latter
NULL or a range-based object (GRanges or GRangesList).
I don't think there is a need for an emptyRanges class.
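
A minimal sketch of what the two-accessor idea could look like, written in
plain S4 so it runs without Bioconductor. Everything here is an assumption
made for illustration: the class name MiniExperiment is hypothetical, a
data.frame stands in for both DataFrame and GRanges, and the rowData() and
rowRanges() generics are defined locally rather than taken from any package.

```r
library(methods)

# Hypothetical stand-in class; NOT the Bioconductor implementation.
setClass("MiniExperiment",
  representation(
    assay     = "matrix",      # features x samples
    rowData   = "data.frame",  # per-row metadata, always present
    rowRanges = "ANY"          # NULL, or a data.frame standing in for GRanges
  ),
  validity = function(object) {
    n <- nrow(object@assay)
    if (nrow(object@rowData) != n)
      return("rowData must have one row per assay row")
    rr <- object@rowRanges
    if (!is.null(rr) && !(is.data.frame(rr) && nrow(rr) == n))
      return("rowRanges must be NULL or have one row per assay row")
    TRUE
  }
)

# Local generics for this sketch only; if a package that already defines
# rowData()/rowRanges() is attached, these definitions would mask it.
setGeneric("rowData",   function(x) standardGeneric("rowData"))
setGeneric("rowRanges", function(x) standardGeneric("rowRanges"))
setMethod("rowData",   "MiniExperiment", function(x) x@rowData)
setMethod("rowRanges", "MiniExperiment", function(x) x@rowRanges)

set.seed(1)
bar <- matrix(rnorm(100), ncol = 10,
              dimnames = list(letters[1:10], as.character(1:10)))

# Proteomics-style object: row metadata, but no coordinates at all.
prot <- new("MiniExperiment",
            assay     = bar,
            rowData   = data.frame(protein = toupper(letters[1:10]),
                                   row.names = rownames(bar)),
            rowRanges = NULL)

rowData(prot)    # a 10-row table of metadata
rowRanges(prot)  # NULL, i.e. "no coordinates", with no placeholder ranges
```

With this layout, "no ranges" is just NULL, so nothing empty has to be
carried around for data that has no coordinates.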

H.
On 11/26/2014 11:40 AM, Hector Corrada Bravo wrote:

  
    
#
OK, GRanges as vector that does overlap stuff makes sense, but I think
putting a DataFrame of metadata on that confuses the purpose of the
object.  How about a "GRangesTable" that inherits from both GenomicRanges
and DataTable?  It would be a DataFrame with a fancy index.  The DataFrame
API would make stuff like colnames work (rather than needing
colnames(mcols(x)) ). If this were used as the rowData for
SummarizedExperiment, then a plain DataFrame could be made to work too.

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Wed, Nov 26, 2014 at 9:33 AM, Michael Lawrence <lawrence.michael at gene.com>
wrote:

  
  
#
The two objects have conflicting APIs. For example, 1D extraction indexes
into the ranges for a GRanges, but into the columns for a table. So I would
not recommend multiple inheritance. Instead, define something new with the
semantics you want and use composition. Maybe just a subclass of DataFrame
that adds a GenomicRanges slot.
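
To make the API conflict and the composition alternative concrete, here is a
rough sketch in plain S4. It is a simplification and not the design proposed
above: the class name RangedTable is hypothetical, data.frames stand in for
both GenomicRanges and DataFrame, and the two parts are held in two slots
instead of subclassing DataFrame.

```r
library(methods)

# The conflict: 1D extraction means different things for the two parents.
ranges_like <- c(a = 10L, b = 20L, c = 30L)          # vector-like: x[2] is an element
table_like  <- data.frame(score = 1:3, gc = c(0.4, 0.5, 0.6),
                          row.names = c("a", "b", "c"))
ranges_like[2]   # second element
table_like[2]    # second COLUMN

# Composition: hold both parts and define the semantics explicitly.
setClass("RangedTable",
  representation(
    ranges = "data.frame",  # stand-in for a GenomicRanges slot (coordinates)
    table  = "data.frame"   # per-row metadata
  ),
  validity = function(object) {
    if (nrow(object@ranges) != nrow(object@table))
      "ranges and table must have the same number of rows"
    else TRUE
  }
)

# x[i, j]: i selects rows of both parts, j selects metadata columns only.
# A bare x[i] is treated as row selection here (an arbitrary choice).
setMethod("[", "RangedTable", function(x, i, j, ..., drop = FALSE) {
  if (missing(i)) i <- seq_len(nrow(x@table))
  if (missing(j)) j <- seq_len(ncol(x@table))
  new("RangedTable",
      ranges = x@ranges[i, , drop = FALSE],
      table  = x@table[i, j, drop = FALSE])
})

rt <- new("RangedTable",
          ranges = data.frame(start = c(1L, 11L, 21L), end = c(10L, 20L, 30L),
                              row.names = c("a", "b", "c")),
          table  = table_like)

sub <- rt[2:3, "score"]
sub@ranges   # rows b and c of the coordinates
sub@table    # rows b and c, column "score" only
```

Because the class defines its own `[`, neither parent's 1D convention is
inherited by accident; the choice is written down in one place.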

On Wed, Nov 26, 2014 at 1:55 PM, Peter Haverty <haverty.peter at gene.com>
wrote: