[Bioc-devel] SummarizedExperiment: potential for data integration and meta-analysis?

7 messages · Vincent Carey, Martin Morgan, Kasper Daniel Hansen +2 more

#
I'll comment briefly because I think this is a strategically important topic and
I have done a little bit on integration in various forms.

My view of SummarizedExperiment is that it updates the eSet concept to
promote range-based indexing of assay features.  The 'assays' component
is limited to matrix/array like things and my sense is that the "Summarized"
implies that the intention is for a memory-tractable, serializable reduction of
an experiment applied to all of a fixed set of samples.

I felt that what Michael was describing departs significantly from these
conditions/aims in various ways -- there are multiple assays, possibly at
different stages of summarization, and one wants a coherent path to
interaction with these, requiring less uniformity of structure.  Entities to
be covered are, roughly, a set of biological samples, mostly assayed in the
same ways, but the assays do not imply a common set of measurements on a
fixed set of ranges.

One possible term for the data structure described by Michael is
"ExperimentHub".  This would include references to various external data
resources and it would have methods for traversing the resources for certain
objectives.  Instead of nesting the SummarizedExperiment structures, we could
think of certain traversals culminating in SummarizedExperiment instances.

I think this would lead to high-level workflow prescriptions that could be
broadly applicable.  Say you have VCFs and BAMs on a collection of samples
with some gaps: start with an ExperimentHub consisting of path
specifications, and from this you could derive some basic statistics on data
availability.  You'd want to have a little more detail on the biology from
which the files arose early on, to help organize the high-level description.
For example, I assume you might have separate VCFs on germ-line and tumor
DNA, BAM from RNA-seq applied to different cell types, and from some
ChIP-seq ... some samples have all, some have only a few of these assays, and
spelling all this out at an early stage would be very useful.
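
A minimal sketch of the "path specifications plus availability statistics"
idea, written in Python purely for illustration.  Nothing here is an existing
Bioconductor API: the `hub` layout, the sample names, the assay labels, the
file paths, and the helpers `assay_counts` and `complete_samples` are all
assumptions made up for this example.

```python
from collections import Counter

# Hypothetical path specifications: sample -> {assay label -> file path}.
# Sample names, assay labels, and paths are invented for illustration.
hub = {
    "S1": {
        "vcf_germline": "s1.germline.vcf",
        "vcf_tumor": "s1.tumor.vcf",
        "bam_rnaseq": "s1.rna.bam",
    },
    "S2": {
        "vcf_germline": "s2.germline.vcf",
        "bam_rnaseq": "s2.rna.bam",
        "bam_chipseq": "s2.chip.bam",
    },
    "S3": {"bam_rnaseq": "s3.rna.bam"},
}


def assay_counts(hub):
    """Count how many samples have a path recorded for each assay."""
    return Counter(assay for assays in hub.values() for assay in assays)


def complete_samples(hub, required):
    """Return the samples that have every assay listed in `required`."""
    needed = set(required)
    return sorted(s for s, assays in hub.items() if needed <= set(assays))


if __name__ == "__main__":
    for assay, n in sorted(assay_counts(hub).items()):
        print(f"{assay}: {n} of {len(hub)} samples")
    print(complete_samples(hub, ["vcf_germline", "bam_rnaseq"]))
```

No file is opened at this stage; the gaps in coverage are visible from the
path table alone, which is the point of spelling it out early.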

On Thu, Sep 20, 2012 at 9:18 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
#
On 09/20/2012 06:47 PM, Michael Lawrence wrote:
It might help to nail down a more precise 'API' for what can be in the 
assays slot, but I think it would be definitely array-like; no need for 
it to be an actual 'matrix', though.
A major task I think would be management of on-disk resources, 
guaranteeing in some way that the object is not tied to some fragile 
local disk structure.

The heterogeneity of data types also seems like a significant departure.
a nice term.

Martin

  
    
#
So here is my 2 cents.  Perhaps a bit rambling, but it is probably
better to weigh in now.

There are two issues here.  One has to do with the right
representation of multiple classes of experiments and one has to do
with having the data on disk instead of in memory.

The latter first.  Michael is right to note that this (having an
on-disk representation) would be highly useful.  Some class that has a
pointer to a file and perhaps a getData method which would pick out a
region and return a SummarizedExperiment would be great.  This would
need to support BAM, bigWig and bigBed at least, and allow for each
sample being in a different file.  This is very much what we tried to
do with Genominator, except that we used a special file format instead
of just being able to point to different file types.
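
A rough sketch of the "pointer to a file plus a getData method" idea, in
Python for illustration only.  The class `FileBackedAssay`, the `Region`
type, and the tab-separated `chrom, start, end, value` file layout are
assumptions invented for this example; a real implementation would read BAM,
bigWig or bigBed through an indexed reader rather than scanning a text file.

```python
import csv
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive


class FileBackedAssay:
    """Holds only sample -> path; data are read when a region is requested."""

    def __init__(self, paths):
        self.paths = dict(paths)  # one file per sample

    def get_data(self, region):
        """Return {sample: [(start, end, value), ...]} overlapping `region`.

        Assumes each file is tab-separated with columns
        chrom, start, end, value (a made-up stand-in format).
        """
        result = {}
        for sample, path in self.paths.items():
            rows = []
            with open(path, newline="") as fh:
                for chrom, start, end, value in csv.reader(fh, delimiter="\t"):
                    start, end = int(start), int(end)
                    overlaps = start <= region.end and end >= region.start
                    if chrom == region.chrom and overlaps:
                        rows.append((start, end, float(value)))
            result[sample] = rows
        return result
```

The object itself stays small because it stores paths, not data; only the
rows overlapping the requested region are brought into memory.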

Now for the first one.  The use case I see is where you have a number
of assays on the same individuals but also a number of (different or
the same) assays on other samples.  Let us for example say that you
have done RNA-seq on some people and you want to look at ENCODE
ChIP-seq data in that region.

I think of this as a _collection_ of SummarizedExperiments.  A
collection because all assays in a SummarizedExperiment need to share
the same ranges.  And if you really have different assays, you may
have copy number (where each range is likely to be long),
sequencing-based expression and ChIP-seq.  They all have different
types of structure.
One solution to bring them all into a set of shared ranges is to
essentially do a disjoint on the ranges, but I don't like that.  I
think it will be important to store a single long copy number change
as a single range and not as a union of ranges.
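
To make the objection concrete, here is a small Python sketch of what a
"disjoint on the ranges" does to a long copy number segment.  The function
`disjoin` is a plain re-implementation written for this example (single
chromosome, 1-based closed intervals), and the coordinates are invented.

```python
def disjoin(ranges):
    """Split (start, end) intervals into non-overlapping pieces.

    Every original interval is exactly a union of the returned pieces.
    Coordinates are 1-based and inclusive, all on one chromosome.
    """
    # Every start, and every position just past an end, is a breakpoint.
    breaks = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    pieces = []
    for left, right in zip(breaks, breaks[1:]):
        piece = (left, right - 1)
        # Keep the piece only if some original interval covers it.
        if any(s <= piece[0] and piece[1] <= e for s, e in ranges):
            pieces.append(piece)
    return pieces


# One long copy number segment plus two short expression features.
copy_number = [(1, 1000)]
expression = [(100, 200), (400, 500)]

print(disjoin(copy_number + expression))
# [(1, 99), (100, 200), (201, 399), (400, 500), (501, 1000)]
```

The single copy number call ends up spread over five rows, which is the loss
of structure argued against above.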

I think it is important to allow different samples for different
experiments (and in fact I think this will be more common - say you
want to contrast your data with other public data in the same region -
this is unlikely to be the same samples).  And I don't think this
should be done by having a lot of NAs in the matrices.

So I think we need something like a list of SummarizedExperiments,
perhaps with a joint sampleData (how a joint sample data is mapped to
multiple assays will need to be thought about).  We might also have a
joint GRanges which signifies "this is the region(s) we have data on",
but we should still retain the individual ranges for each experiment.
Something like

dataRanges
  A GRanges, just telling us what is essentially the union of the
  rowData in the SummarizedExperiments below, or perhaps bigger.
copyNumber
  SummarizedExperiment, 3 samples
  has assays "copyNumber" and perhaps "control"
TF binding
  SummarizedExperiment, 10 samples
  has assays "TF1", .., "TF5" and "input"
SampleData
  some kind of joint sample phenodata.
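
The outline above could be sketched roughly as follows, in Python for
illustration only.  The classes `Experiment` and `ExperimentCollection`, their
fields, and the rule that every sample must appear in the joint sample data
are assumptions made for this example, not a proposed Bioconductor design.

```python
from dataclasses import dataclass


@dataclass
class Experiment:
    ranges: list   # [(chrom, start, end), ...], one per row
    samples: list  # column names
    assays: dict   # assay name -> rows x samples nested list

    def __post_init__(self):
        # All assays within one experiment share the same ranges and samples.
        for name, m in self.assays.items():
            if len(m) != len(self.ranges) or any(
                len(row) != len(self.samples) for row in m
            ):
                raise ValueError(f"assay {name!r} has the wrong shape")


@dataclass
class ExperimentCollection:
    experiments: dict  # experiment name -> Experiment
    sample_data: dict  # sample -> joint annotations

    def __post_init__(self):
        # Assumed rule: every sample used anywhere has joint sample data.
        for name, exp in self.experiments.items():
            missing = [s for s in exp.samples if s not in self.sample_data]
            if missing:
                raise ValueError(f"{name!r}: no sample data for {missing}")

    def data_ranges(self):
        """Union of all experiments' ranges, merging overlaps and neighbours."""
        all_ranges = sorted(
            r for e in self.experiments.values() for r in e.ranges
        )
        merged = []
        for chrom, start, end in all_ranges:
            last = merged[-1] if merged else None
            if last and last[0] == chrom and start <= last[2] + 1:
                merged[-1] = (chrom, last[1], max(last[2], end))
            else:
                merged.append((chrom, start, end))
        return merged

    def experiments_for(self, sample):
        """Names of the experiments in which `sample` was assayed."""
        return sorted(
            n for n, e in self.experiments.items() if sample in e.samples
        )
```

Each experiment keeps its own ranges and its own samples, so a long copy
number segment stays one row and no matrix needs padding with NAs; the joint
pieces (`data_ranges`, `sample_data`) are derived or stored alongside.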

Kasper
On Fri, Sep 21, 2012 at 9:50 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote: