[Bioc-devel] SummarizedExperiment: potential for data integration and meta-analysis?
On 09/20/2012 06:47 PM, Michael Lawrence wrote:
Thanks Vince, I think we're on the same page. I agree that a set of ranges-of-interest are not always appropriate, and that the most basic structure would be a table of samples X assays, with missing values. Ranges-of-interest can be layered on top when desired. There are many aspects of SummarizedExperiment that I would want to carry over, especially the idea of metadata, on the samples, assays, and features (when applicable). Michael On Thu, Sep 20, 2012 at 9:30 AM, Vincent Carey <stvjc at channing.harvard.edu>wrote:
I'll comment briefly because I think this is a strategically important topic and I have done a little bit on integration in various forms. My view of SummarizedExperiment is that it updates the eSet concept to promote range-based indexing of assay features. The 'assays' component is limited to matrix/array like things and my sense is that the
It might help to nail down a more precise 'API' for what can be in the assays slot, but I think it would be definitely array-like; no need for it to be an actual 'matrix', though.
"Summarized" implies that the intention is for a memory-tractable, serializable reduction of an experiment applied to all of a fixed set of samples. I felt that what Michael was describing departs significantly from these conditions/aims in various ways -- there are multiple assays, possibly at different stages of summarization, and one wants a coherent path to interaction with these, requiring less uniformity of structure. Entities to be covered are, roughly, a set of biological
A major task I think would be management of on-disk resources, guaranteeing in some way that the object is not tied to some fragile local disk structure. The heterogeneity of data types also seems like a significant departure.
samples, mostly assayed in the same ways, but the assays do not imply a common set of measurements on a fixed set of ranges. One possible term for the data structure described by Michael is "ExperimentHub". This
a nice term. Martin
would include references to various external data resources and it would have methods for traversing the resources for certain objectives. Instead of nesting the SummarizedExperiment structures, we could think of certain traversals culminating in SummarizedExperiment instances. I think this would lead to high-level workflow prescriptions that could be broadly applicable -- say you have VCFs and BAMs on a collection of samples with some gaps, start with an ExperimentHub consisting of path specifications and on this you could derive some basic statistics on data availability. You'd want to have a little more detail on the biology from which the files arose early on, to help organize the high-level description. For example, I assume you might have separate VCFs on germ-line and tumor DNA, BAM from RNA-seq applied to different cell types, and from some ChIP-seq ... some samples have all, some have only a few of these assays, and spelling all this out at an early stage would be very useful. On Thu, Sep 20, 2012 at 9:18 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:
Dear all,
Here is a problem that has been bouncing around in my head, and I
thought it might be time for some discussion. Maybe others have
already figured this out.
We are often interested in the same genomic regions over multiple
datasets and multiple samples. Typically, the data are the output of
a large analysis pipeline. On the surface, SummarizedExperiment
is very close to the right data structure, but there are some issues.
Often, these data will be too large to load completely into memory,
so we need objects that point to out-of-memory storage. This would
need to be matrix-like, like a BamViews object, but there would be
redundancy between the ranges in the BamViews and the ranges in
the rowData. Thus, the BamViews could be created from a
BamFileList dynamically when the user retrieves an assay, or there
would need to be consistency checking to make sure the same ranges
are being described (would be a performance drain).
Another issue is that certain samples may only be included in
certain assays. In the simple matrix case, we could handle this with
NA values. The out-of-memory references will need to support a
similar semantic. So far, we have not allowed NA in the List
classes, but I think we might have to move in the direction. In some
ways, we are stretching the definition of SE here, because we might
have multiple experiments, not just one.
Perhaps we are no longer talking about a summary but are focusing
more on integration, i.e., we are talking about an
IntegratedExperiment. But I think SummarizedExperiment could be
coerced into this role.
Let's get this started with a use-case, here is one related to variant
calling:
Assume we have some output from a sequence analysis pipeline,
including alignments, coverage and variant calls. We want to
validate exome variants in RNA, but only where genes are expressed
(high coverage in RNA). Now assume that a SE has been constructed
for the exome variant positions and all of the samples. The assays
are the exome calls (VCF), the RNA calls (VCF), and the RNA
coverage (BigWig). The algorithm needs to extract the variant
information as GRanges, and the coverage information as an Rle.
First, we extract the exome variants:
> exome.variants <- assay(se, "exome.variants")
What would exome.variants be? In oncology at least, it is way more
efficient to output a VCF per sample and then merge them at
analysis time. Let us assume that there is one VCF file per sample
and internally there is a VcfFileList (I think Vince has shown
something like this). The exome.variants object needs to carry
along the positions from the SE rowData. The minimum conversion
would be to something like a VcfViews object (as in BamViews). The
VcfViews object should try to provide the same API has VCF, where
it makes sense. There are obvious issues like, would the column
indexing be by sample or by file? Conceptually at least, the
VcfViews is going to be very similar to a merge of multiple VCF
files into a VCF object. Would the return value really be a
VcfViews, or could it coerced directly to a VCF? The coercion may
be complicated, so it may be best to leave that as a second step,
after pulling out the assay.
Alternatively, if there is a single VCF file, the data could be
stored as a VCF, since it is matrix-like, after all. So SE's could
be nested. This would obviously be most efficient space-wise if the
VCF class were implemented on top of a tabix-indexed VCF, with
on-demand materialization. But maybe it is simpler to just use a
length-one VcfFileList/VcfViews for this? (As an aside, it would be
nice if there were some general abstraction for variant data,
whether stored in VCF, GVF, or some other format/database).
Then for the coverage:
> rna.coverage <- assay(se, "rna.coverage")
Following the conventions above, rna.coverage would be a
BigWigViews, which might have an API like viewSums, viewMaxs, etc
for getting back a matrix of coverage summaries, possibly as a
SummarizedExperiment?
So that's all I have for now.
Thanks,
Michael
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
[[alternative HTML version deleted]]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793