[Bioc-devel] SummarizedExperiment: potential for data integration and meta-analysis?
7 messages · Vincent Carey, Martin Morgan, Kasper Daniel Hansen +2 more
I'll comment briefly because I think this is a strategically important topic, and I have done a little work on integration in various forms. My view of SummarizedExperiment is that it updates the eSet concept to promote range-based indexing of assay features. The 'assays' component is limited to matrix/array-like things, and my sense is that the "Summarized" implies the intention is a memory-tractable, serializable reduction of an experiment applied to all of a fixed set of samples.

What Michael describes departs significantly from these conditions/aims in various ways -- there are multiple assays, possibly at different stages of summarization, and one wants a coherent path to interacting with them, requiring less uniformity of structure. The entities to be covered are, roughly, a set of biological samples, mostly assayed in the same ways, but the assays do not imply a common set of measurements on a fixed set of ranges.

One possible term for the data structure Michael describes is "ExperimentHub". This would include references to various external data resources, and it would have methods for traversing the resources for certain objectives. Instead of nesting the SummarizedExperiment structures, we could think of certain traversals culminating in SummarizedExperiment instances. I think this would lead to high-level workflow prescriptions that could be broadly applicable -- say you have VCFs and BAMs on a collection of samples, with some gaps. Start with an ExperimentHub consisting of path specifications, and from this you could derive some basic statistics on data availability. You'd want a little more detail early on about the biology from which the files arose, to help organize the high-level description. For example, I assume you might have separate VCFs on germ-line and tumor DNA, BAMs from RNA-seq applied to different cell types, and some from ChIP-seq ... some samples have all of these assays, some have only a few, and spelling all this out at an early stage would be very useful.

On Thu, Sep 20, 2012 at 9:18 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
Dear all,

Here is a problem that has been bouncing around in my head, and I thought it might be time for some discussion. Maybe others have already figured this out.

We are often interested in the same genomic regions over multiple datasets and multiple samples. Typically, the data are the output of a large analysis pipeline. On the surface, SummarizedExperiment is very close to the right data structure, but there are some issues. Often, these data will be too large to load completely into memory, so we need objects that point to out-of-memory storage. This would need to be matrix-like, like a BamViews object, but there would be redundancy between the ranges in the BamViews and the ranges in the rowData. Thus, the BamViews could be created from a BamFileList dynamically when the user retrieves an assay, or there would need to be consistency checking to make sure the same ranges are being described (which would be a performance drain).

Another issue is that certain samples may be included only in certain assays. In the simple matrix case, we could handle this with NA values. The out-of-memory references will need to support a similar semantic. So far, we have not allowed NA in the List classes, but I think we might have to move in that direction. In some ways, we are stretching the definition of SE here, because we might have multiple experiments, not just one. Perhaps we are no longer talking about a summary but are focusing more on integration, i.e., we are talking about an IntegratedExperiment. But I think SummarizedExperiment could be coerced into this role.

Let's get this started with a use case; here is one related to variant calling. Assume we have some output from a sequence analysis pipeline, including alignments, coverage, and variant calls. We want to validate exome variants in RNA, but only where genes are expressed (high coverage in RNA). Now assume that an SE has been constructed for the exome variant positions and all of the samples.
The assays are the exome calls (VCF), the RNA calls (VCF), and the RNA coverage (BigWig). The algorithm needs to extract the variant information as GRanges, and the coverage information as an Rle. First, we extract the exome variants:
> exome.variants <- assay(se, "exome.variants")
What would exome.variants be? In oncology at least, it is far more efficient to output a VCF per sample and then merge them at analysis time. Let us assume that there is one VCF file per sample and that, internally, there is a VcfFileList (I think Vince has shown something like this). The exome.variants object needs to carry along the positions from the SE rowData. The minimal conversion would be to something like a VcfViews object (as in BamViews). The VcfViews object should try to provide the same API as VCF, where it makes sense. There are obvious issues, like: would the column indexing be by sample or by file? Conceptually at least, the VcfViews is going to be very similar to a merge of multiple VCF files into a VCF object. Would the return value really be a VcfViews, or could it be coerced directly to a VCF? The coercion may be complicated, so it may be best to leave that as a second step, after pulling out the assay.

Alternatively, if there is a single VCF file, the data could be stored as a VCF, since it is matrix-like, after all. So SEs could be nested. This would obviously be most efficient space-wise if the VCF class were implemented on top of a tabix-indexed VCF, with on-demand materialization. But maybe it is simpler to just use a length-one VcfFileList/VcfViews for this? (As an aside, it would be nice if there were some general abstraction for variant data, whether stored in VCF, GVF, or some other format/database.)

Then for the coverage:
> rna.coverage <- assay(se, "rna.coverage")
Following the conventions above, rna.coverage would be a BigWigViews, which might have an API like viewSums, viewMaxs, etc., for getting back a matrix of coverage summaries, possibly as a SummarizedExperiment?
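The access pattern Michael is sketching might look like the following. This is a schematic sketch only: VcfViews, BigWigViews, and the methods shown on them are hypothetical names from this discussion, not an existing API.

```r
## Schematic sketch of the proposed API; class and method names are
## hypothetical. Assumes `se` holds file-backed assays whose ranges come
## from the enclosing SE's rowData.
exome.variants <- assay(se, "exome.variants")  # a VcfViews: VcfFileList + rowData ranges
rna.coverage   <- assay(se, "rna.coverage")    # a BigWigViews over per-sample BigWig files

## Summaries materialize as an ordinary ranges-by-samples matrix:
cov <- viewMeans(rna.coverage)                 # cf. viewMeans() on RleViews in IRanges
expressed <- rowMeans(cov) > 10                # positions with high RNA coverage

## Validate exome calls only at expressed positions:
validatable <- exome.variants[expressed, ]
```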
So that's all I have for now.
Thanks,
Michael
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
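Vincent's suggestion above -- an ExperimentHub of path specifications from which basic data-availability statistics can be derived -- could start from something as simple as a long-format table of sample/assay/path triples. A minimal base-R sketch; all sample, assay, and file names here are invented for illustration:

```r
## Hypothetical sketch: an "ExperimentHub" reduced to its simplest form,
## a table of resource paths, one row per (sample, assay) pair actually
## on hand. Gaps (missing assays) are simply absent rows.
hub <- data.frame(
  sample = c("pt1", "pt1", "pt1", "pt2", "pt2", "pt3"),
  assay  = c("germline.vcf", "tumor.vcf", "rnaseq.bam",
             "germline.vcf", "rnaseq.bam", "chipseq.bam"),
  path   = c("pt1.germ.vcf.gz", "pt1.tumor.vcf.gz", "pt1.rna.bam",
             "pt2.germ.vcf.gz", "pt2.rna.bam", "pt3.chip.bam"),
  stringsAsFactors = FALSE
)

## Basic availability statistics: which samples have which assays?
availability <- table(hub$sample, hub$assay) > 0
availability
```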
On 09/20/2012 06:47 PM, Michael Lawrence wrote:
Thanks Vince, I think we're on the same page. I agree that a set of ranges-of-interest is not always appropriate, and that the most basic structure would be a table of samples x assays, with missing values. Ranges-of-interest can be layered on top when desired. There are many aspects of SummarizedExperiment that I would want to carry over, especially the idea of metadata on the samples, assays, and features (when applicable).

Michael

On Thu, Sep 20, 2012 at 9:30 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
I'll comment briefly because I think this is a strategically important topic and I have done a little bit on integration in various forms. My view of SummarizedExperiment is that it updates the eSet concept to promote range-based indexing of assay features. The 'assays' component is limited to matrix/array like things and my sense is that the
It might help to nail down a more precise 'API' for what can be in the assays slot, but I think it would definitely be array-like; no need for it to be an actual 'matrix', though.
"Summarized" implies that the intention is for a memory-tractable, serializable reduction of an experiment applied to all of a fixed set of samples. I felt that what Michael was describing departs significantly from these conditions/aims in various ways -- there are multiple assays, possibly at different stages of summarization, and one wants a coherent path to interaction with these, requiring less uniformity of structure. Entities to be covered are, roughly, a set of biological
A major task I think would be management of on-disk resources, guaranteeing in some way that the object is not tied to some fragile local disk structure. The heterogeneity of data types also seems like a significant departure.
samples, mostly assayed in the same ways, but the assays do not imply a common set of measurements on a fixed set of ranges. One possible term for the data structure described by Michael is "ExperimentHub".
a nice term. Martin
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
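Martin's point about fragile local disk structure suggests one concrete discipline: store resource paths relative to a single, user-settable root, so a serialized object survives relocation, and make it cheap to audit what is actually present. A minimal sketch; the function name is invented:

```r
## Hypothetical sketch: resources kept as paths relative to a movable
## root directory; resolution and an existence audit happen at access
## time rather than being baked into the serialized object.
resolveResources <- function(rel.paths, root = ".") {
  full <- file.path(root, rel.paths)
  data.frame(resource = rel.paths, path = full,
             exists = file.exists(full), stringsAsFactors = FALSE)
}

## After moving the data, only the root changes:
## resolveResources(c("pt1.rna.bam", "pt2.rna.bam"), root = "/new/mount/project")
```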
So here is my 2 cents. Perhaps a bit rambling, but it is probably better to weigh in now. There are two issues here: one has to do with the right representation of multiple classes of experiments, and one has to do with having the data on disk instead of in memory.

The last one first. Michael is right to note that this (having an on-disk representation) would be highly useful. Some class that has a pointer to a file, and perhaps a getData method which would pick out a region and return a SummarizedExperiment, would be great. This would need to support BAM, bigWig, and bigBed at least, and allow for each sample being in a different file. This is very much what we tried to do with Genominator, except that we used a special file format instead of being able to point to different file types.

Now for the first one. The use case I see is where you have a number of assays on the same individuals, but also a number of (different or the same) assays on other samples. Let us say, for example, that you have done RNA-seq on some people and you want to look at ENCODE ChIP-seq data in that region. I think of this as a _collection_ of SummarizedExperiments: a collection, because all assays in a SummarizedExperiment need to share the same ranges. And if you really have different assays, you may have copy number (where each range is likely to be long), sequencing-based expression, and ChIP-seq; they all have different types of structure. One solution to bring them all into a set of shared ranges is essentially to disjoin the ranges, but I don't like that: I think it will be important to store a single long copy-number change as a single range, not as a union of ranges. I also think it is important to allow different samples for different experiments (and in fact I think this will be the more common case -- say you want to contrast your data with other public data in the same region; this is unlikely to involve the same samples). And I don't think this should be done by having a lot of NAs in the matrices.
So I think we need something like a list of SummarizedExperiments, perhaps with a joint sampleData (how a joint sample table is mapped to multiple assays will need to be thought about). We might also have a joint GRanges which signifies "this is the region(s) we have data on", but we should still retain the individual ranges for each experiment. Something like:

dataRanges: a GRanges, essentially the union of the rowData of the SummarizedExperiments below, or perhaps bigger
copyNumber: a SummarizedExperiment, 3 samples; has assays "copyNumber" and perhaps "control"
TF binding: a SummarizedExperiment, 10 samples; has assays "TF1", .., "TF5" and "input"
sampleData: some kind of joint sample phenodata

Kasper
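A minimal sketch of the container Kasper outlines, assuming GenomicRanges is loaded (where the SummarizedExperiment class lived at the time); the class name and slot layout are hypothetical, not an existing API:

```r
## Hypothetical sketch of the proposed collection: each contained
## SummarizedExperiment keeps its own native ranges, while sample
## metadata and an overall range summary are shared at the top level.
library(GenomicRanges)

setClass("ExperimentCollection", representation(
  dataRanges  = "GRanges",  # roughly the union of the per-experiment rowData
  experiments = "list",     # named list of SummarizedExperiment objects,
                            # e.g. copyNumber, TF binding
  sampleData  = "DataFrame" # joint phenodata; rows map samples into experiments
))

## Accessing one experiment leaves its native ranges intact, e.g.:
## rowData(collection@experiments$copyNumber)
```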
On Fri, Sep 21, 2012 at 9:50 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote: