[Bioc-devel] SummarizedExperiment: potential for data integration and meta-analysis?
7 messages · Vincent Carey, Martin Morgan, Kasper Daniel Hansen +2 more
I'll comment briefly because I think this is a strategically important topic, and I have done a little work on integration in various forms. My view of SummarizedExperiment is that it updates the eSet concept to promote range-based indexing of assay features. The 'assays' component is limited to matrix/array-like things, and my sense is that the "Summarized" implies the intention is a memory-tractable, serializable reduction of an experiment applied to all of a fixed set of samples.

What Michael describes departs significantly from these conditions/aims in various ways -- there are multiple assays, possibly at different stages of summarization, and one wants a coherent path to interacting with them, requiring less uniformity of structure. The entities to be covered are, roughly, a set of biological samples, mostly assayed in the same ways, but the assays do not imply a common set of measurements on a fixed set of ranges.

One possible term for the data structure Michael describes is "ExperimentHub". This would include references to various external data resources, and it would have methods for traversing the resources for certain objectives. Instead of nesting the SummarizedExperiment structures, we could think of certain traversals culminating in SummarizedExperiment instances. I think this would lead to high-level workflow prescriptions that could be broadly applicable -- say you have VCFs and BAMs on a collection of samples, with some gaps. Start with an ExperimentHub consisting of path specifications, and from this you could derive some basic statistics on data availability. You'd want a little more detail early on about the biology from which the files arose, to help organize the high-level description. For example, I assume you might have separate VCFs on germ-line and tumor DNA, BAMs from RNA-seq applied to different cell types, and some from ChIP-seq ... some samples have all of these assays, some have only a few, and spelling all this out at an early stage would be very useful.

On Thu, Sep 20, 2012 at 9:18 AM, Michael Lawrence
<lawrence.michael at gene.com> wrote:
Dear all,

Here is a problem that has been bouncing around in my head, and I thought it might be time for some discussion. Maybe others have already figured this out.

We are often interested in the same genomic regions over multiple datasets and multiple samples. Typically, the data are the output of a large analysis pipeline. On the surface, SummarizedExperiment is very close to the right data structure, but there are some issues. Often, these data will be too large to load completely into memory, so we need objects that point to out-of-memory storage. This would need to be matrix-like, like a BamViews object, but there would be redundancy between the ranges in the BamViews and the ranges in the rowData. Thus, the BamViews could be created from a BamFileList dynamically when the user retrieves an assay, or there would need to be consistency checking to make sure the same ranges are being described (which would be a performance drain).

Another issue is that certain samples may be included only in certain assays. In the simple matrix case, we could handle this with NA values. The out-of-memory references will need to support a similar semantic. So far, we have not allowed NA in the List classes, but I think we might have to move in that direction. In some ways, we are stretching the definition of SE here, because we might have multiple experiments, not just one. Perhaps we are no longer talking about a summary but are focusing more on integration, i.e., we are talking about an IntegratedExperiment. But I think SummarizedExperiment could be coerced into this role.

Let's get this started with a use case; here is one related to variant calling. Assume we have some output from a sequence analysis pipeline, including alignments, coverage, and variant calls. We want to validate exome variants in RNA, but only where genes are expressed (high coverage in RNA). Now assume that an SE has been constructed for the exome variant positions and all of the samples.
The assays are the exome calls (VCF), the RNA calls (VCF), and the RNA coverage (BigWig). The algorithm needs to extract the variant information as GRanges, and the coverage information as an Rle. First, we extract the exome variants:
> exome.variants <- assay(se, "exome.variants")
What would exome.variants be? In oncology at least, it is far more efficient to output a VCF per sample and then merge them at analysis time. Let us assume that there is one VCF file per sample and that, internally, there is a VcfFileList (I think Vince has shown something like this). The exome.variants object needs to carry along the positions from the SE rowData. The minimal conversion would be to something like a VcfViews object (as in BamViews). The VcfViews object should try to provide the same API as VCF, where it makes sense. There are obvious issues, like: would the column indexing be by sample or by file? Conceptually at least, the VcfViews is going to be very similar to a merge of multiple VCF files into a VCF object. Would the return value really be a VcfViews, or could it be coerced directly to a VCF? The coercion may be complicated, so it may be best to leave that as a second step, after pulling out the assay.

Alternatively, if there is a single VCF file, the data could be stored as a VCF, since it is matrix-like, after all. So SEs could be nested. This would obviously be most efficient space-wise if the VCF class were implemented on top of a tabix-indexed VCF, with on-demand materialization. But maybe it is simpler to just use a length-one VcfFileList/VcfViews for this? (As an aside, it would be nice if there were some general abstraction for variant data, whether stored in VCF, GVF, or some other format/database.)

Then for the coverage:
> rna.coverage <- assay(se, "rna.coverage")
Following the conventions above, rna.coverage would be a BigWigViews, which might have an API like viewSums, viewMaxs, etc., for getting back a matrix of coverage summaries, possibly as a SummarizedExperiment?
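The access pattern Michael is sketching might look like the following. This is a schematic sketch only: VcfViews, BigWigViews, and the methods shown on them are hypothetical names from this discussion, not an existing API.

```r
## Schematic sketch of the proposed API; class and method names are
## hypothetical. Assumes `se` holds file-backed assays whose ranges come
## from the enclosing SE's rowData.
exome.variants <- assay(se, "exome.variants")  # a VcfViews: VcfFileList + rowData ranges
rna.coverage   <- assay(se, "rna.coverage")    # a BigWigViews over per-sample BigWig files

## Summaries materialize as an ordinary ranges-by-samples matrix:
cov <- viewMeans(rna.coverage)                 # cf. viewMeans() on RleViews in IRanges
expressed <- rowMeans(cov) > 10                # positions with high RNA coverage

## Validate exome calls only at expressed positions:
validatable <- exome.variants[expressed, ]
```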
So that's all I have for now.
Thanks,
Michael
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
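Vincent's suggestion above -- an ExperimentHub of path specifications from which basic data-availability statistics can be derived -- could start from something as simple as a long-format table of sample/assay/path triples. A minimal base-R sketch; all sample, assay, and file names here are invented for illustration:

```r
## Hypothetical sketch: an "ExperimentHub" reduced to its simplest form,
## a table of resource paths, one row per (sample, assay) pair actually
## on hand. Gaps (missing assays) are simply absent rows.
hub <- data.frame(
  sample = c("pt1", "pt1", "pt1", "pt2", "pt2", "pt3"),
  assay  = c("germline.vcf", "tumor.vcf", "rnaseq.bam",
             "germline.vcf", "rnaseq.bam", "chipseq.bam"),
  path   = c("pt1.germ.vcf.gz", "pt1.tumor.vcf.gz", "pt1.rna.bam",
             "pt2.germ.vcf.gz", "pt2.rna.bam", "pt3.chip.bam"),
  stringsAsFactors = FALSE
)

## Basic availability statistics: which samples have which assays?
availability <- table(hub$sample, hub$assay) > 0
availability
```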
On 09/20/2012 06:47 PM, Michael Lawrence wrote:
Thanks Vince, I think we're on the same page. I agree that a set of ranges-of-interest is not always appropriate, and that the most basic structure would be a table of samples x assays, with missing values. Ranges-of-interest can be layered on top when desired. There are many aspects of SummarizedExperiment that I would want to carry over, especially the idea of metadata on the samples, assays, and features (when applicable).

Michael

On Thu, Sep 20, 2012 at 9:30 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
I'll comment briefly because I think this is a strategically important topic and I have done a little bit on integration in various forms. My view of SummarizedExperiment is that it updates the eSet concept to promote range-based indexing of assay features. The 'assays' component is limited to matrix/array like things and my sense is that the
It might help to nail down a more precise 'API' for what can be in the assays slot, but I think it would definitely be array-like; no need for it to be an actual 'matrix', though.
"Summarized" implies that the intention is for a memory-tractable, serializable reduction of an experiment applied to all of a fixed set of samples. I felt that what Michael was describing departs significantly from these conditions/aims in various ways -- there are multiple assays, possibly at different stages of summarization, and one wants a coherent path to interaction with these, requiring less uniformity of structure. Entities to be covered are, roughly, a set of biological
A major task I think would be management of on-disk resources, guaranteeing in some way that the object is not tied to some fragile local disk structure. The heterogeneity of data types also seems like a significant departure.
samples, mostly assayed in the same ways, but the assays do not imply a common set of measurements on a fixed set of ranges. One possible term for the data structure described by Michael is "ExperimentHub".
a nice term. Martin
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
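Martin's point about fragile local disk structure suggests one concrete discipline: store resource paths relative to a single, user-settable root, so a serialized object survives relocation, and make it cheap to audit what is actually present. A minimal sketch; the function name is invented:

```r
## Hypothetical sketch: resources kept as paths relative to a movable
## root directory; resolution and an existence audit happen at access
## time rather than being baked into the serialized object.
resolveResources <- function(rel.paths, root = ".") {
  full <- file.path(root, rel.paths)
  data.frame(resource = rel.paths, path = full,
             exists = file.exists(full), stringsAsFactors = FALSE)
}

## After moving the data, only the root changes:
## resolveResources(c("pt1.rna.bam", "pt2.rna.bam"), root = "/new/mount/project")
```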
So here is my 2 cents. Perhaps a bit rambling, but it is probably better to weigh in now. There are two issues here: one has to do with the right representation of multiple classes of experiments, and one has to do with having the data on disk instead of in memory.

The last one first. Michael is right to note that this (having an on-disk representation) would be highly useful. Some class that has a pointer to a file, and perhaps a getData method which would pick out a region and return a SummarizedExperiment, would be great. This would need to support BAM, bigWig, and bigBed at least, and allow for each sample being in a different file. This is very much what we tried to do with Genominator, except that we used a special file format instead of being able to point to different file types.

Now for the first one. The use case I see is where you have a number of assays on the same individuals, but also a number of (different or the same) assays on other samples. Let us say, for example, that you have done RNA-seq on some people and you want to look at ENCODE ChIP-seq data in that region. I think of this as a _collection_ of SummarizedExperiments: a collection, because all assays in a SummarizedExperiment need to share the same ranges. And if you really have different assays, you may have copy number (where each range is likely to be long), sequencing-based expression, and ChIP-seq; they all have different types of structure. One solution to bring them all into a set of shared ranges is essentially to disjoin the ranges, but I don't like that: I think it will be important to store a single long copy-number change as a single range, not as a union of ranges. I also think it is important to allow different samples for different experiments (and in fact I think this will be the more common case -- say you want to contrast your data with other public data in the same region; this is unlikely to involve the same samples). And I don't think this should be done by having a lot of NAs in the matrices.
So I think we need something like a list of SummarizedExperiments, perhaps with a joint sampleData (how a joint sample table is mapped to multiple assays will need to be thought about). We might also have a joint GRanges which signifies "this is the region(s) we have data on", but we should still retain the individual ranges for each experiment. Something like:

dataRanges: a GRanges, essentially the union of the rowData of the SummarizedExperiments below, or perhaps bigger
copyNumber: a SummarizedExperiment, 3 samples; has assays "copyNumber" and perhaps "control"
TF binding: a SummarizedExperiment, 10 samples; has assays "TF1", .., "TF5" and "input"
sampleData: some kind of joint sample phenodata

Kasper
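A minimal sketch of the container Kasper outlines, assuming GenomicRanges is loaded (where the SummarizedExperiment class lived at the time); the class name and slot layout are hypothetical, not an existing API:

```r
## Hypothetical sketch of the proposed collection: each contained
## SummarizedExperiment keeps its own native ranges, while sample
## metadata and an overall range summary are shared at the top level.
library(GenomicRanges)

setClass("ExperimentCollection", representation(
  dataRanges  = "GRanges",  # roughly the union of the per-experiment rowData
  experiments = "list",     # named list of SummarizedExperiment objects,
                            # e.g. copyNumber, TF binding
  sampleData  = "DataFrame" # joint phenodata; rows map samples into experiments
))

## Accessing one experiment leaves its native ranges intact, e.g.:
## rowData(collection@experiments$copyNumber)
```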
On Fri, Sep 21, 2012 at 9:50 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote: