Motivated by the discussion thread from November (https://stat.ethz.ch/pipermail/bioc-devel/2014-November/006686.html), the Bioconductor core team is planning changes to the SummarizedExperiment class. Our end goal is to make the @rowData slot more flexible, so that it can hold either a DataFrame or a GRanges-type object. To this end we have deprecated the current rowData accessor in favor of a rowRanges accessor. This change has resulted in a few broken builds in devel, which we are in the process of fixing now. We will contact package authors directly if needed for this migration. The rowData accessor will be deprecated in this release; eventually, however, the plan is to re-purpose this function to serve as an accessor for DataFrame data on the rows. Please let us know if you have any questions about the above or need any assistance with the transition.
[Bioc-devel] Changes to the SummarizedExperiment Class
16 messages · Jim Hester, Gabriel Becker, Michael Lawrence +7 more
Jim et al., Why have two accessors (rowRanges, rowData), each of which is less flexible than the underlying structure and thus will fail (return NULL? or GRanges()/DataFrame()?) in some proportion of valid objects? ~G
On Tue, Mar 3, 2015 at 2:37 PM, Jim Hester <james.f.hester at gmail.com> wrote:
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Gabriel Becker, Ph.D Computational Biologist Genentech Research
Seems like rowData could be made to work universally through coercion. rowRanges would not, however, and one would like a convenient mechanism to condition on whether range information is available. One way is to introduce a new class and rely on dispatch. But that adds complexity.
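The coercion idea can be sketched in a few lines of base R (methods package only). This is an illustration under stated assumptions, not the GenomicRanges implementation: ToyRanges is a hypothetical stand-in for GRanges, a plain data.frame stands in for DataFrame, and ToySE, rowDataAsFrame and hasRanges are made-up names.

```r
library(methods)

# Hypothetical stand-in for GRanges: coordinates plus per-range metadata.
setClass("ToyRanges", representation(
  seqnames = "character", start = "integer", end = "integer",
  mcols = "data.frame"))

# Toy container whose row annotation slot accepts any object.
setClass("ToySE", representation(rowData = "ANY"))

# Universal accessor: always returns a data.frame, coercing if needed.
setGeneric("rowDataAsFrame", function(x) standardGeneric("rowDataAsFrame"))
setMethod("rowDataAsFrame", "ToySE", function(x) {
  ann <- x@rowData
  if (is(ann, "ToyRanges")) {
    cbind(data.frame(seqnames = ann@seqnames, start = ann@start,
                     end = ann@end), ann@mcols)
  } else {
    as.data.frame(ann)
  }
})

# Range information is only sometimes present, so callers have to check.
hasRanges <- function(x) is(x@rowData, "ToyRanges")

gr <- new("ToyRanges", seqnames = c("chr1", "chr2"), start = c(1L, 50L),
          end = c(10L, 80L), mcols = data.frame(gene = c("a", "b")))
se_ranges <- new("ToySE", rowData = gr)
se_plain  <- new("ToySE", rowData = data.frame(gene = c("a", "b")))

names(rowDataAsFrame(se_ranges))  # seqnames, start, end, gene
names(rowDataAsFrame(se_plain))   # gene
```

The frame accessor works on both objects, while anything range-based first needs a test such as hasRanges(); that explicit test is what dispatch on a dedicated class would replace.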
On Tue, Mar 3, 2015 at 2:44 PM, Gabe Becker <becker.gabe at gene.com> wrote:
I'd like to see a basic class that takes a DataFrame and a sub-class that takes a GRanges. I still think GRanges should be a subclass of DataFrame, which would make this easy, but I don't seem to be winning that argument.
While the hood is up, can we try some different names? SummarizedExperiment never seemed like a great fit to me because it doesn't necessarily contain experiments or summaries thereof. It's a collection of like-sized rectangular things with metadata on the two dimensions. Maybe the name could reflect what it holds rather than a common use case? AnnotatedMatrixList?
Anyway, I'm excited to see a version on the way that takes a DataFrame as rowData. I'm glad you guys are working on that.
Regards, Pete
____________________
Peter M. Haverty, Ph.D. Genentech, Inc. phaverty at gene.com
On Tue, Mar 3, 2015 at 2:57 PM, Michael Lawrence <lawrence.michael at gene.com> wrote:
On 03/03/2015 03:06 PM, Peter Haverty wrote:
I'd like to see a basic class that takes a DataFrame and a sub-class that takes a GRanges.
Yes.
I still think GRanges should be a subclass of DataFrame, which would make this easy, but I don't seem to be winning that argument.
Just impossible. As Michael mentioned back in November, they have conflicting APIs.
While the hood is up, can we try some different names? SummarizedExperiment never seemed like a great fit to me because it doesn't necessarily contain experiments or summaries thereof. It's a collection of like-sized rectangular things with metadata on the two dimensions. Maybe the name could reflect what it holds rather than a common use case? AnnotatedMatrixList?
We actually need 2 names: 1 for the parent class, 1 for the child. I'm starting to think that introducing 2 new names would maybe make the migration a little bit easier, especially since the plan is to move the "refactored SummarizedExperiment" to its own package. With 2 new names we can start the new package, implement the 2 new classes in it, and have the old SummarizedExperiment (in GenomicRanges) and the 2 new classes peacefully cohabit during the time of the migration. Cheers, H.
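A rough sketch of the parent/child layout, using only the methods package. The class and accessor names (AnnotatedMatrices, RangedAnnotatedMatrices, rowAnnotations, rowRangesOf) are placeholders invented for this example, and a data.frame stands in for both DataFrame and GRanges.

```r
library(methods)

# Parent: a list of same-sized matrices plus per-row annotations.
setClass("AnnotatedMatrices", representation(
  assays = "list", rowData = "data.frame"))

# Child: additionally carries one range per row
# (a data.frame with seqnames/start/end stands in for GRanges).
setClass("RangedAnnotatedMatrices",
         representation(rowRanges = "data.frame"),
         contains = "AnnotatedMatrices")

setGeneric("rowAnnotations", function(x) standardGeneric("rowAnnotations"))
setGeneric("rowRangesOf", function(x) standardGeneric("rowRangesOf"))

# Defined on the parent, so the child inherits it.
setMethod("rowAnnotations", "AnnotatedMatrices", function(x) x@rowData)
# Defined on the child only: calling it on a parent is a dispatch error.
setMethod("rowRangesOf", "RangedAnnotatedMatrices", function(x) x@rowRanges)

m <- matrix(1:4, nrow = 2)
genes <- data.frame(gene = c("a", "b"))
plain <- new("AnnotatedMatrices", assays = list(counts = m), rowData = genes)
ranged <- new("RangedAnnotatedMatrices", assays = list(counts = m),
              rowData = genes,
              rowRanges = data.frame(seqnames = c("chr1", "chr2"),
                                     start = c(1L, 50L), end = c(10L, 80L)))

rowAnnotations(ranged)                 # works through inheritance
is(ranged, "AnnotatedMatrices")        # TRUE
res <- tryCatch(rowRangesOf(plain), error = function(e) e)
inherits(res, "error")                 # TRUE: no ranges on the parent
```

Because the two toy classes have fresh names, nothing here collides with an existing SummarizedExperiment definition, which is the cohabitation property described above.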
Anyway, I'm excited to see a version on the way that takes a DataFrame as rowData. I'm glad you guys are working on that. Regards, Pete
Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319
I still think GRanges should be a subclass of DataFrame, which would make this easy, but I don't seem to be winning that argument.
Just impossible. As Michael mentioned back in November, they have conflicting APIs.
Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges (without mcols) as an index?
This. It would be damned near perfect as a return value for assays coming out of an object that held several such assays at several time points in a population, where there are both assay-wise and covariate-wise "holes" that could nonetheless be usefully imputed across assays.
Statistics is the grammar of science. Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com> wrote:
There are some nice similarities in these new imaginary types. A "GRangesFrame" is a list of dimensionally identical things (columns) and some row meta-data (the GRanges). The SE-like object is similarly a list of dimensionally like things (matrices, RleDataFrames, BigMatrix objects, HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame). Elegant? Maybe they would actually be relatives in the class tree.
I wonder if this kind of thing would be easier if we had Java-style Interfaces or duck-typing. The "x" slot of "y" holds something that implements this set of methods ...
Oh, and kinda apropos, the genoset class will probably go away or become an extension to this new SE-like thing. The extra stuff that comes along with genoset will still be available.
Pete
____________________
Peter M. Haverty, Ph.D. Genentech, Inc. phaverty at gene.com
On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
It should be possible for the annotations to be of any type, as long as they satisfy a simple contract of NROW() and 2D "[". Then you could have a DataFrame, GRanges, or whatever in there. But it would be nice to have a special class for the container with range information. The contract for the range annotation would be to have a granges() method.
I agree it would be nice if there were a way with the methods package to easily assert such contracts. For example, one could define an interface with a set of generics (and optionally the relevant position in the generic signature). Then, once all of the methods have been assigned for a particular class, it is made to inherit from that contract class. There are lots of gotchas though. Not sure how useful it would be in practice.
On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com> wrote:
GRangesFrame is an interesting idea and I gave it some thought.
There is this nice symmetry between GRanges and GRangesFrame:
- GRanges = a naked GRanges + a DataFrame accessible via mcols()
- GRangesFrame = a DataFrame + a naked GRanges accessible via
some accessor (e.g. rowRanges())
So GRanges and GRangesFrame are equivalent in terms of what they
can hold, but different in terms of API: the former has the ranges
API as primary API and the DataFrame API on its mcols() component,
and the latter has the DataFrame API as primary API and the ranges
API on its rowRanges() component. Nice switch!
What does this API switch bring us? A GRangesFrame object is now
an object that fully behaves like a DataFrame and people can also
perform range-based operations on its rowRanges() component.
Here is what I'm afraid is going to happen: people will also want
to be able to perform range-based operations *directly* on
these objects, i.e. without having to call rowRanges() first.
So for example when they do subsetByOverlaps(), subsetting
happens vertically. Also the Hits object returned by findOverlaps()
would contain row indices. Problem with this is that these objects
now start to suffer from the "dual personality syndrome". For
example, it's not clear anymore what their length should be.
Strictly speaking it should be their number of columns (that's
what the length of a DataFrame is), but the ranges API that
we're trying to put on them also makes them feel like vectors
along the vertical dimension so it also feels that their length
should be their number of rows. Same thing with 1D subsetting.
Why does it subset the columns and not the rows? Most people
are now confused.
It's interesting to note that the same thing happens with GRanges
objects, but in the opposite direction: people wish they could
do DataFrame operations directly on them without calling mcols()
first. But in order to preserve the good health of GRanges objects,
we've not done that (except for $, a shortcut for mcols(x)$,
the pressure was just too strong).
H.
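The length and 1D-subsetting ambiguity can be seen with base R alone. In this small illustration a plain data.frame stands in for DataFrame (both count columns as their elements) and an integer vector of start positions stands in for a vector-like ranges object; neither is the real Bioconductor class.

```r
# A data.frame stands in for DataFrame: its elements are its columns.
df <- data.frame(start = c(1L, 50L, 90L, 120L),
                 end   = c(10L, 80L, 95L, 150L))

length(df)      # 2: the number of columns, not the 4 rows
names(df[2])    # "end": 1D subsetting selects a column
nrow(df[2, ])   # 1: selecting a row needs the 2D form

# A vector-like object behaves the other way round: its length is the
# number of ranges, and 1D subsetting picks out a range.
starts <- df$start
length(starts)  # 4
starts[2]       # 50
```

An object that exposed both APIs directly would have to pick one of these two answers for length() and for x[2], which is the confusion described above.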
On 03/03/2015 04:35 PM, Michael Lawrence wrote:
Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319
May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can return whatever makes sense (GRanges, or other data structures -thinking taxonomy for metagenomics for example-). GRangesFrame can inherit from this.
On Wed, Mar 4, 2015 at 3:28 AM, Hervé Pagès <hpages at fredhutch.org> wrote:
I am a bit concerned about any major alterations to the SummarizedExperiment API. We have two papers and plenty of working code that use it in meaningful ways. The effort required to keep new formulations back-compatible as well as bug-free has to be weighed seriously.
I agree that the name is not ideal. We are learning as we go.
It seems to make sense to start with the contracts we want the instances of a class to satisfy. I have long felt that the X[i, j] idiom is one that users and developers should be comfortable with, even insist on, and for consistency with the matrix operations idiom, it should work in a natural way for numeric indexing. This seems like an important constraint. subsetBy* is a useful idiom, but it is conceivable that we would adopt filter() for row-oriented selections and select() for column-oriented selections.
Do we have to make any special design considerations to allow very smooth interoperation with out-of-memory resources for certain components, for developers who want to allow this?
We should have a reasonable way to get data on what is out there, what is used, and how it is most effectively used. What's the SE API? Is it well-adapted to the requirements of DESeq2? Other killer packages that use/don't use it? Even getting data on the formal API for a class is not all that familiar. And if folks are writing non-S4 interfaces (i.e., naked functions) we have no way of identifying them. See below for one way of discovering the API for SummarizedExperiment.
In summary, I think we have to be careful about overdesigning too early. Getting clear on contracts seems the best way to ensure reuse, and we really want that so that reliability is continually assessed. My sense is that it is good to give developers something they'll gladly extend, not necessarily reuse directly. So we don't have to have broad consensus on class details, but on the minimal abstraction and on obligatory tests on its basic implementation.
methods(class="SummarizedExperiment") # perhaps an obsolete version of methods cataloguer by MTM
DataFrame with 76 rows and 3 columns
    generic       signature                                                                 package
    <character>   <character>                                                               <character>
1   [             x="SummarizedExperiment", i="ANY", j="ANY", drop="ANY"                    base
2   [             x="SummarizedExperiment", i="ANY", j="missing", value="ANY"               base
3   [             x="SummarizedExperiment", i="ANY", j="missing"                            base
4   [<-           x="SummarizedExperiment", i="ANY", j="ANY", value="SummarizedExperiment"  base
5   assay         x="SummarizedExperiment", i="character"                                   GenomicRanges
... ...           ...                                                                       ...
72  updateObject  object="SummarizedExperiment"                                             BiocGenerics
73  values        x="SummarizedExperiment"                                                  S4Vectors
74  values<-      x="SummarizedExperiment"                                                  S4Vectors
75  width         x="SummarizedExperiment"                                                  BiocGenerics
76  width<-       x="SummarizedExperiment"                                                  BiocGenerics
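In the same spirit, a contract can be written down as a set of generic names and checked against a class. The following is only a sketch under assumptions: ToyAnn, rangesOf and checkContract are names made up for this example, existsMethod() only sees methods defined directly on the class (inherited ones are not counted), and dim() is used for the row-count part of the contract because base NROW() is computed from dim().

```r
library(methods)

# Toy annotation class: a data.frame wrapped in an S4 object.
setClass("ToyAnn", representation(df = "data.frame"))
setMethod("dim", "ToyAnn", function(x) dim(x@df))
setMethod("[", "ToyAnn", function(x, i, j, ..., drop = FALSE) {
  if (missing(i)) i <- seq_len(nrow(x@df))
  if (missing(j)) j <- seq_len(ncol(x@df))
  new("ToyAnn", df = x@df[i, j, drop = FALSE])
})

# Hypothetical generic for the range part of the contract; ToyAnn
# deliberately gets no method for it.
setGeneric("rangesOf", function(x) standardGeneric("rangesOf"))

# For each generic in the contract, is a method defined directly for `cls`?
checkContract <- function(cls, generics) {
  vapply(generics, function(g) existsMethod(g, cls), logical(1))
}

basic  <- checkContract("ToyAnn", c("dim", "["))
ranged <- checkContract("ToyAnn", c("dim", "[", "rangesOf"))

a <- new("ToyAnn", df = data.frame(gene = c("a", "b", "c"), score = 1:3))
NROW(a)         # 3, computed from the dim() method
dim(a[2:3, ])   # 2 2
```

Here all(basic) is TRUE while ranged flags the missing rangesOf method, so ToyAnn would qualify as a plain row annotation but not as a range-aware one.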
On Wed, Mar 4, 2015 at 8:32 AM, Hector Corrada Bravo <hcorrada at gmail.com> wrote:
May I advocate for 'IndexedDataFrame' or 'IndexedFrame'? 'rowIndices' can return whatever makes sense (GRanges, or other data structures -thinking taxonomy for metagenomics for example-). GRangesFrame can inherit from this. On Wed, Mar 4, 2015 at 3:28 AM, Herv? Pag?s <hpages at fredhutch.org> wrote:
GRangesFrame is an interesting idea and I gave it some thought.
There is this nice symmetry between GRanges and GRangesFrame:
- GRanges = a naked GRanges + a DataFrame accessible via mcols()
- GRangesFrame = a DataFrame + a naked GRanges accessible via
some accessor (e.g. rowRanges())
So GRanges and GRangesFrame are equivalent in terms of what they
can hold, but different in terms of API: the former has the ranges
API as primary API and the DataFrame API on its mcols() component,
and the latter has the DataFrame API as primary API and the ranges
API on its rowRanges() component. Nice switch!
What does this API switch bring us? A GRangesFrame object is now
an object that fully behaves like a DataFrame and people can also
perform range-based operations on its rowRanges() component.
Here is what I'm afraid is going to happen: people will also want
to be able to perform range-based operations *directly* on
these objects, i.e. without having to call rowRanges() first.
So for example when they do subsetByOverlaps(), subsetting
happens vertically. Also the Hits object returned by findOverlaps()
would contain row indices. Problem with this is that these objects
now start to suffer from the "dual personality syndrome". For
example, it's not clear anymore what their length should be.
Strictly speaking it should be their number of columns (that's
what the length of a DataFrame is), but the ranges API that
we're trying to put on them also makes them feel like vectors
along the vertical dimension so it also feels that their length
should be their number of rows. Same thing with 1D subsetting.
Why does it subset the columns and not the rows? Most people
are now confused.
It's interesting to note that the same thing happens with GRanges
objects, but in the opposite direction: people wish they could
do DataFrame operations directly on them without calling mcols()
first. But in order to preserve the good health of GRanges objects,
we've not done that (except for $, a shortcut for mcols(x)$,
the pressure was just too strong).
H.
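The length ambiguity described in this message is easy to reproduce with a base R data.frame, which follows the same list-of-columns convention the message attributes to DataFrame. A small sketch, using base R only (the object names are illustrative):

```r
df <- data.frame(score = c(0.1, 0.5, 0.9, 0.3),
                 label = c("a", "b", "c", "d"))

n_cols <- length(df)  # counts columns: a data.frame is a list of columns
n_rows <- NROW(df)    # counts rows

one_d <- df[1]        # 1D subsetting selects the first column
two_d <- df[1, ]      # 2D subsetting selects the first row
```

An object that also behaved like a vector of ranges would be expected to report its number of rows from length() and to subset rows with x[1], which is where the two conventions collide.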
On 03/03/2015 04:35 PM, Michael Lawrence wrote:
Should be possible for the annotations to be of any type, as long as they satisfy a simple contract of NROW() and 2D "[". Then, you could have a DataFrame, GRanges, or whatever in there. But it would be nice to have a special class for the container with range information. The contract for the range annotation would be to have a granges() method.
I agree it would be nice if there was a way with the methods package to easily assert such contracts. For example, one could define an interface with a set of generics (and optionally the relevant position in the generic signature). Then, once all of the methods have been assigned for a particular class, it is made to inherit from that contract class. There are lots of gotchas though. Not sure how useful it would be in practice.
On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com> wrote:
There are some nice similarities in these new imaginary types. A "GRangesFrame" is a list of dimensionally identical things (columns) and some row meta-data (the GRanges). The SE-like object is similarly a list of dimensionally like things (matrices, RleDataFrames, BigMatrix objects, HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame).
Elegant? Maybe they would actually be relatives in the class tree.
I wonder if this kind of thing would be easier if we had Java-style Interfaces or duck-typing. The "x" slot of "y" holds something that implements this set of methods ...
Oh, and kinda apropos, the genoset class will probably go away or become an extension to this new SE-like thing. The extra stuff that comes along with genoset will still be available.
Pete
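The duck-typed contract suggested above (NROW() plus 2D "[") can be tried out without any new class machinery. The sketch below is an assumption about how such a check might look, not an existing facility of the methods package; satisfies_row_contract() is a hypothetical helper, and the separate granges() requirement for range annotations is not covered.

```r
# Duck-typed check of the proposed row-annotation contract: the object must
# answer NROW() and support 2D "[" on its rows.
satisfies_row_contract <- function(x) {
  n    <- NROW(x)
  keep <- seq_len(min(n, 1L))  # first row, or no rows if the object is empty
  tryCatch({
    sub <- x[keep, , drop = FALSE]
    NROW(sub) == length(keep)
  }, error = function(e) FALSE)
}

df_ok  <- satisfies_row_contract(data.frame(a = 1:3, b = letters[1:3]))
mat_ok <- satisfies_row_contract(matrix(1:6, nrow = 3))
vec_ok <- satisfies_row_contract(1:3)  # plain vectors have no 2D "["
```

A container could run a check like this in its validity method, so that any object passing it would be acceptable as row annotation regardless of its class.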
____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
This.
It would be damned near perfect as a return value for assays coming out of an object that held several such assays at several time points in a population, where there are both assay-wise and covariate-wise "holes" that could nonetheless be usefully imputed across assays.
Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com> wrote:
I still think GRanges should be a subclass of DataFrame, which would make this easy, but I don't seem to be winning that argument.
Just impossible. As Michael mentioned back in November, they have conflicting APIs.
Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges (without mcols) as an index?
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
I think we need to make sure that there are enough benefits of something like GRangesFrame before we introduce yet another complicated and overlapping data structure into the framework. Prior to summarization, the ranges seem primary, after summarization, it may often make sense for them to be secondary. But I'm just not sure what we gain from a new data structure.
Michael has a good point. The complexity of the BioC universe of classes hurts our ability to attract new users. More classes would be a minus there ... but a small set of common, explicit APIs would simplify things. Rectangular things implement the matrix Interface. :-) Deprecating old stuff, like eSet, might help more than it hurts, on the simplicity front.
P.S. apropos of understanding this universe of classes, I *love* the methods(class=x) thing Vincent mentioned.
Pete
____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
On 03/04/2015 10:03 AM, Peter Haverty wrote:
Michael has a good point. The complexity of the BioC universe of classes hurts our ability to attract new users. More classes would be a minus there ... but a small set of common, explicit APIs would simplify things. Rectangular things implement the matrix Interface. :-) Deprecating old stuff, like eSet, might help more than it hurts, on the simplicity front. P.S. apropos of understanding this universe of classes, I *love* the methods(class=x) thing Vincent mentioned.
The current version, under R-devel, is at
devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")
> methods(class="SummarizedExperiment")
[1] [ [[ [[<- [<-
[5] $ $<- assay assay<-
[9] assayNames assayNames<- assays assays<-
[13] cbind coerce colData colData<-
[17] compare Compare countOverlaps coverage
[21] dim dimnames dimnames<- disjointBins
[25] distance distanceToNearest duplicated elementMetadata
[29] elementMetadata<- end end<- exptData
[33] exptData<- extractROWS findOverlaps flank
[37] follow granges isDisjoint mcols
[41] mcols<- narrow nearest order
[45] overlapsAny precede ranges ranges<-
[49] rank rbind replaceROWS resize
[53] restrict rowData rowData<- seqinfo
[57] seqinfo<- seqnames shift show
[61] sort split start start<-
[65] strand strand<- subset subsetByOverlaps
[69] updateObject values values<- width
[73] width<-
see ?"methods" for accessing help and source code
and
> head(attr(methods(class="SummarizedExperiment"), "info"))
generic visible
[,SummarizedExperiment,ANY-method [ TRUE
[[,SummarizedExperiment,ANY,missing-method [[ TRUE
[[<-,SummarizedExperiment,ANY,missing-method [[<- TRUE
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method [<- TRUE
$,SummarizedExperiment-method $ TRUE
$<-,SummarizedExperiment-method $<- TRUE
isS4 from
[,SummarizedExperiment,ANY-method TRUE GenomicRanges
[[,SummarizedExperiment,ANY,missing-method TRUE GenomicRanges
[[<-,SummarizedExperiment,ANY,missing-method TRUE GenomicRanges
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
$,SummarizedExperiment-method TRUE GenomicRanges
$<-,SummarizedExperiment-method TRUE GenomicRanges
Martin
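The same kind of listing can be explored programmatically through the "info" attribute shown above. The sketch below assumes an R version in which methods() carries that attribute (the message refers to R-devel at the time), and it uses the base class "data.frame" so that it runs without Bioconductor installed; the variable names are illustrative.

```r
# methods() on a class returns an object whose "info" attribute is a
# data.frame with one row per method.
m    <- methods(class = "data.frame")
info <- attr(m, "info")

# Methods belonging to one generic, and a count of methods per generic.
print_methods <- rownames(info)[info$generic == "print"]
n_by_generic  <- table(info$generic)

# Split the API into S4 and non-S4 methods.
n_s4 <- sum(info$isS4)
```

With GenomicRanges loaded, the same calls with class = "SummarizedExperiment" would summarise the listing shown above, for example by tabulating the from column to see which packages contribute methods.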
Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
1 day later
hi all, just a practical issue: I have GenomicRanges version 1.19.42 on my computer which does not have rowRanges defined, although the 1.19.42 version on the Bioc website does have rowRanges in the man page: http://master.bioconductor.org/packages/3.1/bioc/html/GenomicRanges.html So I pass check locally but not in the devel branch on Bioc servers.
> library(GenomicRanges)
> rowRanges
Error: object 'rowRanges' not found
> sessionInfo()
R Under development (unstable) (2014-12-08 r67137)
Platform: x86_64-apple-darwin12.5.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices datasets utils methods base
other attached packages:
[1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13 IRanges_2.1.41 S4Vectors_0.5.21
[5] BiocGenerics_0.13.6 RUnit_0.4.28 devtools_1.7.0 knitr_1.9
[9] BiocInstaller_1.17.5
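One way to confirm what an installed copy of a package actually provides is to inspect its namespace exports alongside its version. This is a hedged sketch: exports_symbol() is a hypothetical helper, and it is demonstrated on the base package stats so that it runs anywhere.

```r
# TRUE if the installed copy of `pkg` exports an object called `name`.
exports_symbol <- function(pkg, name) {
  requireNamespace(pkg, quietly = TRUE) &&
    name %in% getNamespaceExports(pkg)
}

# Demonstrated with a base package; for the situation described above the
# call would be exports_symbol("GenomicRanges", "rowRanges").
has_median    <- exports_symbol("stats", "median")
has_rowRanges <- exports_symbol("stats", "rowRanges")
installed     <- as.character(packageVersion("stats"))
```

Reporting the result of such a check together with packageVersion() makes it easier to compare a local installation against what the build servers are using.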
On Wed, Mar 4, 2015 at 3:03 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:
On 03/04/2015 10:03 AM, Peter Haverty wrote:
Michael has a good point. The complexity of the BioC universe of classes hurts our ability to attract new users. More classes would be a minus there ... but a small set of common, explicit APIs would simplify things. Rectangular things implement the matrix Interface. :-) Deprecating old stuff, like eSet, might help more than it hurts, on the simplicity front. P.S. apropos of understanding this universe of classes, I *love* the methods(class=x) thing Vincent mentioned.
The current version, under R-devel, is at
devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")
> methods(class="SummarizedExperiment")
[1] [ [[ [[<- [<- [5] $ $<- assay assay<- [9] assayNames assayNames<- assays assays<- [13] cbind coerce colData colData<- [17] compare Compare countOverlaps coverage [21] dim dimnames dimnames<- disjointBins [25] distance distanceToNearest duplicated elementMetadata [29] elementMetadata<- end end<- exptData [33] exptData<- extractROWS findOverlaps flank [37] follow granges isDisjoint mcols [41] mcols<- narrow nearest order [45] overlapsAny precede ranges ranges<- [49] rank rbind replaceROWS resize [53] restrict rowData rowData<- seqinfo [57] seqinfo<- seqnames shift show [61] sort split start start<- [65] strand strand<- subset subsetByOverlaps [69] updateObject values values<- width [73] width<- see ?"methods" for accessing help and source code and
head(attr(methods(class="SummarizedExperiment"), "info"))
generic visible
[,SummarizedExperiment,ANY-method [ TRUE
[[,SummarizedExperiment,ANY,missing-method [[ TRUE
[[<-,SummarizedExperiment,ANY,missing-method [[<- TRUE
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method [<- TRUE
$,SummarizedExperiment-method $ TRUE
$<-,SummarizedExperiment-method $<- TRUE
isS4 from
[,SummarizedExperiment,ANY-method TRUE GenomicRanges
[[,SummarizedExperiment,ANY,missing-method TRUE GenomicRanges
[[<-,SummarizedExperiment,ANY,missing-method TRUE GenomicRanges
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method TRUE GenomicRanges
$,SummarizedExperiment-method TRUE GenomicRanges
$<-,SummarizedExperiment-method TRUE GenomicRanges
Martin
Pete
____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <lawrence.michael at gene.com>
wrote:
I think we need to make sure that there are enough benefits of something
like GRangesFrame before we introduce yet another complicated and
overlapping data structure into the framework. Prior to summarization, the
ranges seem primary, after summarization, it may often make sense for them
to be secondary. But I'm just not sure what we gain from a new data
structure.
On Wed, Mar 4, 2015 at 12:28 AM, Herv? Pag?s <hpages at fredhutch.org> wrote:
GRangesFrame is an interesting idea and I gave it some thoughts.
There is this nice symmetry between GRanges and GRangesFrame:
- GRanges = a naked GRanges + a DataFrame accessible via mcols()
- GRangesFrame = a DataFrame + a naked GRanges accessible via
some accessor (e.g. rowRanges())
So GRanges and GRangesFrame are equivalent in terms of what they
can hold, but different in terms of API: the former has the ranges
API as primary API and the DataFrame API on its mcols() component,
and the latter has the DataFrame API as primary API and the ranges
API on its rowRanges() component. Nice switch!
What does this API switch bring us? A GRangesFrame object is now
an object that fully behaves like a DataFrame and people can also
perform range-based operations on its rowRanges() component.
Here is what I'm afraid is going to happen: people will also want
to be able to perform range-based operations *directly* on
these objects, i.e. without having to call rowRanges() first.
So for example when they do subsetByOverlaps(), subsetting
happens vertically. Also the Hits object returned by findOverlaps()
would contain row indices. Problem with this is that these objects
now start to suffer from the "dual personality syndrome". For
example, it's not clear anymore what their length should be.
Strictly speaking it should be their number of columns (that's
what the length of a DataFrame is), but the ranges API that
we're trying to put on them also makes them feel like vectors
along the vertical dimension so it also feels that their length
should be their number of rows. Same thing with 1D subsetting.
Why does it subset the columns and not the rows? Most people
are now confused.
It's interesting to note that the same thing happens with GRanges
objects, but in the opposite direction: people wish they could
do DataFrame operations directly on them without calling mcols()
first. But in order to preserve the good health of GRanges objects,
we've not done that (except for $, a shortcut for mcols(x)$,
the pressure was just too strong).
H.
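The length and 1D-subsetting conflict described above can be seen with today's classes. This is a minimal sketch, assuming the GenomicRanges package is installed; the object names `gr` and `df` and the `score` column are made up for illustration, and GRangesFrame itself remains hypothetical.

```r
library(GenomicRanges)  # also attaches IRanges and S4Vectors

# A GRanges with three ranges and one metadata column (illustrative data).
gr <- GRanges("chr1",
              IRanges(start = c(1, 11, 21), width = 5),
              score = c(0.1, 0.5, 0.9))

# Ranges API is primary: length and 1D subsetting run along the ranges.
length(gr)   # 3
gr[1:2]      # the first two ranges

# The DataFrame lives behind mcols(): length and 1D subsetting run along columns.
df <- mcols(gr)
length(df)   # 1
df[1]        # the first column

# The one DataFrame-style shortcut that GRanges does support.
identical(gr$score, mcols(gr)$score)
```

An object exposing both APIs directly would have to pick one of these two answers for `length()` and `[`.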
On 03/03/2015 04:35 PM, Michael Lawrence wrote:
Should be possible for the annotations to be of any type, as long as they
satisfy a simple contract of NROW() and 2D "[". Then, you could have a
DataFrame, GRanges, or whatever in there. But it would be nice to have a
special class for the container with range information. The contract for
the range annotation would be to have a granges() method.
I agree it would be nice if there was a way with the methods package to
easily assert such contracts. For example, one could define an interface
with a set of generics (and optionally the relevant position in the generic
signature). Then, once all of the methods have been assigned for a
particular class, it is made to inherit from that contract class. There are
lots of gotchas though. Not sure how useful it would be in practice.
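As a rough illustration of the NROW() and 2D "[" contract mentioned above, here is a base-R sketch that checks the contract at run time by probing. The helper name `satisfiesRowContract` is hypothetical and not part of any package; a real design would more likely rely on S4 generics and class inheritance, as the message suggests.

```r
# Hypothetical helper: TRUE if `x` supports NROW() and 2D row subsetting.
satisfiesRowContract <- function(x) {
    n <- tryCatch(NROW(x), error = function(e) NA_integer_)
    if (is.na(n)) return(FALSE)
    k <- min(n, 1L)
    # Try to take the first row (or zero rows if `x` is empty) with 2D "[".
    sub <- tryCatch(x[seq_len(k), , drop = FALSE], error = function(e) NULL)
    !is.null(sub) && NROW(sub) == k
}

satisfiesRowContract(data.frame(a = 1:3))     # TRUE
satisfiesRowContract(matrix(1:6, nrow = 3))   # TRUE
satisfiesRowContract(1:3)                     # FALSE: no 2D "[" on a plain vector
```

Probing like this only shows that the calls succeed on one input; it says nothing about their semantics, which is one of the gotchas with asserting contracts this way.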
On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com>
wrote:
There are some nice similarities in these new imaginary types. A
"GRangesFrame" is a list of dimensionally identical things (columns) and
some row meta-data (the GRanges). The SE-like object is similarly a list
of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame).
Elegant? Maybe they would actually be relatives in the class tree.
I wonder if this kind of thing would be easier if we had Java-style
Interfaces or duck-typing. The "x" slot of "y" holds something that
implements this set of methods ...
Oh, and kinda apropos, the genoset class will probably go away or become
an extension to this new SE-like thing. The extra stuff that comes along
with genoset will still be available.
Pete
____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com
On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com>
wrote:
This.
It would be damned near perfect as a return value for assays coming out of
an object that held several such assays at several time points in a
population, where there are both assay-wise and covariate-wise "holes" that
could nonetheless be usefully imputed across assays.
Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>
On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com>
wrote:
I still think GRanges should be a subclass of DataFrame,
which would make this easy, but I don't seem to be winning that
argument.
Just impossible. As Michael mentioned back in November, they have
conflicting APIs.
Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
(without mcols) as an index?
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793