[Bioc-devel] Changes to the SummarizedExperiment Class

I am glad you are keeping this discussion alive Kasper.

On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen wrote:

It sounds like the proposed changes are already made.  However (like
others) I am still a bit mystified why this was necessary.  The old version
did allow for a GRanges inside the DataFrame of the rowData, as far as I
recall.  So I assume this is for efficiency.  But why?  What kind of
data/use cases is this for?

I am happy to hear that SummarizedExperiment is going to be spun out into
its own package.  When that happens, I have some comments, which I'll
include here in anticipation:
  1) I now very strongly believe it was a design mistake to not have
colnames on the assays.  The advantage of this choice is that sampleNames
are only stored in one place.  The extreme disadvantage is the high
inefficiency when you want colnames on an extracted assay.

After example(SummarizedExperiment):
colnames(assays(se1)[[1]])
[1] "A" "B" "C" "D" "E" "F"

So this seems to be optional.  But attempts to set rownames will fail
silently:
rownames(assays(se1)[[1]]) = as.character(1:200)
rownames(assays(se1)[[1]])
NULL
It seems we could issue a warning there.

  2) I still strongly believe we should support pData, sampleNames etc etc
on SummarizedExperiments.

worthy of discussion
  3) Having developed a package (minfi) where eSets co-exist with
SummarizedExperiment, I have to mention that for the developer there are a
number of places where the different internals of these two classes make
life irritating.  For this reason I would support a "modern" implementation
of eSet, in parallel with SummarizedExperiment.

also worthy of further discussion IMHO
Best,
Kasper

On Fri, Mar 6, 2015 at 10:59 AM, Valerie Obenchain <vobencha at fredhutch.org> wrote:

Hi Mike,

Our error - we didn't bump GenomicRanges when rowRanges was added.
Hopefully 1.19.43 will propagate today and things will be sorted out.

Val

On 03/06/2015 07:40 AM, Michael Love wrote:

hi all,

just a practical issue: I have GenomicRanges version 1.19.42 on my
computer which does not have rowRanges defined, although the 1.19.42
version on the Bioc website does have rowRanges in the man page:

http://master.bioconductor.org/packages/3.1/bioc/html/GenomicRanges.html
So I pass check locally but not in the devel branch on Bioc servers.

 library(GenomicRanges)
rowRanges

Error: object 'rowRanges' not found

sessionInfo()

R Under development (unstable) (2014-12-08 r67137)
Platform: x86_64-apple-darwin12.5.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41        S4Vectors_0.5.21
[5] BiocGenerics_0.13.6   RUnit_0.4.28          devtools_1.7.0        knitr_1.9
[9] BiocInstaller_1.17.5

On Wed, Mar 4, 2015 at 3:03 PM, Martin Morgan <mtmorgan at fredhutch.org> wrote:

On 03/04/2015 10:03 AM, Peter Haverty wrote:

Michael has a good point. The complexity of the BioC universe of classes
hurts our ability to attract new users. More classes would be a minus
there ... but a small set of common, explicit APIs would simplify things.
Rectangular things implement the matrix Interface.  :-) Deprecating old
stuff, like eSet, might help more than it hurts, on the simplicity front.

P.S. apropos of understanding this universe of classes, I *love* the
methods(class=x) thing Vincent mentioned.

The current version, under R-devel, is at

   devtools::source_gist("https://gist.github.com/mtmorgan/9f98871adb9f0c1891a4")

   > methods(class="SummarizedExperiment")
    [1] [                 [[                [[<-              [<-
    [5] $                 $<-               assay             assay<-
    [9] assayNames        assayNames<-      assays            assays<-
   [13] cbind             coerce            colData           colData<-
   [17] compare           Compare           countOverlaps     coverage
   [21] dim               dimnames          dimnames<-        disjointBins
   [25] distance          distanceToNearest duplicated        elementMetadata
   [29] elementMetadata<- end               end<-             exptData
   [33] exptData<-        extractROWS       findOverlaps      flank
   [37] follow            granges           isDisjoint        mcols
   [41] mcols<-           narrow            nearest           order
   [45] overlapsAny       precede           ranges            ranges<-
   [49] rank              rbind             replaceROWS       resize
   [53] restrict          rowData           rowData<-         seqinfo
   [57] seqinfo<-         seqnames          shift             show
   [61] sort              split             start             start<-
   [65] strand            strand<-          subset            subsetByOverlaps
   [69] updateObject      values            values<-          width
   [73] width<-

   see ?"methods" for accessing help and source code

and

 head(attr(methods(class="SummarizedExperiment"), "info"))

                                                             generic visible isS4          from
[,SummarizedExperiment,ANY-method                                  [    TRUE TRUE GenomicRanges
[[,SummarizedExperiment,ANY,missing-method                        [[    TRUE TRUE GenomicRanges
[[<-,SummarizedExperiment,ANY,missing-method                    [[<-    TRUE TRUE GenomicRanges
[<-,SummarizedExperiment,ANY,ANY,SummarizedExperiment-method     [<-    TRUE TRUE GenomicRanges
$,SummarizedExperiment-method                                      $    TRUE TRUE GenomicRanges
$<-,SummarizedExperiment-method                                  $<-    TRUE TRUE GenomicRanges

Martin

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Wed, Mar 4, 2015 at 9:38 AM, Michael Lawrence <lawrence.michael at gene.com> wrote:

I think we need to make sure that there are enough benefits of something
like GRangesFrame before we introduce yet another complicated and
overlapping data structure into the framework. Prior to summarization, the
ranges seem primary; after summarization, it may often make sense for them
to be secondary. But I'm just not sure what we gain from a new data
structure.

On Wed, Mar 4, 2015 at 12:28 AM, Hervé Pagès <hpages at fredhutch.org> wrote:

GRangesFrame is an interesting idea and I gave it some thought.
There is this nice symmetry between GRanges and GRangesFrame:

- GRanges = a naked GRanges + a DataFrame accessible via mcols()

- GRangesFrame = a DataFrame + a naked GRanges accessible via
                   some accessor (e.g. rowRanges())

So GRanges and GRangesFrame are equivalent in terms of what they
can hold, but different in terms of API: the former has the ranges
API as primary API and the DataFrame API on its mcols() component,
and the latter has the DataFrame API as primary API and the ranges
API on its rowRanges() component. Nice switch!

What does this API switch bring us? A GRangesFrame object is now
an object that fully behaves like a DataFrame and people can also
perform range-based operations on its rowRanges() component.
Here is what I'm afraid is going to happen: people will also want
to be able to perform range-based operations *directly* on
these objects, i.e. without having to call rowRanges() first.
So for example when they do subsetByOverlaps(), subsetting
happens vertically. Also the Hits object returned by findOverlaps()
would contain row indices. Problem with this is that these objects
now start to suffer from the "dual personality syndrome". For
example, it's not clear anymore what their length should be.
Strictly speaking it should be their number of columns (that's
what the length of a DataFrame is), but the ranges API that
we're trying to put on them also makes them feel like vectors
along the vertical dimension so it also feels that their length
should be their number of rows. Same thing with 1D subsetting.
Why does it subset the columns and not the rows? Most people
are now confused.
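
The length and 1D-subsetting ambiguity described above can be seen with a plain base R data.frame, which follows the same list-of-columns convention that the message attributes to DataFrame. This is only an illustrative sketch in base R; it does not use GRangesFrame (a proposed class that does not exist) or any Bioconductor class, and the variable names are made up for the example.

```r
# Base R data.frame: a list of columns, so length() and 1D "[" act on columns.
df <- data.frame(score = c(0.1, 0.5, 0.9), name = c("a", "b", "c"))

n_len  <- length(df)   # counts columns, not rows
n_rows <- nrow(df)     # counts rows
first  <- df[1]        # 1D subsetting selects the first column

# An atomic vector behaves "vertically": length() and 1D "[" act on elements.
v <- c(10, 20, 30)
v_len   <- length(v)
v_first <- v[1]
```

An object that tried to offer both APIs at once would have to pick one of these two meanings for length() and for x[i], which is the source of the confusion.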

It's interesting to note that the same thing happens with GRanges
objects, but in the opposite direction: people wish they could
do DataFrame operations directly on them without calling mcols()
first. But in order to preserve the good health of GRanges objects,
we've not done that (except for $, a shortcut for mcols(x)$,
the pressure was just too strong).

H.

On 03/03/2015 04:35 PM, Michael Lawrence wrote:

Should be possible for the annotations to be of any type, as long as they
satisfy a simple contract of NROW() and 2D "[". Then, you could have a
DataFrame, GRanges, or whatever in there. But it would be nice to have a
special class for the container with range information. The contract for
the range annotation would be to have a granges() method.

I agree it would be nice if there was a way with the methods package to
easily assert such contracts. For example, one could define an interface
with a set of generics (and optionally the relevant position in the generic
signature). Then, once all of the methods have been assigned for a
particular class, it is made to inherit from that contract class. There are
lots of gotchas though. Not sure how useful it would be in practice.
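
A minimal sketch of what checking the NROW() and 2D "[" contract might look like at run time, in base R only. The helper satisfies_row_contract() is hypothetical and not part of the methods package or any Bioconductor package; it probes behavior rather than declaring an interface, and whether a given Bioconductor class accepts drop = FALSE in "[" is not checked here.

```r
# Hypothetical helper (an assumption for illustration, not an existing API):
# returns TRUE if x supports NROW() and row subsetting via 2D "[".
satisfies_row_contract <- function(x) {
  n <- tryCatch(NROW(x), error = function(e) NA_integer_)
  if (is.na(n)) return(FALSE)
  k <- min(n, 1L)  # try to take one row (or zero rows if x is empty)
  ok <- tryCatch(
    NROW(x[seq_len(k), , drop = FALSE]) == k,
    error = function(e) FALSE
  )
  isTRUE(ok)
}

is_matrix_ok <- satisfies_row_contract(matrix(1:6, nrow = 2))
is_df_ok     <- satisfies_row_contract(data.frame(a = 1:3, b = 4:6))
is_vector_ok <- satisfies_row_contract(1:3)  # plain vector has no 2D "["
```

A probe like this only catches violations when it is called; the inheritance-based approach described above would instead record the contract in the class hierarchy.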

On Tue, Mar 3, 2015 at 4:07 PM, Peter Haverty <haverty.peter at gene.com> wrote:

There are some nice similarities in these new imaginary types. A
"GRangesFrame" is a list of dimensionally identical things (columns) and
some row meta-data (the GRanges).  The SE-like object is similarly a list
of dimensionally like things (matrices, RleDataFrames, BigMatrix objects,
HDF5-backed things) with some row meta-data (a DataFrame or GRangesFrame).
Elegant?  Maybe they would actually be relatives in the class tree.
I wonder if this kind of thing would be easier if we had Java-style
Interfaces or duck-typing.  The "x" slot of "y" holds something that
implements this set of methods ...

Oh, and kinda apropos, the genoset class will probably go away or become
an extension to this new SE-like thing.  The extra stuff that comes along
with genoset will still be available.

Pete

____________________
Peter M. Haverty, Ph.D.
Genentech, Inc.
phaverty at gene.com

On Tue, Mar 3, 2015 at 3:42 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:

   This.

It would be damned near perfect as a return value for assays coming out of
an object that held several such assays at several time points in a
population, where there are both assay-wise and covariate-wise "holes" that
could nonetheless be usefully imputed across assays.

Statistics is the grammar of science.
Karl Pearson <http://en.wikipedia.org/wiki/The_Grammar_of_Science>

On Tue, Mar 3, 2015 at 3:25 PM, Peter Haverty <haverty.peter at gene.com> wrote:

I still think GRanges should be a subclass of DataFrame, which would make
this easy, but I don't seem to be winning that argument.

Just impossible. As Michael mentioned back in November, they have
conflicting APIs.

Maybe a new "GRangesFrame" that is a DataFrame and holds a GRanges
(without mcols) as an index?

           [[alternative HTML version deleted]]

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


 --
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, Seattle, WA 98109

Email: vobencha at fredhutch.org
Phone: (206) 667-3158

