[Bioc-devel] Changes to the SummarizedExperiment Class
Yes, you're right! Sorry for the noise. I forgot this was how it always behaved. All I had to do was change the argument name.
On Wed, Apr 1, 2015 at 3:51 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
Hi Michael, On 04/01/2015 07:17 AM, Michael Love wrote:
I'll retract those last two emails about empty GRanges. That's simply:

    se <- SummarizedExperiment(assays, colData=colData)
    mcols(se) <- myDataFrame
Glad you found a simple way to do what you wanted. More below...
On Tue, Mar 31, 2015 at 4:40 PM, Michael Love <michaelisaiahlove at gmail.com> wrote:
Would this code inspired by the release version of GenomicRanges work? e.g. if I want to add a DataFrame with 10 rows:

    names <- letters[1:10]
    x <- relist(GRanges(), PartitioningByEnd(integer(10), names=names))
    mcols(x) <- DataFrame(foo=1:10)

Then give x to the rowRanges argument of SummarizedExperiment?

On Tue, Mar 31, 2015 at 3:49 PM, Michael Love <michaelisaiahlove at gmail.com> wrote:
I forgot to ask my other question. I had gone in early March and fixed my code to eliminate rowData<-, but the argument to SummarizedExperiment was still called rowData, and a DataFrame could be provided. Then I didn't check for a few weeks, but the argument for the rowData slot is now called rowRanges. What's the trick to putting a DataFrame on an empty GRanges, so I can get the old behavior but now using the rowRanges argument?
I'm not sure what you meant by "so I can get the old behavior but now using the rowRanges argument". Just to clarify: the renaming of rowData to rowRanges is a change of name only, not a change of behavior. More precisely the new rowRanges() accessor should behave exactly as the old rowData() accessor. The same applies to the 'rowRanges' argument of the SummarizedExperiment() constructor. So whatever you were passing before to the 'rowData' argument, you should still be able to pass it to the new 'rowRanges' argument. Please let us know if it's not the case as this is certainly not intended. Thanks, H.
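A minimal sketch of passing a GRanges with metadata columns to the 'rowRanges' argument, as described above. It assumes a current Bioconductor installation in which the class is provided by the SummarizedExperiment package (at the time of this thread it lived in GenomicRanges); the object names `counts` and `gr` and the toy coordinates are made up for illustration.

```r
# Sketch only: assumes the SummarizedExperiment package is installed.
# Loading it also attaches GenomicRanges, IRanges and S4Vectors.
library(SummarizedExperiment)

# Toy data: 10 features, 2 samples (hypothetical values).
counts <- matrix(1:20, ncol = 2,
                 dimnames = list(letters[1:10], c("s1", "s2")))

# One range per feature, with a metadata column 'foo'.
gr <- GRanges("chr1", IRanges(start = seq(1, 91, by = 10), width = 5))
names(gr) <- letters[1:10]
mcols(gr) <- DataFrame(foo = 1:10)

# Pass the GRanges through the 'rowRanges' argument.
se <- SummarizedExperiment(assays = list(counts = counts), rowRanges = gr)

rowRanges(se)   # the ranges, including the 'foo' column
mcols(se)$foo   # the same metadata, accessed through the SE
```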
On Tue, Mar 31, 2015 at 3:40 PM, Michael Love <michaelisaiahlove at gmail.com> wrote:
With GenomicRanges 1.19.48, I'm still having issues with re-naming the first assay and duplication of memory from my March 9 email. I tried assayNames<- as well. My use case is if I am given a SummarizedExperiment where the first element is not named "counts" (albeit the SE is most likely coming from summarizeOverlaps() and already named "counts"...).
sessionInfo()
R Under development (unstable) (2015-03-31 r68129)
Platform: x86_64-apple-darwin12.5.0 (64-bit)
Running under: OS X 10.8.5 (Mountain Lion)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4 parallel stats graphics grDevices datasets utils
[8] methods base

other attached packages:
[1] GenomicRanges_1.19.48 GenomeInfoDb_1.3.16 IRanges_2.1.43 S4Vectors_0.5.22
[5] BiocGenerics_0.13.10 testthat_0.9.1 devtools_1.7.0 knitr_1.9
[9] BiocInstaller_1.17.6

loaded via a namespace (and not attached):
[1] formatR_1.1 XVector_0.7.4 tools_3.3.0 stringr_0.6.2 evaluate_0.5.5
On Mon, Mar 9, 2015 at 1:21 PM, Michael Love <michaelisaiahlove at gmail.com> wrote:
On Mar 9, 2015 12:36 PM, "Martin Morgan" <mtmorgan at fredhutch.org> wrote:
On 03/09/2015 08:07 AM, Michael Love wrote:
Some guidance on how to avoid duplication of the matrix for developers would be greatly appreciated.
It's unsatisfactory, but using withDimnames=FALSE avoids duplication on extraction of assays (but obviously you don't have dimnames on the matrix). Row or column subsetting necessarily causes the subsetted assay data to be duplicated. There should not be any duplication when rowRanges() or colData() are changed without changing their dimension / ordering.
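A minimal sketch of the extraction pattern described above, assuming a current SummarizedExperiment package in which `assay()` and `assays()` accept a `withDimnames` argument. The toy matrix is hypothetical, and the block makes no claim about how much memory is saved in any particular package version.

```r
# Sketch only: assumes the SummarizedExperiment package is installed.
library(SummarizedExperiment)

m <- matrix(1:20, ncol = 2,
            dimnames = list(letters[1:10], c("s1", "s2")))
se <- SummarizedExperiment(assays = list(counts = m))

# Default extraction: the dimnames of 'se' are applied to the assay.
with_names <- assay(se, "counts")

# Extraction without applying the dimnames of 'se'; row and column
# names can still be obtained from the container when needed.
without_names <- assay(se, "counts", withDimnames = FALSE)
sample_names  <- colnames(se)
```

Code that only needs the numeric values (for example, column sums) can work on `without_names` and look up `sample_names` separately.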
Thanks Martin for checking into the regression. Sorry, I should have been more specific earlier: I meant more guidance/documentation in the man page for SE. I scanned the 'Extension' section but didn't find a note on withDimnames for extracting the matrix or this example of renaming the assays (it seems like this could easily be relevant for other package authors). A prominent note there might help devs write more memory-efficient packages. The argument section mentions speed but I'd explicitly mention memory given that we're often storing big matrices: "Setting withDimnames=FALSE increases the speed with which assays are extracted." (It's entirely possible the info is there but I missed it.) Best, Mike
Another example of a trouble point is that if I am given an SE with an unnamed assay and I need to give the assay a name, this also can expand the memory used. I had found a solution (which works with GenomicRanges 1.18 / current release) with:

    names(assays(se, withDimnames=FALSE))[1] <- "foo"

But now I'm looking in devel and this appears to no longer work. The memory used expands, equivalent to:

    names(assays(se))[1] <- "foo"

Here's some code to try this:

    m <- matrix(1:1e7,ncol=10,dimnames=list(1:1e6,1:10))
    se <- SummarizedExperiment(m)
    names(assays(se, withDimnames=FALSE))[1] <- "foo"
    names(assays(se))[1] <- "foo"

while running gc() in between steps.
I think this is a regression of some sort, and I'll look into it. Thanks for the heads-up. Martin
On Mon, Mar 9, 2015 at 10:36 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
On Mon, Mar 9, 2015 at 10:30 AM, Vincent Carey <stvjc at channing.harvard.edu> wrote:
I am glad you are keeping this discussion alive Kasper.

On Mon, Mar 9, 2015 at 10:06 AM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
It sounds like the proposed changes are already made. However (like others) I am still a bit mystified why this was necessary. The old version did allow for a GRanges inside the DataFrame of the rowData, as far as I recall. So I assume this is for efficiency. But why? What kind of data/use cases is this for?

I am happy to hear that SummarizedExperiment is going to be spun out into its own package. When that happens, I have some comments, which I'll include here in anticipation.

1) I now very strongly believe it was a design mistake to not have colnames on the assays. The advantage of this choice is that sampleNames are only stored in one place. The extreme disadvantage is the high inefficiency when you want colnames on an extracted assay.
after example(SummarizedExperiment)

    colnames(assays(se1)[[1]])
    [1] "A" "B" "C" "D" "E" "F"

so this seems to be optional. But attempts to set rownames will fail silently

    rownames(assays(se1)[[1]]) = as.character(1:200)
    rownames(assays(se1)[[1]])
    NULL

seems we could issue a warning there
Vince, you need to be careful here.

The assays are stored without colnames (unless something has recently changed). The default is to - upon extraction - set the colnames of the matrix. This however requires a copy of the entire matrix. So essentially, upon extraction, each assay is needlessly duplicated to add the colnames. This is what I mean by inefficient. I would prefer to store the assays with colnames. This means that changing sampleNames of the object will be inefficient (as it is for eSets) since it would require a complete copy of everything. But I would rather - much rather - copy when setting sampleNames than copy when extracting an assay.

Best,
Kasper
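The copy-on-extraction cost described above can be illustrated with base R alone. This is a sketch of the general mechanism (attaching colnames to a matrix that is still referenced elsewhere forces R to duplicate it), not the package's actual extraction code; `stored`, `sample_names` and `extract_with_colnames()` are hypothetical names.

```r
# Base R only. tracemem() needs an R build with memory profiling,
# which capabilities("profmem") reports.
stored <- matrix(1:1e6, ncol = 10)   # assay kept without dimnames
sample_names <- LETTERS[1:10]

extract_with_colnames <- function(x, cn) {
  colnames(x) <- cn   # 'x' is shared with the caller, so R copies the matrix
  x
}

if (capabilities("profmem")) {
  tracemem(stored)    # a message is printed whenever 'stored' is duplicated
  extracted <- extract_with_colnames(stored, sample_names)
  untracemem(stored)
} else {
  extracted <- extract_with_colnames(stored, sample_names)
}

is.null(colnames(stored))   # the stored matrix is left untouched
colnames(extracted)         # the returned copy carries the sample names
```

Storing the matrix with its colnames would move this duplication from every extraction to the (rarer) moment when sample names are changed, which is the trade-off argued for above.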
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
-- Hervé Pagès Program in Computational Biology Division of Public Health Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514 P.O. Box 19024 Seattle, WA 98109-1024 E-mail: hpages at fredhutch.org Phone: (206) 667-5791 Fax: (206) 667-1319