[Bioc-devel] SummarizedExperiments
On 08/30/2012 04:42 AM, Vincent Carey wrote:
On Thu, Aug 30, 2012 at 6:27 AM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
nb. one of the reasons for the existence of the MergedDataSet class in regulatoR (to be submitted for review shortly) is that, while SEs are absolutely fantastic for managing data matrices that are stapled to a GRanges, what is less awesome is having a relatively light-weight DataFrame for phenotypic data that requires the entire memory footprint be recreated upon writing a new column into said DataFrame. If R5 classes didn't spook me a little, I would already have done something
We don't often use this R5 terminology but I see Hadley has made an accessible document referring to reference classes in this way.
To me the challenge is more conceptual -- pass-by-reference and the way
that two variables pointing to the instance are updated at the same time
-- and I had been thinking of a LockedEnvironment-style implementation
where some operations were free ('copying') but others weren't (subset,
subset assign). But maybe there are some more direct approaches...
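A minimal base-R sketch of the two points above, for illustration only. The names `assays`, `alias`, and `copy_env` are hypothetical, and this is not how SummarizedExperiment or any Bioconductor package implements storage. It shows that two names bound to one environment see the same update, and that locking the environment keeps 'copying' free (only the reference is copied) while forcing an explicit copy before any modification.

```r
# Two names bound to the same environment: an update through one name
# is visible through the other, because no copy is made on assignment.
assays <- new.env()
assign("counts", matrix(0, nrow = 4, ncol = 3), envir = assays)
alias <- assays
assign("counts", matrix(1, nrow = 4, ncol = 3), envir = alias)
same_update <- identical(assays$counts, alias$counts)

# Locking the environment and its bindings turns in-place modification
# into an error, so shared references can no longer change silently.
lockEnvironment(assays, bindings = TRUE)
attempt <- tryCatch({
    assign("counts", matrix(2, nrow = 4, ncol = 3), envir = assays)
    "modified"
}, error = function(e) "locked")

# Hypothetical helper: modification now requires an explicit copy of
# every binding into a fresh, unlocked environment.
copy_env <- function(env) {
    out <- new.env()
    for (nm in ls(env, all.names = TRUE))
        assign(nm, get(nm, envir = env), envir = out)
    out
}
unlocked <- copy_env(assays)
assign("counts", matrix(2, nrow = 4, ncol = 3), envir = unlocked)
```

In this sketch the cost moves from every assignment to the explicit copy step, which is the trade-off described above.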
My 2c: This is a situation where some experimental data would be helpful.
Yes, for instance where in the interactive use is time being spent? Is it copying the assays, or validity, or actually updating the row data? Is 500000 x 800 an appropriate scale to be thinking about?
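One way to start collecting such data is sketched below, under loud assumptions: `ToySE` is a hypothetical stand-in class written in base R, not the real SummarizedExperiment, and the dimensions are far smaller than the 500000 x 800 case so the run stays short. It times a column assignment through the whole object, the same assignment on the bare data.frame, and the validity check on its own, so the three costs can be compared on a given R version.

```r
# Hypothetical stand-in for a SummarizedExperiment (NOT the real class):
# one assay matrix plus a per-column data.frame, with a validity method.
setClass("ToySE",
    representation(assay = "matrix", colData = "data.frame"),
    validity = function(object) {
        if (ncol(object@assay) != nrow(object@colData))
            "ncol(assay) must equal nrow(colData)"
        else TRUE
    })

# Deliberately small dimensions; scale up to probe realistic sizes.
n_row <- 20000L
n_col <- 100L
se <- new("ToySE",
    assay = matrix(0, nrow = n_row, ncol = n_col),
    colData = data.frame(id = seq_len(n_col)))

# Column assignment through the object, which may copy the object.
t_object <- system.time(
    for (i in 1:20) se@colData$group <- rep("a", n_col))

# The same assignment on the bare data.frame, outside the object.
cd <- se@colData
t_bare <- system.time(
    for (i in 1:20) cd$group <- rep("b", n_col))

# The validity check by itself.
t_valid <- system.time(for (i in 1:20) validObject(se))

print(rbind(object = t_object, bare = t_bare, validity = t_valid)[, 1:3])
```

The timings depend heavily on the R version and on how much of the object is duplicated on slot assignment, so the numbers are only meaningful relative to each other on one machine.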
The main avenues for a developer seem to be a) use environments or reference classes; there are some costs and we should understand them, and b) use an out-of-memory approach like rhdf5 or ff. Again there will be some costs. It should be relatively easy to experiment with these. One thing I just learned about is setValidity2 and disableValidity (defined in IRanges IIRC) ... these allow you to construct certain variations on SummarizedExperiment with less attention to deeper infrastructure.
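As a rough illustration of avenue (a), the sketch below keeps phenotype data in a reference class so that adding a column modifies the object in place. `PhenoRef`, `pdata`, and `setColumn` are hypothetical names invented for this example; nothing here is part of SummarizedExperiment, IRanges, or regulatoR.

```r
# Hypothetical reference class holding phenotype (column) data.
# Fields of a reference class are modified in place with `<<-`.
PhenoRef <- setRefClass("PhenoRef",
    fields = list(pdata = "data.frame"),
    methods = list(
        setColumn = function(name, value) {
            # Minimal check standing in for a validity method.
            stopifnot(length(value) == nrow(pdata))
            pdata[[name]] <<- value
            invisible(.self)
        }))

pheno <- PhenoRef$new(pdata = data.frame(id = 1:3))

# `alias` is a second reference to the same object, not a copy.
alias <- pheno
alias$setColumn("group", c("a", "b", "c"))

# The update made through `alias` is visible through `pheno`.
print(pheno$pdata)

# An independent object requires an explicit copy.
snapshot <- pheno$copy()
snapshot$setColumn("batch", c(1, 1, 2))
```

The cost noted in the thread shows up here directly: any code holding `alias` sees the change, so callers that expect R's usual copy semantics need an explicit `$copy()`.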
probably I can make better use of the insights the IRanges guys have had in their careful development and application of validity methods, though I feel a bit like these are 'attractive hazards' that tempt us to do unsafe things and then pay the price later. This is likely the first direction I'll explore. Exploring a little I already see that there are some pretty dumb things being done in assignment. Martin
whereby the assays/libraries for a given study subject are all pointed to as SEs (i.e. RNAseq, BSseq, expression/methylation arrays, CNV/SNP arrays, WGS or exomic DNAseq) and the column (phenotype) data can avoid being subject to these constraints. Truth be told I *still* want to do that because, most of the time, updates to the latter are independent of, and happen subsequently to, loading the former. Suggestions would be welcome, because other than these minor niggles, the SummarizedExperiment class is almost perfect for many tasks.
On Wed, Aug 29, 2012 at 9:57 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:
assigning new colData columns, or overwriting old ones, in a sizable (say 500000 row x 800 column) SE is nauseatingly slow.
There has to be a better way -- I'm willing to write it if someone can point out an obvious way to do it.
On Wed, Aug 29, 2012 at 9:52 PM, Kasper Daniel Hansen <kasperdanielhansen at gmail.com> wrote:
On Thu, Aug 30, 2012 at 12:44 AM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
On 08/29/2012 06:46 PM, Kasper Daniel Hansen wrote:
There is a lot of good stuff to say about SummarizedExperiments, and from a certain point of view I have a parallel implementation in bsseq (and there is also one in genoset).
However, I really like having the assayData inside an environment.
This helps some on memory and - equally important - speed at the
command line. I certainly need to very heavily consider using an
environment in bsseq.
After some discussion with Tim (Triche) we have agreed that something
like SummarizedExperiments is the way to go at least for the
methylation arrays. We need to be able to easily handle 1000s of
samples.
What is the chance that we can get the option of having the assayData
inside an environment, perhaps by
1) Making a class that is an environment and inherits from
SimpleList.
2) Using a classUnion between the existing class of the assayData and
an environment.
3) A third option that is probably better than the preceding two, but
which I cannot come up with right now.
Probably something can / will be done. I guess the slowness you're talking about is when rowData / colData columns are manipulated; any kind of subsetting would mean a 'deep' copy. Martin
Yes, for example manipulating colData - something that conceptually should be quick and easy. Of course, this will not affect any real computation on the assayData matrices, but it will make life at the command prompt more pleasant. Kasper
This would - in my opinion - be very nice and worthwhile. Kasper
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
-- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793
-- *A model is a lie that helps you see the truth.* Howard Skipper <http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>