[Bioc-devel] SummarizedExperiments

On Thu, Aug 30, 2012 at 6:27 AM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:

N.B. One of the reasons for the existence of the MergedDataSet class in
regulatoR (to be submitted for review shortly) is that, while SEs are
absolutely fantastic for managing data matrices that are stapled to a
GRanges, what is less awesome is having a relatively light-weight DataFrame
for phenotypic data that requires the entire memory footprint be recreated
upon writing a new column into said DataFrame.

If R5 classes didn't spook me a little, I would already have done something

We don't often use this R5 terminology but I see Hadley has made an
accessible document referring to reference classes in this way.
To me the challenge is more conceptual -- pass-by-reference and the way 
that two variables pointing to the instance are updated at the same time 
-- and I had been thinking of a LockedEnvironment-style implementation 
where some operations were free ('copying') but others weren't (subset, 
subset assign). But maybe there are some more direct approaches...
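To make the pass-by-reference concern concrete, here is a minimal base-R sketch. It is not SummarizedExperiment code; all object names are made up for illustration, and the locked environment only approximates the LockedEnvironment idea mentioned above.

```r
# Ordinary R values are copied on modification: 'b' diverges from 'a'.
a <- list(x = 1)
b <- a
b$x <- 2
a$x   # still 1

# Environments are references: 'e1' and 'e2' name the same object,
# so an update through one is visible through the other.
e1 <- new.env()
assign("x", 1, envir = e1)
e2 <- e1
assign("x", 2, envir = e2)
get("x", envir = e1)   # now 2

# A locked environment keeps the cheap 'copy' but refuses in-place changes.
locked <- new.env()
assign("counts", matrix(0, 2, 2), envir = locked)
lockEnvironment(locked, bindings = TRUE)
res <- tryCatch({
    assign("counts", matrix(1, 2, 2), envir = locked)
    "modified"
}, error = function(err) "refused")
res   # "refused"
```

With the locked variant, any operation that needs to change the data (subset, subset assign) would have to build a fresh environment, which is where the cost moves to.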
My 2c: This is a situation where some experimental data would be helpful.

Yes, for instance where in the interactive use is time being spent? Is
it copying the assays, or validity, or actually updating the row data?
Is 500000 x 800 an appropriate scale to be thinking about?
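One rough way to collect that kind of data is to time the pieces separately. The sketch below uses a hypothetical stand-in class (ToySE, not the real SummarizedExperiment) at a much smaller scale than 500000 x 800, so the numbers it produces say nothing about the real class; it only shows the shape of the experiment.

```r
library(methods)

# Hypothetical stand-in for an SE-like container (assumption, not the real class).
setClass("ToySE",
    slots = c(assay = "matrix", colData = "data.frame"),
    validity = function(object) {
        if (ncol(object@assay) != nrow(object@colData))
            "ncol(assay) must equal nrow(colData)"
        else
            TRUE
    })

toy <- new("ToySE",
    assay = matrix(0, nrow = 5000, ncol = 80),
    colData = data.frame(id = seq_len(80)))

# Step 1: build the updated column data outside the object.
t_update <- system.time({
    cd <- toy@colData
    cd$group <- rep(c("a", "b"), length.out = nrow(cd))
})

# Step 2: write it back into the object (may trigger a copy).
t_assign <- system.time(toy@colData <- cd)

# Step 3: run the validity method explicitly.
t_valid <- system.time(validObject(toy))

print(rbind(update = t_update, assign = t_assign, valid = t_valid)[, 1:3])
```

For the real class, wrapping the slow assignment in Rprof() / summaryRprof() would give a finer breakdown than these three coarse timings.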
The main avenues for a developer seem to be a) use environments or
reference classes; there are some costs and we should understand them, and
b) use an out-of-memory approach like rhdf5 or ff.  Again there will be
some costs.  It should be relatively easy to experiment with these.  One
thing I just learned about is setValidity2 and disableValidity (defined in
IRanges IIRC) ... these allow you to construct certain variations on
SummarizedExperiment with less attention to deeper infrastructure.
Probably I can make better use of the insights the IRanges guys have had
in their careful development and application of validity methods, though 
I feel a bit like these are 'attractive hazards' that tempt us to do 
unsafe things and then pay the price later. This is likely the first 
direction I'll explore.

Exploring a little I already see that there are some pretty dumb things 
being done in assignment.

Martin
whereby the assays/libraries for a given study subject are all pointed to
as SEs (i.e. RNAseq, BSseq, expression/methylation arrays, CNV/SNP arrays,
WGS or exomic DNAseq) and the column (phenotype) data can avoid being
subject to these constraints.  Truth be told I *still* want to do that
because, most of the time, updates to the latter are independent of, and
happen subsequently to, loading the former.

Suggestions would be welcome, because other than these minor niggles, the
SummarizedExperiment class is almost perfect for many tasks.

On Wed, Aug 29, 2012 at 9:57 PM, Tim Triche, Jr. <tim.triche at gmail.com> wrote:

Assigning new colData columns, or overwriting old ones, in a sizable (say
500000 row x 800 column) SE is nauseatingly slow.
There has to be a better way -- I'm willing to write it if someone can
point out an obvious way to do it.

On Wed, Aug 29, 2012 at 9:52 PM, Kasper Daniel Hansen <
kasperdanielhansen at gmail.com> wrote:

On Thu, Aug 30, 2012 at 12:44 AM, Martin Morgan <mtmorgan at fhcrc.org>
wrote:
On 08/29/2012 06:46 PM, Kasper Daniel Hansen wrote:
There is a lot of good stuff to say about SummarizedExperiments, and
from a certain point of view I have a parallel implementation in bsseq
(and there is also one in genoset).

However, I really like having the assayData inside an environment.
This helps some on memory and - equally important - speed at the
command line.  I certainly need to very heavily consider using an
environment in bsseq.

After some discussion with Tim (Triche) we have agreed that something
like SummarizedExperiments is the way to go at least for the
methylation arrays.  We need to be able to easily handle 1000s of
samples.

What is the chance that we can get the option of having the assayData
inside an environment, perhaps by:
    a) Making a class that is an environment and inherits from SimpleList.
    b) Using a classUnion between the existing class of the assayData and
       an environment.
    c) A third option that is probably better than the preceding two, but
       which I cannot come up with right now.
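For what it is worth, the classUnion option can be sketched with the methods package alone. Everything below is illustrative: "AssaysStore", "ToyExperiment" and getAssay() are made-up names, and a plain list stands in for SimpleList so the example has no Bioconductor dependency. (Biobase's AssayData is, as far as I know, a union of list and environment in much the same spirit.)

```r
library(methods)

# Hypothetical union: the assays slot may hold a list or an environment.
setClassUnion("AssaysStore", c("list", "environment"))

setClass("ToyExperiment",
    slots = c(assays = "AssaysStore", colData = "data.frame"))

# Hypothetical accessor that hides which kind of store is in use.
getAssay <- function(x, name) {
    store <- x@assays
    if (is.environment(store))
        get(name, envir = store)
    else
        store[[name]]
}

m <- matrix(1:6, nrow = 3)
pheno <- data.frame(id = 1:2)

# Same data, once held in a list and once in an environment.
as_list <- new("ToyExperiment", assays = list(counts = m), colData = pheno)

env <- new.env()
assign("counts", m, envir = env)
as_env <- new("ToyExperiment", assays = env, colData = pheno)

# Updating colData on the environment-backed object leaves the assays
# slot pointing at the same environment rather than at a copy.
as_env2 <- as_env
as_env2@colData$group <- c("a", "b")
identical(as_env2@assays, as_env@assays)   # TRUE
```

The price is the one raised elsewhere in this thread: two objects sharing an environment also share any later in-place change to the assays.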

Probably something can / will be done. I guess the slowness you're talking
about is when rowData / colData columns are manipulated; any kind of
subsetting would mean a 'deep' copy.

Martin

Yes, for example manipulating colData - something that conceptually
should be quick and easy.  Of course, this will not affect any real
computation on the assayData matrices, but it will make life at the
command prompt more pleasant.

Kasper

This would - in my opinion - be very nice and worthwhile.

Kasper

_______________________________________________
Bioc-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


--
*A model is a lie that helps you see the truth.*
*
*
Howard Skipper<
http://cancerres.aacrjournals.org/content/31/9/1173.full.pdf>

