Hi!
The SummarizedExperiment class is an extremely powerful container for
biological data (thank you!), and all my thinking nowadays is just circling
around how to stuff it as effectively as possible.
I have been using 3 dimensions for a long time, which has been very
successful. Now I also have a case for using 4 dimensions. Everything
seemed to work as expected until I tried to subset my object; see the example.
library(GenomicRanges)

rowRanges <- GRanges(
    seqnames = "chrx",
    ranges = IRanges(start = 1:3, end = 4:6),
    strand = "*"
)
coldata <- DataFrame(row.names = paste("s", 1:3, sep = ""))
assays <- SimpleList()

# two dim
assays[["dim2"]] <- array(0, dim = c(3, 3))
se <- SummarizedExperiment(assays, rowRanges = rowRanges, colData = coldata)
se[1]
# works

# three dim
assays[["dim3"]] <- array(0, dim = c(3, 3, 3))
se <- SummarizedExperiment(assays, rowRanges = rowRanges, colData = coldata)
se[1]
# works

# four dim
assays[["dim4"]] <- array(0, dim = c(3, 3, 3, 3))
se <- SummarizedExperiment(assays, rowRanges = rowRanges, colData = coldata)
se[1]
# does not work
# Error in x[i, , , drop = FALSE] : incorrect number of dimensions

This is also the case for rbind and cbind. Would it be appropriate to ask
you to update the SE functions to handle subset, rbind, cbind also for 4
dimensions? I know the next release is very soon, so maybe it is
better to wait until after April 16. Just let me know your thoughts about
it.
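
The error message suggests the subsetting code indexes with a fixed number of commas, which fits 3 dimensions but not 4. As a rough base-R sketch of how a subset along the first dimension could be written without knowing the number of dimensions in advance (subset_first_dim is a hypothetical helper for illustration, not a function in SummarizedExperiment):

```r
# Hypothetical helper (an assumption for illustration, not part of any
# Bioconductor package): subset an array along its first dimension
# without hard-coding how many index slots follow.
subset_first_dim <- function(x, i) {
  n_dims <- length(dim(x))
  # TRUE keeps everything along a dimension, like an empty index slot.
  rest <- rep(list(TRUE), n_dims - 1L)
  # Equivalent to x[i, TRUE, ..., TRUE, drop = FALSE]
  do.call(`[`, c(list(x, i), rest, list(drop = FALSE)))
}

a3 <- array(seq_len(27), dim = c(3, 3, 3))
a4 <- array(seq_len(81), dim = c(3, 3, 3, 3))

sub3 <- subset_first_dim(a3, 1)     # same result as a3[1, , , drop = FALSE]
sub4 <- subset_first_dim(a4, 1:2)   # keeps all four dimensions
```

Because drop = FALSE is passed through, the result keeps the same number of dimensions as the input, which is what a container with per-row annotation would need.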
Jesper
[Bioc-devel] SummarizedExperiment subset of 4 dimensions
5 messages · Jesper Gådin, Wolfgang Huber, Michael Lawrence
1 day later
Dear Jesper,

this is maybe not the answer you want to hear, but stuffing in 4, 5, ... dimensions may not be all that useful, as you can always roll out these higher dimensions into the existing third (or even into the second, the SummarizedExperiment columns). There is Hadley's concept of "tidy data" (see e.g. http://www.jstatsoft.org/v59/i10, a paper that is really worthwhile to read), which implies that the tidy way forward is to stay with 2 (or maybe 3) dimensions in SummarizedExperiment, and to record the information that you'd otherwise stuff into the higher dimensions in the colData covariates.

Wolfgang

Wolfgang Huber
Principal Investigator, EMBL Senior Scientist
Genome Biology Unit
European Molecular Biology Laboratory (EMBL)
Heidelberg, Germany
T +49-6221-3878823
wolfgang.huber at embl.de
http://www.huber.embl.de
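
For illustration, the "rolling out" of a fourth dimension into the third can be sketched in base R as below. The array sizes match Jesper's example; the names k, l and slice_key are assumptions made up for this sketch, and the lookup table only stands in for whatever annotation would record the origin of each flattened slice.

```r
# A 4-D array with the same shape as Jesper's "dim4" assay.
a4 <- array(seq_len(3 * 3 * 3 * 3), dim = c(3, 3, 3, 3))

# Roll the 4th dimension out into the 3rd. R stores arrays in
# column-major order, so resetting dim() moves no data.
a3 <- a4
dim(a3) <- c(3, 3, 3 * 3)

# Element [i, j, k, l] of a4 is element [i, j, k + (l - 1) * 3] of a3.
i <- 2; j <- 3; k <- 1; l <- 2
same <- a4[i, j, k, l] == a3[i, j, k + (l - 1) * 3]

# Lookup table recording which (k, l) pair each flattened slice came from.
# expand.grid varies its first column fastest, matching the array layout.
slice_key <- expand.grid(k = 1:3, l = 1:3)
```

The table slice_key has one row per slice of the flattened third dimension, so row m describes a3[, , m].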
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
Taken in the abstract, the tidy data argument is one for consistent data structures that enable interoperability, which is what we have with SummarizedExperiment. The "long form" or "tidy" data frame is an effective general representation, but if there is additional structure in your data, why not represent it formally? Given the way R lays out the data in arrays, it should be possible to add that fourth dimension, in an assay array, while still using the colData to annotate that structure. It does not make the data any less "tidy", but it does make it more structured.
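
As a small base-R illustration of the two representations being compared here, the same values can be held as a 4-D array or as a "long form" data frame with one row per cell. The dimension names (feature, sample, k, l) are assumptions chosen for this sketch and do not come from SummarizedExperiment.

```r
# A 4-D array with named dimensions (names are illustrative assumptions).
a4 <- array(seq_len(81), dim = c(3, 3, 3, 3),
            dimnames = list(feature = paste0("f", 1:3),
                            sample  = paste0("s", 1:3),
                            k       = paste0("k", 1:3),
                            l       = paste0("l", 1:3)))

# Long ("tidy") form: one row per array cell, one column per dimension.
long <- as.data.frame.table(a4, responseName = "value")

# Rows come out in the array's column-major storage order, so the
# array can be rebuilt directly from the value column.
back <- array(long$value, dim = dim(a4), dimnames = dimnames(a4))
```

Both forms carry the same information; the array form additionally fixes the structure, which is the point being argued above.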
Hi Michael,

where would you put the "colData"-style metadata for the 3rd, 4th, ... dimensions? As an (ex-)physicist of course I like arrays, and the more dimensions the better, but in practical work I've consistently been bitten by the rigidity of such a design choice too early in a process.

Wolfgang
One would need a long-form colData that aligns with the array. But now I realize that's not what Jesper wants to do here, and it is not how SE is currently designed.

Jesper is using the third (and now fourth) dimension to store an additional dimension of information about the same sample. We already support 3D arrays for this, presumably motivated by VCF, where, for example, each sample can have a probability for WT, het, or hom at each position. In that case, all of the values are genotype likelihoods, i.e., they all measure the same thing, so they seem to belong in the same assay. But they also belong to the same biological "sample". Essentially, we have complex measurements that might be a vector, or for Jesper even a matrix.

The important question for interoperability is whether we want there to be a contract that assays are always two dimensions. I guess we've already violated that with VCF. Extending to a fourth is not really hurting anything.
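
To make the VCF-style case concrete, here is a minimal base-R sketch of a 3-D assay in which each (position, sample) cell holds a vector of three genotype likelihoods. All names and numbers are made up for illustration; this is not the layout of any particular package.

```r
# Positions x samples x genotype; every value below is invented.
gl <- array(NA_real_, dim = c(2, 3, 3),
            dimnames = list(c("pos1", "pos2"),
                            c("s1", "s2", "s3"),
                            c("WT", "het", "hom")))

# One "complex measurement": a likelihood vector for one position and sample.
gl["pos1", "s1", ] <- c(0.90, 0.09, 0.01)
gl["pos2", "s1", ] <- c(0.10, 0.80, 0.10)

one_cell <- gl["pos1", "s1", ]

# Subsetting positions keeps samples and genotypes aligned.
first_pos <- gl["pos1", , , drop = FALSE]
```

With a fourth dimension, each cell would hold a matrix instead of a vector, which is the situation Jesper describes.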