Hello,
I have a matrix with column names. When I create a DESeqDataSet from it, the column names of the resulting matrix are replaced by numbers. This causes problems when creating an ExpressionSet from it (for example, after a regularised logarithm transformation), because the sample names no longer agree.
library(DESeq2)
library(Biobase)
exprMatrix <- matrix(c(rnbinom(50, 1/0.15, mu = 30), rnbinom(50, 1/0.15, mu = 10)), ncol = 10)
colnames(exprMatrix) <- LETTERS[1:10]
exampleDDS <- DESeqDataSetFromMatrix(exprMatrix, data.frame(class = rep(c("Poor", "Good"), each = 5)), formula(~ class))
groupsTable <- data.frame(class = rep(c("Poor", "Good"), each = 5))
rownames(groupsTable) <- LETTERS[1:10]
exampleSet <- ExpressionSet(counts(exampleDDS), AnnotatedDataFrame(groupsTable))
Error in validObject(.Object) :
invalid class “ExpressionSet” object: 1: sampleNames differ between assayData and phenoData
invalid class “ExpressionSet” object: 2: sampleNames differ between phenoData and protocolData
--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
[Bioc-devel] DESeqDataSetFromMatrix Changes Column Names
4 messages · Dario Strbenac, Michael Love
1 day later
Hi Dario,
Which version are you using?
I think the column names of the matrices in the assays of SummarizedExperiment are coming from the rownames of colData. My priority is to avoid doubling the memory footprint in object creation. I think preserving the colnames of the matrix was relevant to memory usage at some point, but I forget exactly the details.
Mike
On Aug 25, 2014 10:35 PM, "Dario Strbenac" <dstr7320 at uni.sydney.edu.au> wrote:
Hello,
I have a matrix with column names. [...]
_______________________________________________ Bioc-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
I am using the latest release version. I understand your recommendation about colData and will use it.
--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
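For readers following the thread, the colData recommendation can be sketched roughly as below. This is only an illustration, assuming DESeq2 and Biobase are installed and reusing the variable names from the original post. The one change is that the same groupsTable, with row names set to the matrix column names, is passed as colData, so the sample names should agree on both sides. The set.seed call and the explicit factor() are additions for reproducibility and are not part of the original example.

```r
library(DESeq2)
library(Biobase)

set.seed(1)  # assumption: added only so the simulated counts are reproducible

# Simulated counts, as in the original post
exprMatrix <- matrix(c(rnbinom(50, 1/0.15, mu = 30),
                       rnbinom(50, 1/0.15, mu = 10)), ncol = 10)
colnames(exprMatrix) <- LETTERS[1:10]

# Sample table whose row names match the column names of the count matrix
groupsTable <- data.frame(class = factor(rep(c("Poor", "Good"), each = 5)),
                          row.names = colnames(exprMatrix))

# Pass the named table as colData instead of an unnamed data.frame
exampleDDS <- DESeqDataSetFromMatrix(exprMatrix, groupsTable, ~ class)

# The sample names now agree, so the ExpressionSet passes validation
stopifnot(identical(colnames(counts(exampleDDS)), rownames(groupsTable)))
exampleSet <- ExpressionSet(counts(exampleDDS), AnnotatedDataFrame(groupsTable))
```

The same groupsTable can then be reused for an ExpressionSet built from transformed values, since their column names also come from the DESeqDataSet.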
Hi Dario,
Here's some example behavior of SummarizedExperiment (here in devel). The renaming behavior is coming from GenomicRanges. In any case, I can't avoid the duplication of memory when the colnames of the matrix conflict with the rownames of colData, unless I internally overwrite the rownames of colData. But I don't think I would do this, because the standard is to let the colData take precedence.
Watch the Vcells (used):
library(GenomicRanges)
gc()
m = matrix(rnorm(5e6),ncol=100,dimnames=list(1:5e4,paste0("a",1:100)))
gc() # 40 Mb or so taken by m
se = SummarizedExperiment(m)
gc() # no duplication after creating se
rm(se)
se = SummarizedExperiment(m,
  colData=DataFrame(x=1:100,row.names=paste0("b",1:100)))
colnames(se) # rownames of colData take precedence for the colnames of se
colnames(assay(se)) # and over the colnames of m
gc() # note a duplication,
# because the colnames of the matrix in assay() were replaced
rm(se)
se = SummarizedExperiment(m, colData=DataFrame(x=1:100,row.names=colnames(m)))
gc() # no duplication, same names.
# so you can use this code to insist that
# the colnames of the DESeqDataSet come from the counts matrix
rm(m,se)
m = matrix(rnorm(5e6),ncol=100)
gc()
se = SummarizedExperiment(m, colData=DataFrame(x=1:100))
gc() # no duplication if m has no colnames going in
R Under development (unstable) (2014-06-05 r65862)
Platform: x86_64-apple-darwin12.5.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] parallel stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] GenomicRanges_1.17.35 GenomeInfoDb_1.1.18 IRanges_1.99.24
[4] S4Vectors_0.1.2 BiocGenerics_0.11.4 devtools_1.5
[7] slidify_0.4.5 knitr_1.6 BiocInstaller_1.15.5
loaded via a namespace (and not attached):
[1] compiler_3.2.0 digest_0.6.4 evaluate_0.5.5 formatR_0.10 httr_0.4
[6] markdown_0.7.2 memoise_0.2.1 RCurl_1.95-4.3 stats4_3.2.0 stringr_0.6.2
[11] tools_3.2.0 whisker_0.3-2 XVector_0.5.7 yaml_2.1.13
On Tue, Aug 26, 2014 at 2:00 AM, Dario Strbenac
<dstr7320 at uni.sydney.edu.au> wrote:
I am using the latest release version. I understand your recommendation about colData and will use it.