[Bioc-devel] DESeqDataSetFromMatrix Changes Column Names

4 messages · Dario Strbenac, Michael Love

#
Hello,

I have a matrix with column names. When I create a DESeqDataSet from it, the column names of the resulting count matrix are replaced by numbers. This causes problems when creating an ExpressionSet from it (for example, after a regularised logarithm transformation), because the sample names no longer agree.

library(DESeq2)   # DESeqDataSetFromMatrix, counts
library(Biobase)  # ExpressionSet, AnnotatedDataFrame

exprMatrix <- matrix(c(rnbinom(50, 1/0.15, mu = 30), rnbinom(50, 1/0.15, mu = 10)), ncol = 10)
colnames(exprMatrix) <- LETTERS[1:10]
exampleDDS <- DESeqDataSetFromMatrix(exprMatrix, data.frame(class = rep(c("Poor", "Good"), each = 5)), formula(~ class))

groupsTable <- data.frame(class = rep(c("Poor", "Good"), each = 5))
rownames(groupsTable) <- LETTERS[1:10]
exampleSet <- ExpressionSet(counts(exampleDDS), AnnotatedDataFrame(groupsTable))
Error in validObject(.Object) : 
  invalid class “ExpressionSet” object: 1: sampleNames differ between assayData and phenoData
invalid class “ExpressionSet” object: 2: sampleNames differ between phenoData and protocolData

--------------------------------------
Dario Strbenac
PhD Student
University of Sydney
Camperdown NSW 2050
Australia
1 day later
#
Hi Dario,

Which version are you using?

I think the column names of the matrices in the assays of
SummarizedExperiment are coming from the rownames of colData.

My priority is to avoid doubling the memory footprint in object creation. I
think not preserving the colnames of the matrix was relevant to memory usage
at some point, but I forget the exact details.

Mike
On Aug 25, 2014 10:35 PM, "Dario Strbenac" <dstr7320 at uni.sydney.edu.au>
wrote:

  
  
#
I am using the latest release version. I understand your recommendation about colData and will use it.
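
A minimal sketch of that workaround, assuming the fix is simply to give the colData row names that match the column names of the count matrix before constructing the object. `exprMatrix` and `groupsTable` follow the names used in the first message; the `set.seed()` call and the `requireNamespace()` guard are additions for illustration, so that the base R part runs even where DESeq2 is not installed.

```r
# Simulated counts, as in the original example.
set.seed(1)
exprMatrix <- matrix(c(rnbinom(50, 1/0.15, mu = 30),
                       rnbinom(50, 1/0.15, mu = 10)), ncol = 10)
colnames(exprMatrix) <- LETTERS[1:10]

# Sample table whose row names are taken from the matrix,
# so the two sets of sample names cannot disagree.
groupsTable <- data.frame(class = factor(rep(c("Poor", "Good"), each = 5)),
                          row.names = colnames(exprMatrix))
stopifnot(identical(rownames(groupsTable), colnames(exprMatrix)))

# Only attempted when DESeq2 is available.
if (requireNamespace("DESeq2", quietly = TRUE)) {
  exampleDDS <- DESeq2::DESeqDataSetFromMatrix(exprMatrix, groupsTable, ~ class)
  print(colnames(exampleDDS))
}
```

With matching names on both sides, the same `groupsTable` can then be reused for the AnnotatedDataFrame when building the ExpressionSet.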
  
--------------------------------------
 Dario Strbenac
 PhD Student
 University of Sydney
 Camperdown NSW 2050
 Australia
#
Hi Dario,

Here's some example behavior of SummarizedExperiment (here in devel).

The renaming behavior comes from GenomicRanges. In any case, I can't
avoid duplicating memory when the colnames of the matrix conflict with
the rownames of colData, unless I internally overwrite the rownames of
colData. But I don't think I would do this, because the standard is to
let the colData take precedence.

Watch the Vcells (used) in the gc() output:

library(GenomicRanges)
gc()
m = matrix(rnorm(5e6),ncol=100,dimnames=list(1:5e4,paste0("a",1:100)))
gc() # 40 Mb or so taken by m

se = SummarizedExperiment(m)
gc() # no duplication after creating se

rm(se)
se = SummarizedExperiment(m,
  colData=DataFrame(x=1:100,row.names=paste0("b",1:100)))
colnames(se) # rownames of colData take precedence as the colnames of se
colnames(assay(se)) # and replace the colnames of m
gc() # note a duplication,
# because the colnames of the matrix in assay() were replaced

rm(se)
se = SummarizedExperiment(m, colData=DataFrame(x=1:100,row.names=colnames(m)))
gc() # no duplication, same names.
# so you can use this code to insist that
# the colnames of the DESeqDataSet come from the counts matrix

rm(m,se)
m = matrix(rnorm(5e6),ncol=100)
gc()
se = SummarizedExperiment(m, colData=DataFrame(x=1:100))
gc() # no duplication if m has no colnames going in
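
The copy-on-rename effect can also be seen in base R alone, without SummarizedExperiment. The sketch below is an illustration, not what the package does internally: it assumes that replacing the column names of a matrix that is still referenced elsewhere makes R duplicate the data, and it reads the Vcells count from gc(). The helper `vcells_used()` is a hypothetical name introduced here, and the matrix is smaller than the one above to keep the run short.

```r
# Hypothetical helper: vector cells (8 bytes each) in use after a collection.
vcells_used <- function() gc()["Vcells", "used"]

m <- matrix(rnorm(1e6), ncol = 100,
            dimnames = list(NULL, paste0("a", 1:100)))

before <- vcells_used()
m2 <- m                              # second reference to the same data
after_assign <- vcells_used()

colnames(m2) <- paste0("b", 1:100)   # renaming modifies m2, so R copies it
after_rename <- vcells_used()

grew_on_assign <- after_assign - before
grew_on_rename <- after_rename - after_assign
cat("Vcells added by assignment:", grew_on_assign, "\n")
cat("Vcells added by renaming:  ", grew_on_rename, "\n")
```

The assignment alone should add almost nothing, while the rename should add roughly length(m) cells, which mirrors the duplication seen when the rownames of colData differ from the colnames of the matrix.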


R Under development (unstable) (2014-06-05 r65862)
 Platform: x86_64-apple-darwin12.5.0 (64-bit)

 locale:
 [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

 attached base packages:
 [1] parallel  stats     graphics  grDevices datasets  utils     methods
 [8] base

 other attached packages:
 [1] GenomicRanges_1.17.35 GenomeInfoDb_1.1.18   IRanges_1.99.24
 [4] S4Vectors_0.1.2       BiocGenerics_0.11.4   devtools_1.5
 [7] slidify_0.4.5         knitr_1.6             BiocInstaller_1.15.5

 loaded via a namespace (and not attached):
  [1] compiler_3.2.0 digest_0.6.4   evaluate_0.5.5 formatR_0.10   httr_0.4
  [6] markdown_0.7.2 memoise_0.2.1  RCurl_1.95-4.3 stats4_3.2.0   stringr_0.6.2
 [11] tools_3.2.0    whisker_0.3-2  XVector_0.5.7  yaml_2.1.13

On Tue, Aug 26, 2014 at 2:00 AM, Dario Strbenac
<dstr7320 at uni.sydney.edu.au> wrote: