[Bioc-devel] Bug in rma: zero variance of probeset summaries across arrays
Thank you, Holger, Jim, Ben,
(1) this is indeed a peculiar feature of the medianpolish part of RMA. I
apologize to Ben for using the harsh word "bug" inappropriately in this
context. I was misled by observations that using different normalisation
or background correction methods seemed to "fix" the problem, and had
forgotten about those previous threads. Sorry.
I wonder whether anyone has done research on how to improve on
medianpolish for small datasets and why people haven't adapted it.
(2) One further potential problem remains, and I hope that bringing this
up might help to make sure that rma (which is obviously a core component
of Bioconductor) is robust even under the most adverse circumstances
(e.g. myself :). With the script below, I get a core dump. This is on a
freshly checked out and built R (from today) and all packages then
installed from scratch (sessionInfo as in previous mail below). It does
not seem to happen on R-2.8.1. and Bioc release 2.3., and even on R-dev,
whether it happens seems to depend on the history of the R session. One
hypothesis is that this could be related to the "destructive=TRUE"
parameter of rma?
Can anyone reproduce this?
Best wishes
Wolfgang
library("affy")
library("genefilter")
options(error=recover)
files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL", "PP_T2_2c.CEL")
#celdir = tempdir()
#for(f in files)
# download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f),
# file.path(celdir, f))
celdir="."
a = ReadAffy(filenames=file.path(celdir, files))
## Define a CDF with one probeset per probe
testcdf = new.env(hash=TRUE)
for(i in seq_len(nrow(exprs(a))))
assign(paste(i), matrix(i, ncol=1, nrow=1), envir=testcdf)
a at cdfName = "testcdf"
x = rma(a)
----------------------
Background correcting
Normalizing
Calculating Expression
Error in `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL", :
attempt to set colnames on object with less than two dimensions
Enter a frame number, or 0 to exit
1: source("rma2.R")
2: eval.with.vis(ei, envir)
3: eval.with.vis(expr, envir, enclos)
4: rma(a)
5: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",
"PP_T2_1.CEL"
Selection: 4
Called from: eval(expr, envir, enclos)
Browse[1]> ls()
[1] "*tmp*" "background" "bgversion" "cols" "destructive"
[6] "exprs" "ngenes" "normalize" "object" "pNList"
[11] "rows" "subset" "verbose"
Browse[1]> str(exprs)
........ long dump of text ....
496: stop("attempt to set colnames on object with less than two dimensions")
497: `colnames<-`(`*tmp*`, value = c("PP_T1_1.CEL", "PP_T1_2.CEL",
"PP_T2_1.CEL", "PP_T2_2c.CEL"))
498: rma(a)
499: eval.with.vis(expr, envir, enclos)
500: eval.with.vis(ei, envir)
501: source("rma2.R")
Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:
James W. MacDonald wrote:
Hi Holger, Holger Schwender wrote:
Hi Wolfgang, the difference between the two outputs x1 and x2 are due to that in x1 the probe intensities are background-corrected first and then normalized, while in x2 they are first normalized and then background-corrected. The constant values are a feature of median polish: If median polish is applied to four or six samples, then this can lead to constant values for some of the probe sets. This has been mentioned years ago by Jim (I couldn't find his original email on this topic in the mailing list archive).
Actually this only occurs with a small, odd number of samples, not even. Well, it could happen for a small number of probesets if the n/2 and n/2 + 1 ordered values were identical, but it is much more common in the case of 3 or 5 samples. Best, Jim
Best, Holger -------- Original-Nachricht --------
Datum: Tue, 03 Feb 2009 15:38:06 +0100 Von: Wolfgang Huber <huber at ebi.ac.uk> An: Bioconductor Developers <bioc-devel at stat.math.ethz.ch> Betreff: [Bioc-devel] Bug in rma: zero variance of probeset summaries across arrays
Hi,
I am getting puzzling results from the rma function in the affy
package, and furthermore they are different depending on whether
quantile normalisation is called within rma or explicitely. The
puzzling part is that some probesets, in a perfectly ordinary
experiment with 4 different arrays, are summarized with exactly
identical values across all arrays (i.e. zero variance); while the
underlying probe level data for these probesets is variable and
sits right in the middle of the dynamic range (i.e. not at any of
the extremes).
The code: ---------
library("affy") library("genefilter") library("preprocessCore")
options(error=recover)
files = c("PP_T1_1.CEL", "PP_T1_2.CEL", "PP_T2_1.CEL",
"PP_T2_2c.CEL")
for(f in files)
download.file(file.path("http://www.ebi.ac.uk/~huber/pub", f),
file.path(tempdir(), f))
a = ReadAffy(filenames=file.path(tempdir(), files))
x1 = rma(a)
aq = a exprs(aq) = normalize.quantiles(exprs(aq)) x2 = rma(aq,
normalize=FALSE)
probeVars = function(x) { r = rowSds(exprs(x)) ps =
names(which(r==0)) cat(sprintf("%s: %d probes with zero
variance.\n", deparse(substitute(x)), length(ps))) return(ps) }
probeVars(a) ps = probeVars(x1) probeVars(x2)
print(exprs(a)[yeast2cdf[[ps[1]]][,1],])
-----------------------------------------------------------------------------
The output: ----------- Background correcting Normalizing Calculating
Expression Background correcting Calculating Expression
a: 0 probes with zero variance. x1: 3 probes with zero variance. x2:
0 probes with zero variance.
PP_T1_1.CEL PP_T1_2.CEL PP_T2_1.CEL PP_T2_2c.CEL 35244 185
236 269 253 198153 226 314
320 213 48500 375 370 408
373 109094 125 157 201 175 144686
107 131 153 100 214110 181
296 307 234 47302 355 430
397 282 108656 117 130 150
118 147528 158 144 159 135 237691
208 244 253 169 229910 94
138 147 133
sessionInfo()
R version 2.9.0 Under development (unstable) (2009-02-03 r47829) x86_64-unknown-linux-gnu locale: LC_CTYPE=it_IT.UTF-8;LC_NUMERIC=C;LC_TIME=it_IT.UTF-8;LC_COLLATE=it_IT.UTF-8;LC_MONETARY=C;LC_MESSAGES=it_IT.UTF-8;LC_PAPER=it_IT.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=it_IT.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices datasets utils methods base other attached packages: [1] yeast2cdf_2.3.0 preprocessCore_1.5.3 genefilter_1.23.1 [4] affy_1.21.7 Biobase_2.3.10 fortunes_1.3-6 loaded via a namespace (and not attached): [1] affyio_1.11.3 annotate_1.21.3 AnnotationDbi_1.5.14 [4] DBI_0.2-4 RSQLite_0.7-1 splines_2.9.0 [7] survival_2.34-1 tools_2.9.0 xtable_1.5-4 Best wishes Wolfgang PS While trying to dissect this, I also experimented with constructing special CDF environments (e.g. one that had one 'probeset' for each probe), and I noted that repeatedly the C code of rma caused core dumps. Perhaps a better response to unexpected shapes of CDF environments would be to issue an error message. I can provide details if interested. ---------------------------------------------------- Wolfgang Huber, EMBL-EBI, http://www.ebi.ac.uk/huber
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
--
_______________________________________________ Bioc-devel at stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/bioc-devel
---------------------------------------------------- Wolfgang Huber EMBL-EBI http://www.ebi.ac.uk/huber