Skip to content
Prev 7158 / 20628 Next

factor-specific variances in lmer

Dear Paul, John, and other friendly R-sig-ME'ers,

I am glad you find this issue interesting enough for a general discussion; let me elaborate a bit more on the data structure.
Paul, John - please forget for a moment that super-simple model I talked about in the previous message. Let's start by considering the general gene expression problem.

The goal is to determine how expression level of a gene varies across a number of biological conditions. Each condition is represented by several samples. Typically we look at many genes at once, but we are not interested in absolute values of expression or comparisons between genes. So the only factor of interest here is the interaction term, gene:condition.

Other influences that we could have in the model is the fixed effect of gene (the gene's  average expression across samples), fixed effect of condition (reflecting consistent differences in quantity or quality of the samples between conditions), and a random effect of sample, describing the random variation of quantity and quality of biological material across the samples (note that condition and sample effects are confounded). The random sample effect, representing simply how much workable material was in the sample,  influences all genes in the sample in the same way, so there would be no gene:sample interactions. So the full model looks like this:
expression~gene+condition+gene:condition + (1|sample),
and we are only interested in gene:condition part, the rest is nuisance.

Now, the number of genes we are analyzing in this particular method (called quantitative PCR, or qPCR)  is typically small, not exceeding 10. Often we are interested in studying expression of just two-three genes across a huge lot of samples. Since there are so few of the genes, in many cases their average change between conditions will not average to zero. This deviation from zero-average will get absorbed in the condition effect (or, if we try to be tricky and not include it in the model, it will go into the random sample effect) and will be lost from  our view - even though it is biologically valuable. This makes the straightforward application of the full model rather problematic  - which is one of the points I am trying to make in the paper.

The historical solution In qPCR practice was to include a gene the expression of which does not change across analyzed conditions, calculate the sample effects directly as the deviations of this "control gene" expression from the mean, and subtract them away. Then one can dispose of the full many-genes model and run analysis on a gene-by-gene basis,  lm(expression~condition) in the simplest case.

Now, it became recognized that there is no such thing as a perfectly stable control gene. So the field switched to use of several (3 to 5) control genes, which are (i) almost, although not perfectly, stable, and (ii) average to near-zero if their change across conditions in calculated. Their average expression in samples is then used in the same way as the expression of a single control gene - to calculate sample effects, subtract them, and proceed with gene-by-gene analysis.

All is good, but how to identify good control genes for a particular experiment? Different methods exist, but what I propose it to go back to the full model, considering only the candidate control genes this time (no actual "response" genes). We must find genes that show least gene:condition effect. I ran a bunch of simulations and it seems that the most robust inference can be achieved not form the full-model fitting, but in a rather funny way where we  (i) represent expression of each gene as  log(fold-change relative to the mean expression of that gene across samples), which allows us to drop the fixed "gene" effect, (ii) the model is fitted that is oblivious to conditions: expression~(1|sample) , and gene-wise moments of the residuals form this model are compared to see which gene is most variable across and within conditions. 

So the problem is, the reviewer says that in all my models I should explicitly model gene-specific variances of residuals and effects (if any). These variances are indeed different (and finding genes with least variances is one of our tasks), but must we really formulate our models in such a way?

apologies for the long rant

Misha
On Dec 14, 2011, at 9:17 PM, Paul Johnson wrote: