Skip to content

factor-specific variances in lmer

2 messages · Paul Johnson, mikhail matz

#
Dear Misha,

I've copied this to the R-sig-ME list as you're much more likely to get a helpful (and more expert) response, and the problem might be of general interest.

I don't understand the problem well enough to know if it's a reasonable request. It would help to have a better idea of the structure of the data - it seems that sample and gene are crossed. If you're right that gene doesn't need to be in the model (i.e. there's no difference in the residual variances or the means between the levels of the factor gene), then that should be testable by adding gene to the model.

John, I don't understand how adding "... + (gene | sample)" allows the residual variances to vary. This would allow the inter-gene differences in mean expression to vary between samples, but I understand the reviewer as requesting that the residual variances be allowed to differ between genes.

Best wishes,
Paul



-----Original Message-----
From: mikhail matz [mailto:matz at mail.utexas.edu]
Sent: 14 December 2011 08:01
To: Paul Johnson; John Maindonald
Subject: Re: [R-sig-ME] factor-specific variances in lmer

Dear Paul and John -

thanks a lot for your responses. I think I should try to make myself a bit more clear... The main model I am fitting is extremely simple: expression~(1|sample). The reviewer says the residuals from that model should have explicitly modeled gene-specific variances (and gene is not even a factor in this model). My question is, is it even reasonable to demand such a thing?..

Misha


-----Original Message-----
From: John Maindonald [mailto:john.maindonald at anu.edu.au]
Sent: 13 December 2011 00:58
To: Paul Johnson
Cc: mikhail matz; r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] factor-specific variances in lmer

Actually once can fit gene specific variances using lmer.
The random effects term takes the form +(gene | whatever) [gene must be a factor]

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm
On Dec 13, 2011, at 9:46 AM, Paul Johnson wrote:

            
The University of Glasgow, charity number SC004401
#
Dear Paul, John, and other friendly R-sig-ME'ers,

I am glad you find this issue interesting enough for a general discussion; let me elaborate a bit more on the data structure.
Paul, John - please forget for a moment that super-simple model I talked about in the previous message. Let's start by considering the general gene expression problem.

The goal is to determine how expression level of a gene varies across a number of biological conditions. Each condition is represented by several samples. Typically we look at many genes at once, but we are not interested in absolute values of expression or comparisons between genes. So the only factor of interest here is the interaction term, gene:condition.

Other influences that we could have in the model is the fixed effect of gene (the gene's  average expression across samples), fixed effect of condition (reflecting consistent differences in quantity or quality of the samples between conditions), and a random effect of sample, describing the random variation of quantity and quality of biological material across the samples (note that condition and sample effects are confounded). The random sample effect, representing simply how much workable material was in the sample,  influences all genes in the sample in the same way, so there would be no gene:sample interactions. So the full model looks like this:
expression~gene+condition+gene:condition + (1|sample),
and we are only interested in gene:condition part, the rest is nuisance.

Now, the number of genes we are analyzing in this particular method (called quantitative PCR, or qPCR)  is typically small, not exceeding 10. Often we are interested in studying expression of just two-three genes across a huge lot of samples. Since there are so few of the genes, in many cases their average change between conditions will not average to zero. This deviation from zero-average will get absorbed in the condition effect (or, if we try to be tricky and not include it in the model, it will go into the random sample effect) and will be lost from  our view - even though it is biologically valuable. This makes the straightforward application of the full model rather problematic  - which is one of the points I am trying to make in the paper.

The historical solution In qPCR practice was to include a gene the expression of which does not change across analyzed conditions, calculate the sample effects directly as the deviations of this "control gene" expression from the mean, and subtract them away. Then one can dispose of the full many-genes model and run analysis on a gene-by-gene basis,  lm(expression~condition) in the simplest case.

Now, it became recognized that there is no such thing as a perfectly stable control gene. So the field switched to use of several (3 to 5) control genes, which are (i) almost, although not perfectly, stable, and (ii) average to near-zero if their change across conditions in calculated. Their average expression in samples is then used in the same way as the expression of a single control gene - to calculate sample effects, subtract them, and proceed with gene-by-gene analysis.

All is good, but how to identify good control genes for a particular experiment? Different methods exist, but what I propose it to go back to the full model, considering only the candidate control genes this time (no actual "response" genes). We must find genes that show least gene:condition effect. I ran a bunch of simulations and it seems that the most robust inference can be achieved not form the full-model fitting, but in a rather funny way where we  (i) represent expression of each gene as  log(fold-change relative to the mean expression of that gene across samples), which allows us to drop the fixed "gene" effect, (ii) the model is fitted that is oblivious to conditions: expression~(1|sample) , and gene-wise moments of the residuals form this model are compared to see which gene is most variable across and within conditions. 

So the problem is, the reviewer says that in all my models I should explicitly model gene-specific variances of residuals and effects (if any). These variances are indeed different (and finding genes with least variances is one of our tasks), but must we really formulate our models in such a way?

apologies for the long rant

Misha
On Dec 14, 2011, at 9:17 PM, Paul Johnson wrote: