factor-specific variances in lmer

2 messages · Paul Johnson, mikhail matz

Wed, Dec 14, 2011 3:17 AM #

Dear Misha,

I've copied this to the R-sig-ME list as you're much more likely to get a helpful (and more expert) response, and the problem might be of general interest.

I don't understand the problem well enough to know if it's a reasonable request. It would help to have a better idea of the structure of the data - it seems that sample and gene are crossed. If you're right that gene doesn't need to be in the model (i.e. there's no difference in the residual variances or the means between the levels of the factor gene), then that should be testable by adding gene to the model.

John, I don't understand how adding "... + (gene | sample)" allows the residual variances to vary. This would allow the inter-gene differences in mean expression to vary between samples, but I understand the reviewer as requesting that the residual variances be allowed to differ between genes.

Best wishes,
Paul



-----Original Message-----
From: mikhail matz [mailto:matz at mail.utexas.edu]
Sent: 14 December 2011 08:01
To: Paul Johnson; John Maindonald
Subject: Re: [R-sig-ME] factor-specific variances in lmer

Dear Paul and John -

thanks a lot for your responses. I think I should try to make myself a bit more clear... The main model I am fitting is extremely simple: expression~(1|sample). The reviewer says the residuals from that model should have explicitly modeled gene-specific variances (and gene is not even a factor in this model). My question is, is it even reasonable to demand such a thing?..

Misha


-----Original Message-----
From: John Maindonald [mailto:john.maindonald at anu.edu.au]
Sent: 13 December 2011 00:58
To: Paul Johnson
Cc: mikhail matz; r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] factor-specific variances in lmer

Actually once can fit gene specific variances using lmer.
The random effects term takes the form +(gene | whatever) [gene must be a factor]

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm

On Dec 13, 2011, at 9:46 AM, Paul Johnson wrote:

Dear Misha,

Why not just fit both models (heteroscedastic and homoscedastic) and compare them? You can do this with lme, but not lmer afaik.

E.g.
       update(homoscedastic.model,weights=varIdent(form=~1|gene))

will refit the homoscedastic model with a separate variance for each level of the factor "gene", followed by model comparison via

       anova(heteroscedastic.model,homoscedastic.model).

A drawback is that it's trickier and slower to fit crossed random effects in lme, which I guess you have - genes crossed with samples? Pinheiro & Bates (Mixed Effects Models in S and S-Plus, 2000, p163-167) show how to do this using pdBlocked.

Best wishes,
Paul


-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of mikhail matz
Sent: 12 December 2011 23:07
To: r-sig-mixed-models at r-project.org
Subject: [R-sig-ME] factor-specific variances in lmer

Dear colleagues,

A reviewer of my manuscript keeps bringing up this issue, and I ran out ideas and expertise trying to address his concern.
I am looking at application of lmer to analyze gene expression. Typical gene expression problem involves many samples across different conditions, each sample containing measurements for the same collection of genes. Mixed modeling is primarily needed because of the random sample effects (different amount of biological material in each sample, the effect therefore is the same for all genes in the sample). The issue is that both the gene-specific effects of conditions and the residuals would tend to have their own variances. The reviewer requires that the model should explicitly include such gene-specific variances as parameters. He cites a paper that uses SAS to formulate such models. Here is what he says exactly:

"Residual and/or sample*gene effects
should have gene-specific variances. This has been observed for a number of dataset (Steibel et al., 2009) and can be proved to be the case for the dataset used in this model. Ignoring heterogeneous variances can potentially lead to wrong inferences. Authors should made readers aware of this and demonstrate why in this particular dataset homogeneous variance could be safely assumed. If that can't be proved, then the methods presented here should be modified to account for this."

Now, lmer does not even include such an option, and honestly I never before came across such a concern in linear modeling. I wonder why, though. As far as I could gather from Doug's lmer algorithm paper with my rudimentary math training was that this issue is taken care of by the fitting algorithm. Here is what I wrote in rebuttal:

"It is true that a straightforward linear fitting based on least-squares would be affected by unequal variances depending on the values of factors (in our case, genes): the most highly variable gene will have the most leverage in determining the resulting parameter estimates.  It is also true that, for a general case, it is unrealistic to assume that the variances of expression of all genes would be equal. Fortunately, there are efficient algorithmic solutions to this problem.
In particular, the lmer function used here relies on "penalized, iteratively reweighted, least squares" (PIRLS) algorithm, which is specifically designed to counter the artifacts induced by this type of heteroscedasticity. In other words, lmer assumes by default that the variances are factor-specific, and does not require specification and estimation of their exact parameters to deal with it."

As you might guess from the fact that I am still bugging you with all this, the reviewer was not satisfied by this reply:

"I insist that the use of a homoskedastic model under these circumstances (heterogeneous
variances) is incorrect. The author says that PIRLS algorithm is robust to heteroskedasticity and it cites a technical document from Dr. Douglas Bates (http://cran.rproject.
org/web/packages/lme4/vignettes/Theory.pdf). The problem is that the cited document does not support the claim made by the author. Maybe the confusion comed from section 2.5 of the cited document that says: "(It may look as if we have missed the dependence on sigma on the left-hand side but it turns out that the scale parameter does not affect the location of the optimal values of quantities in the linear predictor.)". The fact that "the scale parameter does not directly affect the location of the optimal values of quantities in the linear predictor" does not mean the "algorithm is robust to heteroskedasticity". Actually, if you look at equation 36-39, section 4.2 of the technical document, it is clear that the updating equations do not depend on the absolute value of the variances, but on variance ratios. If the variances change freely across genes, so will their ratios. This is a very well known fact. For example, borrowing from the microarray literature, when multiple genes are analyzed, heterogeneous variances should be explicitly modeled [2]. In summary, a fitting algorithm can not make up for a wrongly specified model in general, and this is particularly true for the linear mixed model and Fisher scoring (PIRLS) algorithms. Consequently, the answer provided and the corresponding statement in the paper is originated in a clear misinterpretation of the companion documentation to lmer. In conclusion, the statement in lines 112-115 is incorrect."

Now, this already goes way beyond my level of expertise. I would tremendously appreciate your comment and advice as to how I am to deal with this.

best regards,

Misha

Mikhail V. Matz
Assistant Professor
University of Texas at Austin
Integrative Biology Section
1 University station C0930
Austin, TX 78712
phone 512-992-8086 cell, 512-475-6424 lab fax 512-471-3878 web http://www.bio.utexas.edu/research/matz_lab

_______________________________________________
R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


The University of Glasgow, charity number SC004401

The University of Glasgow, charity number SC004401

mikhail matz

Wed, Dec 14, 2011 4:55 PM #

Dear Paul, John, and other friendly R-sig-ME'ers,

I am glad you find this issue interesting enough for a general discussion; let me elaborate a bit more on the data structure.
Paul, John - please forget for a moment that super-simple model I talked about in the previous message. Let's start by considering the general gene expression problem.

The goal is to determine how expression level of a gene varies across a number of biological conditions. Each condition is represented by several samples. Typically we look at many genes at once, but we are not interested in absolute values of expression or comparisons between genes. So the only factor of interest here is the interaction term, gene:condition.

Other influences that we could have in the model is the fixed effect of gene (the gene's average expression across samples), fixed effect of condition (reflecting consistent differences in quantity or quality of the samples between conditions), and a random effect of sample, describing the random variation of quantity and quality of biological material across the samples (note that condition and sample effects are confounded). The random sample effect, representing simply how much workable material was in the sample, influences all genes in the sample in the same way, so there would be no gene:sample interactions. So the full model looks like this:
expression~gene+condition+gene:condition + (1|sample),
and we are only interested in gene:condition part, the rest is nuisance.

Now, the number of genes we are analyzing in this particular method (called quantitative PCR, or qPCR) is typically small, not exceeding 10. Often we are interested in studying expression of just two-three genes across a huge lot of samples. Since there are so few of the genes, in many cases their average change between conditions will not average to zero. This deviation from zero-average will get absorbed in the condition effect (or, if we try to be tricky and not include it in the model, it will go into the random sample effect) and will be lost from our view - even though it is biologically valuable. This makes the straightforward application of the full model rather problematic - which is one of the points I am trying to make in the paper.

The historical solution In qPCR practice was to include a gene the expression of which does not change across analyzed conditions, calculate the sample effects directly as the deviations of this "control gene" expression from the mean, and subtract them away. Then one can dispose of the full many-genes model and run analysis on a gene-by-gene basis, lm(expression~condition) in the simplest case.

Now, it became recognized that there is no such thing as a perfectly stable control gene. So the field switched to use of several (3 to 5) control genes, which are (i) almost, although not perfectly, stable, and (ii) average to near-zero if their change across conditions in calculated. Their average expression in samples is then used in the same way as the expression of a single control gene - to calculate sample effects, subtract them, and proceed with gene-by-gene analysis.

All is good, but how to identify good control genes for a particular experiment? Different methods exist, but what I propose it to go back to the full model, considering only the candidate control genes this time (no actual "response" genes). We must find genes that show least gene:condition effect. I ran a bunch of simulations and it seems that the most robust inference can be achieved not form the full-model fitting, but in a rather funny way where we (i) represent expression of each gene as log(fold-change relative to the mean expression of that gene across samples), which allows us to drop the fixed "gene" effect, (ii) the model is fitted that is oblivious to conditions: expression~(1|sample) , and gene-wise moments of the residuals form this model are compared to see which gene is most variable across and within conditions.

So the problem is, the reviewer says that in all my models I should explicitly model gene-specific variances of residuals and effects (if any). These variances are indeed different (and finding genes with least variances is one of our tasks), but must we really formulate our models in such a way?

apologies for the long rant

Misha

On Dec 14, 2011, at 9:17 PM, Paul Johnson wrote:

Dear Misha,

I've copied this to the R-sig-ME list as you're much more likely to get a helpful (and more expert) response, and the problem might be of general interest.

I don't understand the problem well enough to know if it's a reasonable request. It would help to have a better idea of the structure of the data - it seems that sample and gene are crossed. If you're right that gene doesn't need to be in the model (i.e. there's no difference in the residual variances or the means between the levels of the factor gene), then that should be testable by adding gene to the model.

John, I don't understand how adding "... + (gene | sample)" allows the residual variances to vary. This would allow the inter-gene differences in mean expression to vary between samples, but I understand the reviewer as requesting that the residual variances be allowed to differ between genes.

Best wishes,
Paul



-----Original Message-----
From: mikhail matz [mailto:matz at mail.utexas.edu]
Sent: 14 December 2011 08:01
To: Paul Johnson; John Maindonald
Subject: Re: [R-sig-ME] factor-specific variances in lmer

Dear Paul and John -

thanks a lot for your responses. I think I should try to make myself a bit more clear... The main model I am fitting is extremely simple: expression~(1|sample). The reviewer says the residuals from that model should have explicitly modeled gene-specific variances (and gene is not even a factor in this model). My question is, is it even reasonable to demand such a thing?..

Misha


-----Original Message-----
From: John Maindonald [mailto:john.maindonald at anu.edu.au]
Sent: 13 December 2011 00:58
To: Paul Johnson
Cc: mikhail matz; r-sig-mixed-models at r-project.org
Subject: Re: [R-sig-ME] factor-specific variances in lmer

Actually once can fit gene specific variances using lmer.
The random effects term takes the form +(gene | whatever) [gene must be a factor]

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Mathematics & Its Applications, Room 1194, John Dedman Mathematical Sciences Building (Building 27) Australian National University, Canberra ACT 0200.
http://www.maths.anu.edu.au/~johnm

On Dec 13, 2011, at 9:46 AM, Paul Johnson wrote:

Dear Misha,

Why not just fit both models (heteroscedastic and homoscedastic) and compare them? You can do this with lme, but not lmer afaik.

E.g.
      update(homoscedastic.model,weights=varIdent(form=~1|gene))

will refit the homoscedastic model with a separate variance for each level of the factor "gene", followed by model comparison via

      anova(heteroscedastic.model,homoscedastic.model).

A drawback is that it's trickier and slower to fit crossed random effects in lme, which I guess you have - genes crossed with samples? Pinheiro & Bates (Mixed Effects Models in S and S-Plus, 2000, p163-167) show how to do this using pdBlocked.

Best wishes,
Paul


-----Original Message-----
From: r-sig-mixed-models-bounces at r-project.org [mailto:r-sig-mixed-models-bounces at r-project.org] On Behalf Of mikhail matz
Sent: 12 December 2011 23:07
To: r-sig-mixed-models at r-project.org
Subject: [R-sig-ME] factor-specific variances in lmer

Dear colleagues,

A reviewer of my manuscript keeps bringing up this issue, and I ran out ideas and expertise trying to address his concern.
I am looking at application of lmer to analyze gene expression. Typical gene expression problem involves many samples across different conditions, each sample containing measurements for the same collection of genes. Mixed modeling is primarily needed because of the random sample effects (different amount of biological material in each sample, the effect therefore is the same for all genes in the sample). The issue is that both the gene-specific effects of conditions and the residuals would tend to have their own variances. The reviewer requires that the model should explicitly include such gene-specific variances as parameters. He cites a paper that uses SAS to formulate such models. Here is what he says exactly:

"Residual and/or sample*gene effects
should have gene-specific variances. This has been observed for a number of dataset (Steibel et al., 2009) and can be proved to be the case for the dataset used in this model. Ignoring heterogeneous variances can potentially lead to wrong inferences. Authors should made readers aware of this and demonstrate why in this particular dataset homogeneous variance could be safely assumed. If that can't be proved, then the methods presented here should be modified to account for this."

Now, lmer does not even include such an option, and honestly I never before came across such a concern in linear modeling. I wonder why, though. As far as I could gather from Doug's lmer algorithm paper with my rudimentary math training was that this issue is taken care of by the fitting algorithm. Here is what I wrote in rebuttal:

"It is true that a straightforward linear fitting based on least-squares would be affected by unequal variances depending on the values of factors (in our case, genes): the most highly variable gene will have the most leverage in determining the resulting parameter estimates.  It is also true that, for a general case, it is unrealistic to assume that the variances of expression of all genes would be equal. Fortunately, there are efficient algorithmic solutions to this problem.
In particular, the lmer function used here relies on "penalized, iteratively reweighted, least squares" (PIRLS) algorithm, which is specifically designed to counter the artifacts induced by this type of heteroscedasticity. In other words, lmer assumes by default that the variances are factor-specific, and does not require specification and estimation of their exact parameters to deal with it."

As you might guess from the fact that I am still bugging you with all this, the reviewer was not satisfied by this reply:

"I insist that the use of a homoskedastic model under these circumstances (heterogeneous
variances) is incorrect. The author says that PIRLS algorithm is robust to heteroskedasticity and it cites a technical document from Dr. Douglas Bates (http://cran.rproject.
org/web/packages/lme4/vignettes/Theory.pdf). The problem is that the cited document does not support the claim made by the author. Maybe the confusion comed from section 2.5 of the cited document that says: "(It may look as if we have missed the dependence on sigma on the left-hand side but it turns out that the scale parameter does not affect the location of the optimal values of quantities in the linear predictor.)". The fact that "the scale parameter does not directly affect the location of the optimal values of quantities in the linear predictor" does not mean the "algorithm is robust to heteroskedasticity". Actually, if you look at equation 36-39, section 4.2 of the technical document, it is clear that the updating equations do not depend on the absolute value of the variances, but on variance ratios. If the variances change freely across genes, so will their ratios. This is a very well known fact. For example, borrowing from the microarray literature, when multiple genes are analyzed, heterogeneous variances should be explicitly modeled [2]. In summary, a fitting algorithm can not make up for a wrongly specified model in general, and this is particularly true for the linear mixed model and Fisher scoring (PIRLS) algorithms. Consequently, the answer provided and the corresponding statement in the paper is originated in a clear misinterpretation of the companion documentation to lmer. In conclusion, the statement in lines 112-115 is incorrect."

Now, this already goes way beyond my level of expertise. I would tremendously appreciate your comment and advice as to how I am to deal with this.

best regards,

Misha

Mikhail V. Matz
Assistant Professor
University of Texas at Austin
Integrative Biology Section
1 University station C0930
Austin, TX 78712
phone 512-992-8086 cell, 512-475-6424 lab fax 512-471-3878 web http://www.bio.utexas.edu/research/matz_lab

_______________________________________________
R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


The University of Glasgow, charity number SC004401