Mixed model (with interaction) for gene expression and iteration
Sorry for replying to my own emails, but I found partial solution to my problems in Pinheiro and Bates at: 2.4.1 - Likelihood Ratio Tests (p.83) to calculate logLik and p.values for random effects and 2.4.2 - Hypothesis tests for Fixed-Effect Terms (p.87) that is self-explanatory. Although I still don't understand how to calculate the "denominator degrees of freedom". Best, paolo PS. Off-topic quick question: is there currently a "out-of-the-box" solution to get credible intervals for random effects in mcmcsamp()? HPDinterval (from lme4 package) doesn't seem to work for that?
Paolo Innocenti wrote:
Dear Douglas, Rolf, Juan and list, thank you very much for your replies. I now got a good working model, and the use of refit and VarCorr will definitely help. I had a go with mcmcsamp(), and I must confirm that this approach is not feasible, both computationally and because if you get a "false convergence" for, say, 1 gene out of 20, it becomes impossible to go back and fix all the errors. So, the alternative approach seems more promising. If I understand correctly, you suggest to calculate a p-value for random effects out of the LRT (Likelihood ratio test), and use approximated DFs to calculate standard p-values for the fixed effects. I neved used this approach, so I appreciate if you can point me in the right direction. Random effects: I'd need to compare the this three model: m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line)) m1b <- lmer(Y1 ~ sex + (1|line)) m1c <- lmer(Y1 ~ sex) and get the effect of the interaction from m1a-m1b, and the effect of "line" from m1b-m1c. It that correct? Can anyone point me to any kind of documentation/examples to sort out the details? Fixed effects: I don't really know where to get the approximated degrees of freedom. Can you point me to an example? Thanks again for all the help. Eventually, I'll be really happy to share my experience/code when everything is sorted, even if I doubt I can add anything helpful. Best, paolo PS. I don't really understand what you mean by sobj<-summary(result) what object is your "result" here? Juan Pedro Steibel wrote:
Hello Paolo, We are also starting to use lmer for gene expression analysis (genetical genomics too) so here are my thoughts. Having two equivalent parameterizations, I would go with the computationally fastest one (a couple more miliseconds per model, easily add up to many hours when analyzing highly dimensional datasets). You can time the analysis for, say, 100 transcripts and go from there. ~note: other users commented on your models not being equivalent~ For p-values: You could use LRT to test for variance components. This is standard practice in genetic epidemiology: fit a model with and without the random effect in question, then compare the log (residual) likelihood ratio to a chi-square statistic. FDR can be used on top of that to account for multiple tests. Of course, now you have to fit three models (one null models for each VC), so your cpu time has just multiplied by almost 3. Definitely using refit and update will help with compute time when having so many models. I use sobj<-summary(result) as an intermediate step to get the info (although this may add to the computational burden and other suggestions you got may be faster and more efficient), then ask for slots: @REmat @coefs ...for getting estimates of variance components and fixed effects. @coefs gives you a t-statistic for fixed effects too... you could take a stab at an approximated df method and compute an (approximated) p-value. I know that doing so can attract a lot of criticism in this list, but when you are fitting a several million models (10000s of transcripts and 1000s of genomic positions as in my case), the mcmc approximation is (unfortunately) not computationally feasible. Hope this helps! JP Paolo Innocenti wrote:
Dear Douglas and list,
I am thinking about fitting a mixed model for a microarray experiment
using lme4, since other specific software seems not suitable for my
purposes. I'll briefly describe the model and kindly ask for
suggestions on the model and the workflow I can use to get useful
results.
My response variable Y is gene expression levels for a given gene
(say g_i) from 120 samples.
The factor I want to include are:
Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.
I am interested in:
- if the gene is differentially expressed in the 2 sexes (effect of
sex), in the 15 lines (effect of line) and the interaction of the two.
- the variance component of line = how much of the variance is due to
the genotype
- the variance component of the interaction = the genetic variation
for sexual dimorphism.
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the same
residuals) but different parametrization (see self-contained example
below).
Now, in the light of my needs (see above), which model makes it
easier to extract the components I need? Also, do they make different
assumptions - as different levels of independency among levels of
random factors?
I will need to be able to extract the variance component values in an
iterative process (i have 18.000 genes): is VarCorr() the way to go?
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Last two question: what is the easier way to assess, in an iterative
process, normality of residuals, and what is a sensible way to assess
significant differential expression of genes (since I guess I can't
get p-values and then apply FDR correction?)
Thanks a lot for reading so far and I'll be grateful for any kind of
help.
paolo
Self-contained example:
Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Output:
m1
Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1 |
sex:line) AIC BIC logLik deviance REMLdev
-91.13 -77.2 50.57 -111.1 -101.1
Random effects:
Groups Name Variance Std.Dev.
sex:line (Intercept) 0.0023237 0.048205
line (Intercept) 0.0169393 0.130151
Residual 0.0168238 0.129707
Number of obs: 120, groups: sex:line, 30; line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03955 289.72
sexM 0.52992 0.02951 17.96
Correlation of Fixed Effects:
(Intr)
sexM -0.373
m2
Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line (Intercept)
0.0152993 0.123691 sexM 0.0046474 0.068172
0.194 Residual 0.0168238 0.129707 Number of obs:
120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52992 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
m3
Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line sexF 0.015299
0.12369 sexM 0.023227 0.15240 0.899 Residual
0.016824 0.12971 Number of obs: 120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52991 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
Paolo Innocenti Department of Animal Ecology, EBC Uppsala University Norbyv?gen 18D 75236 Uppsala, Sweden