Dear Douglas and list,
I am thinking about fitting a mixed model for a microarray experiment
using lme4, since other specific software seems not suitable for my
purposes. I'll briefly describe the model and kindly ask for suggestions
on the model and the workflow I can use to get useful results.
My response variable Y is gene expression levels for a given gene (say
g_i) from 120 samples.
The factor I want to include are:
Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.
I am interested in:
- if the gene is differentially expressed in the 2 sexes (effect of
sex), in the 15 lines (effect of line) and the interaction of the two.
- the variance component of line = how much of the variance is due to
the genotype
- the variance component of the interaction = the genetic variation for
sexual dimorphism.
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the same
residuals) but different parametrization (see self-contained example
below).
Now, in the light of my needs (see above), which model makes it easier
to extract the components I need? Also, do they make different
assumptions - as different levels of independency among levels of random
factors?
I will need to be able to extract the variance component values in an
iterative process (i have 18.000 genes): is VarCorr() the way to go?
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Last two question: what is the easier way to assess, in an iterative
process, normality of residuals, and what is a sensible way to assess
significant differential expression of genes (since I guess I can't get
p-values and then apply FDR correction?)
Thanks a lot for reading so far and I'll be grateful for any kind of help.
paolo
Self-contained example:
Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Output:
m1
Linear mixed model fit by REML
Formula: Y1 ~ sex + (1 | line) + (1 | sex:line)
AIC BIC logLik deviance REMLdev
-91.13 -77.2 50.57 -111.1 -101.1
Random effects:
Groups Name Variance Std.Dev.
sex:line (Intercept) 0.0023237 0.048205
line (Intercept) 0.0169393 0.130151
Residual 0.0168238 0.129707
Number of obs: 120, groups: sex:line, 30; line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03955 289.72
sexM 0.52992 0.02951 17.96
Correlation of Fixed Effects:
(Intr)
sexM -0.373
m2
Linear mixed model fit by REML
Formula: Y1 ~ sex + (sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr
line (Intercept) 0.0152993 0.123691
sexM 0.0046474 0.068172 0.194
Residual 0.0168238 0.129707
Number of obs: 120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52992 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
m3
Linear mixed model fit by REML
Formula: Y1 ~ sex + (0 + sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr
line sexF 0.015299 0.12369
sexM 0.023227 0.15240 0.899
Residual 0.016824 0.12971
Number of obs: 120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52991 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
Paolo Innocenti
Department of Animal Ecology, EBC
Uppsala University
Norbyv?gen 18D
75236 Uppsala, Sweden
I am thinking about fitting a mixed model for a microarray experiment using
lme4, since other specific software seems not suitable for my purposes. I'll
briefly describe the model and kindly ask for suggestions on the model and
the workflow I can use to get useful results.
My response variable Y is gene expression levels for a given gene (say g_i)
from 120 samples.
The factor I want to include are:
Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.
I am interested in:
- if the gene is differentially expressed in the 2 sexes (effect of sex), in
the 15 lines (effect of line) and the interaction of the two.
- the variance component of line = how much of the variance is due to the
genotype
- the variance component of the interaction = the genetic variation for
sexual dimorphism.
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the same
residuals) but different parametrization (see self-contained example below).
Not really. m2 and m3 are different parameterizations of the same
model but m1 is different. In models m2 and m3 the random effects for
F and for M are allowed to be correlated within a line. Thus there
are three parameters determining the variance-covariance of the random
effects. In model m1 the unconditional distribution of the random
effects for the sex:line interaction is independent of the
distribution of the random effects for line.
Now, in the light of my needs (see above), which model makes it easier to
extract the components I need? Also, do they make different assumptions - as
different levels of independency among levels of random factors?
I would use either m1 or m3 because I find the correlation structure
in m3 easier to understand than that of m2.
I will need to be able to extract the variance component values in an
iterative process (i have 18.000 genes): is VarCorr() the way to go?
I think so. I would definitely look into using refit to update the
fitted model rather than starting from scratch with each of the genes.
I would use the idiom
unlist(VarCorr(m1))
sex:line line
0.002323722 0.016939242
to get the estimated variances.
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Last two question: what is the easier way to assess, in an iterative
process, normality of residuals, and what is a sensible way to assess
significant differential expression of genes (since I guess I can't get
p-values and then apply FDR correction?)
Thanks a lot for reading so far and I'll be grateful for any kind of help.
paolo
Self-contained example:
Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Output:
m1
Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1 |
sex:line) ? ?AIC ? BIC logLik deviance REMLdev
?-91.13 -77.2 ?50.57 ? -111.1 ?-101.1
Random effects:
?Groups ? Name ? ? ? ?Variance ?Std.Dev.
?sex:line (Intercept) 0.0023237 0.048205
?line ? ? (Intercept) 0.0169393 0.130151
?Residual ? ? ? ? ? ? 0.0168238 0.129707
Number of obs: 120, groups: sex:line, 30; line, 15
Fixed effects:
? ? ? ? ? ?Estimate Std. Error t value
(Intercept) 11.45977 ? ?0.03955 ?289.72
sexM ? ? ? ? 0.52992 ? ?0.02951 ? 17.96
Correlation of Fixed Effects:
? ? (Intr)
sexM -0.373
m2
Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line) ?AIC
?BIC logLik deviance REMLdev
?-90 -73.27 ? ? 51 ? -112.1 ? ?-102
Random effects:
?Groups ? Name ? ? ? ?Variance ?Std.Dev. Corr ??line ? ? (Intercept)
0.0152993 0.123691 ? ? ? ? ? ? ? ?sexM ? ? ? ?0.0046474 0.068172 0.194
?Residual ? ? ? ? ? ? 0.0168238 0.129707 ? ? ? Number of obs: 120, groups:
line, 15
Fixed effects:
? ? ? ? ? ?Estimate Std. Error t value
(Intercept) 11.45977 ? ?0.03606 ? 317.8
sexM ? ? ? ? 0.52992 ? ?0.02951 ? ?18.0
Correlation of Fixed Effects:
? ? (Intr)
sexM -0.161
m3
Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line) ?AIC
?BIC logLik deviance REMLdev
?-90 -73.27 ? ? 51 ? -112.1 ? ?-102
Random effects:
?Groups ? Name Variance Std.Dev. Corr ??line ? ? sexF 0.015299 0.12369
? ? ? ? ? ? ?sexM 0.023227 0.15240 ?0.899 ?Residual ? ? ?0.016824 0.12971
? ? ?Number of obs: 120, groups: line, 15
Fixed effects:
? ? ? ? ? ?Estimate Std. Error t value
(Intercept) 11.45977 ? ?0.03606 ? 317.8
sexM ? ? ? ? 0.52991 ? ?0.02951 ? ?18.0
Correlation of Fixed Effects:
? ? (Intr)
sexM -0.161
--
Paolo Innocenti
Department of Animal Ecology, EBC
Uppsala University
Norbyv?gen 18D
75236 Uppsala, Sweden
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the same
residuals) but different parametrization (see self-contained example
below).
<snip>
I don't think that the first two models are indeed the same model.
My understanding --- which is very limited --- is that
lmer(Y1 ~ sex + (1|line) + (1|sex:line))
gives a model in which the effect of line j on sex 1 (say X_1j) and
and the effect of line j on sex 2 (say X_2j) are uncorrelated. I.e.
X_1j and X_2j have covariance matrix of the form sigma^2 * I, where I
is the 2 x 2 identity. Thus one random effect parameter is
contributed.
In contrast,
lmer(Y1 ~ sex + (sex|line))
gives a model in which correlation between X_1j and X_2j, i.e. their
covariance matrix is a ``general'' 2 x 2 positive definite matrix.
Thus three random effect parameters are contributed.
See
http://www.nabble.com/lme-nesting-interaction-advice-
td17131600i20.html#a17213604
for the posting from Doug Bates upon which I am basing my
understanding.
Compare:
lmer(score ~ Machine + (1|Worker/Machine), Machines)
and
lmer(score ~ Machine + (Machine|Worker), Machines)
with your proposed models.
I hope that I have not misinterpreted Prof. Bates' explanation.
cheers,
Rolf Turner
######################################################################
Attention:\ This e-mail message is privileged and confid...{{dropped:9}}
Hello Paolo,
We are also starting to use lmer for gene expression analysis (genetical
genomics too) so here are my thoughts.
Having two equivalent parameterizations, I would go with the
computationally fastest one (a couple more miliseconds per model, easily
add up to many hours when analyzing highly dimensional datasets). You
can time the analysis for, say, 100 transcripts and go from there.
~note: other users commented on your models not being equivalent~
For p-values:
You could use LRT to test for variance components.
This is standard practice in genetic epidemiology: fit a model with and
without the random effect in question, then compare the log (residual)
likelihood ratio to a chi-square statistic. FDR can be used on top of
that to account for multiple tests. Of course, now you have to fit three
models (one null models for each VC), so your cpu time has just
multiplied by almost 3. Definitely using refit and update will help with
compute time when having so many models.
I use sobj<-summary(result) as an intermediate step to get the info
(although this may add to the computational burden and other suggestions
you got may be faster and more efficient),
then ask for slots:
@REmat
@coefs
...for getting estimates of variance components and fixed effects.
@coefs gives you a t-statistic for fixed effects too... you could take a
stab at an approximated df method and compute an (approximated) p-value.
I know that doing so can attract a lot of criticism in this list, but
when you are fitting a several million models (10000s of transcripts and
1000s of genomic positions as in my case), the mcmc approximation is
(unfortunately) not computationally feasible.
Hope this helps!
JP
Paolo Innocenti wrote:
Dear Douglas and list,
I am thinking about fitting a mixed model for a microarray experiment
using lme4, since other specific software seems not suitable for my
purposes. I'll briefly describe the model and kindly ask for
suggestions on the model and the workflow I can use to get useful
results.
My response variable Y is gene expression levels for a given gene (say
g_i) from 120 samples.
The factor I want to include are:
Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.
I am interested in:
- if the gene is differentially expressed in the 2 sexes (effect of
sex), in the 15 lines (effect of line) and the interaction of the two.
- the variance component of line = how much of the variance is due to
the genotype
- the variance component of the interaction = the genetic variation
for sexual dimorphism.
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the same
residuals) but different parametrization (see self-contained example
below).
Now, in the light of my needs (see above), which model makes it easier
to extract the components I need? Also, do they make different
assumptions - as different levels of independency among levels of
random factors?
I will need to be able to extract the variance component values in an
iterative process (i have 18.000 genes): is VarCorr() the way to go?
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Last two question: what is the easier way to assess, in an iterative
process, normality of residuals, and what is a sensible way to assess
significant differential expression of genes (since I guess I can't
get p-values and then apply FDR correction?)
Thanks a lot for reading so far and I'll be grateful for any kind of
help.
paolo
Self-contained example:
Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Output:
m1
Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1 |
sex:line) AIC BIC logLik deviance REMLdev
-91.13 -77.2 50.57 -111.1 -101.1
Random effects:
Groups Name Variance Std.Dev.
sex:line (Intercept) 0.0023237 0.048205
line (Intercept) 0.0169393 0.130151
Residual 0.0168238 0.129707
Number of obs: 120, groups: sex:line, 30; line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03955 289.72
sexM 0.52992 0.02951 17.96
Correlation of Fixed Effects:
(Intr)
sexM -0.373
m2
Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line (Intercept)
0.0152993 0.123691 sexM 0.0046474 0.068172
0.194 Residual 0.0168238 0.129707 Number of obs:
120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52992 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
m3
Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line sexF 0.015299
0.12369 sexM 0.023227 0.15240 0.899 Residual
0.016824 0.12971 Number of obs: 120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52991 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
=============================
Juan Pedro Steibel
Assistant Professor
Statistical Genetics and Genomics
Department of Animal Science &
Department of Fisheries and Wildlife
Michigan State University
1205-I Anthony Hall
East Lansing, MI
48824 USA
Phone: 1-517-353-5102
E-mail: steibelj at msu.edu
Dear Douglas, Rolf, Juan and list,
thank you very much for your replies.
I now got a good working model, and the use of refit and VarCorr will
definitely help.
I had a go with mcmcsamp(), and I must confirm that this approach is not
feasible, both computationally and because if you get a "false
convergence" for, say, 1 gene out of 20, it becomes impossible to go
back and fix all the errors.
So, the alternative approach seems more promising. If I understand
correctly, you suggest to calculate a p-value for random effects out of
the LRT (Likelihood ratio test), and use approximated DFs to calculate
standard p-values for the fixed effects.
I neved used this approach, so I appreciate if you can point me in the
right direction.
Random effects:
I'd need to compare the this three model:
m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m1b <- lmer(Y1 ~ sex + (1|line))
m1c <- lmer(Y1 ~ sex)
and get the effect of the interaction from m1a-m1b,
and the effect of "line" from m1b-m1c.
It that correct?
Can anyone point me to any kind of documentation/examples to sort out
the details?
Fixed effects:
I don't really know where to get the approximated degrees of freedom.
Can you point me to an example?
Thanks again for all the help. Eventually, I'll be really happy to share
my experience/code when everything is sorted, even if I doubt I can add
anything helpful.
Best,
paolo
PS. I don't really understand what you mean by
sobj<-summary(result)
what object is your "result" here?
Juan Pedro Steibel wrote:
Hello Paolo,
We are also starting to use lmer for gene expression analysis (genetical
genomics too) so here are my thoughts.
Having two equivalent parameterizations, I would go with the
computationally fastest one (a couple more miliseconds per model, easily
add up to many hours when analyzing highly dimensional datasets). You
can time the analysis for, say, 100 transcripts and go from there.
~note: other users commented on your models not being equivalent~
For p-values:
You could use LRT to test for variance components.
This is standard practice in genetic epidemiology: fit a model with and
without the random effect in question, then compare the log (residual)
likelihood ratio to a chi-square statistic. FDR can be used on top of
that to account for multiple tests. Of course, now you have to fit three
models (one null models for each VC), so your cpu time has just
multiplied by almost 3. Definitely using refit and update will help with
compute time when having so many models.
I use sobj<-summary(result) as an intermediate step to get the info
(although this may add to the computational burden and other suggestions
you got may be faster and more efficient),
then ask for slots:
@REmat
@coefs
...for getting estimates of variance components and fixed effects.
@coefs gives you a t-statistic for fixed effects too... you could take a
stab at an approximated df method and compute an (approximated) p-value.
I know that doing so can attract a lot of criticism in this list, but
when you are fitting a several million models (10000s of transcripts and
1000s of genomic positions as in my case), the mcmc approximation is
(unfortunately) not computationally feasible.
Hope this helps!
JP
Paolo Innocenti wrote:
Dear Douglas and list,
I am thinking about fitting a mixed model for a microarray experiment
using lme4, since other specific software seems not suitable for my
purposes. I'll briefly describe the model and kindly ask for
suggestions on the model and the workflow I can use to get useful
results.
My response variable Y is gene expression levels for a given gene (say
g_i) from 120 samples.
The factor I want to include are:
Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.
I am interested in:
- if the gene is differentially expressed in the 2 sexes (effect of
sex), in the 15 lines (effect of line) and the interaction of the two.
- the variance component of line = how much of the variance is due to
the genotype
- the variance component of the interaction = the genetic variation
for sexual dimorphism.
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the same
residuals) but different parametrization (see self-contained example
below).
Now, in the light of my needs (see above), which model makes it easier
to extract the components I need? Also, do they make different
assumptions - as different levels of independency among levels of
random factors?
I will need to be able to extract the variance component values in an
iterative process (i have 18.000 genes): is VarCorr() the way to go?
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Last two question: what is the easier way to assess, in an iterative
process, normality of residuals, and what is a sensible way to assess
significant differential expression of genes (since I guess I can't
get p-values and then apply FDR correction?)
Thanks a lot for reading so far and I'll be grateful for any kind of
help.
paolo
Self-contained example:
Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Output:
m1
Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1 |
sex:line) AIC BIC logLik deviance REMLdev
-91.13 -77.2 50.57 -111.1 -101.1
Random effects:
Groups Name Variance Std.Dev.
sex:line (Intercept) 0.0023237 0.048205
line (Intercept) 0.0169393 0.130151
Residual 0.0168238 0.129707
Number of obs: 120, groups: sex:line, 30; line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03955 289.72
sexM 0.52992 0.02951 17.96
Correlation of Fixed Effects:
(Intr)
sexM -0.373
m2
Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line (Intercept)
0.0152993 0.123691 sexM 0.0046474 0.068172
0.194 Residual 0.0168238 0.129707 Number of obs:
120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52992 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
m3
Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line sexF 0.015299
0.12369 sexM 0.023227 0.15240 0.899 Residual
0.016824 0.12971 Number of obs: 120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52991 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
Paolo Innocenti
Department of Animal Ecology, EBC
Uppsala University
Norbyv?gen 18D
75236 Uppsala, Sweden
Sorry for replying to my own emails, but I found partial solution to my
problems in Pinheiro and Bates at:
2.4.1 - Likelihood Ratio Tests (p.83)
to calculate logLik and p.values for random effects
and
2.4.2 - Hypothesis tests for Fixed-Effect Terms (p.87)
that is self-explanatory. Although I still don't understand how to
calculate the "denominator degrees of freedom".
Best,
paolo
PS. Off-topic quick question: is there currently a "out-of-the-box"
solution to get credible intervals for random effects in mcmcsamp()?
HPDinterval (from lme4 package) doesn't seem to work for that?
Paolo Innocenti wrote:
Dear Douglas, Rolf, Juan and list,
thank you very much for your replies.
I now got a good working model, and the use of refit and VarCorr will
definitely help.
I had a go with mcmcsamp(), and I must confirm that this approach is not
feasible, both computationally and because if you get a "false
convergence" for, say, 1 gene out of 20, it becomes impossible to go
back and fix all the errors.
So, the alternative approach seems more promising. If I understand
correctly, you suggest to calculate a p-value for random effects out of
the LRT (Likelihood ratio test), and use approximated DFs to calculate
standard p-values for the fixed effects.
I neved used this approach, so I appreciate if you can point me in the
right direction.
Random effects:
I'd need to compare the this three model:
m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m1b <- lmer(Y1 ~ sex + (1|line))
m1c <- lmer(Y1 ~ sex)
and get the effect of the interaction from m1a-m1b,
and the effect of "line" from m1b-m1c.
It that correct?
Can anyone point me to any kind of documentation/examples to sort out
the details?
Fixed effects:
I don't really know where to get the approximated degrees of freedom.
Can you point me to an example?
Thanks again for all the help. Eventually, I'll be really happy to share
my experience/code when everything is sorted, even if I doubt I can add
anything helpful.
Best,
paolo
PS. I don't really understand what you mean by
sobj<-summary(result)
what object is your "result" here?
Juan Pedro Steibel wrote:
Hello Paolo,
We are also starting to use lmer for gene expression analysis
(genetical genomics too) so here are my thoughts.
Having two equivalent parameterizations, I would go with the
computationally fastest one (a couple more miliseconds per model,
easily add up to many hours when analyzing highly dimensional
datasets). You can time the analysis for, say, 100 transcripts and go
from there.
~note: other users commented on your models not being equivalent~
For p-values:
You could use LRT to test for variance components.
This is standard practice in genetic epidemiology: fit a model with
and without the random effect in question, then compare the log
(residual) likelihood ratio to a chi-square statistic. FDR can be used
on top of that to account for multiple tests. Of course, now you have
to fit three models (one null models for each VC), so your cpu time
has just multiplied by almost 3. Definitely using refit and update
will help with compute time when having so many models.
I use sobj<-summary(result) as an intermediate step to get the info
(although this may add to the computational burden and other
suggestions you got may be faster and more efficient),
then ask for slots:
@REmat
@coefs
...for getting estimates of variance components and fixed effects.
@coefs gives you a t-statistic for fixed effects too... you could take
a stab at an approximated df method and compute an (approximated)
p-value.
I know that doing so can attract a lot of criticism in this list, but
when you are fitting a several million models (10000s of transcripts
and 1000s of genomic positions as in my case), the mcmc approximation
is (unfortunately) not computationally feasible.
Hope this helps!
JP
Paolo Innocenti wrote:
Dear Douglas and list,
I am thinking about fitting a mixed model for a microarray experiment
using lme4, since other specific software seems not suitable for my
purposes. I'll briefly describe the model and kindly ask for
suggestions on the model and the workflow I can use to get useful
results.
My response variable Y is gene expression levels for a given gene
(say g_i) from 120 samples.
The factor I want to include are:
Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.
I am interested in:
- if the gene is differentially expressed in the 2 sexes (effect of
sex), in the 15 lines (effect of line) and the interaction of the two.
- the variance component of line = how much of the variance is due to
the genotype
- the variance component of the interaction = the genetic variation
for sexual dimorphism.
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the same
residuals) but different parametrization (see self-contained example
below).
Now, in the light of my needs (see above), which model makes it
easier to extract the components I need? Also, do they make different
assumptions - as different levels of independency among levels of
random factors?
I will need to be able to extract the variance component values in an
iterative process (i have 18.000 genes): is VarCorr() the way to go?
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Last two question: what is the easier way to assess, in an iterative
process, normality of residuals, and what is a sensible way to assess
significant differential expression of genes (since I guess I can't
get p-values and then apply FDR correction?)
Thanks a lot for reading so far and I'll be grateful for any kind of
help.
paolo
Self-contained example:
Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Output:
m1
Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1 |
sex:line) AIC BIC logLik deviance REMLdev
-91.13 -77.2 50.57 -111.1 -101.1
Random effects:
Groups Name Variance Std.Dev.
sex:line (Intercept) 0.0023237 0.048205
line (Intercept) 0.0169393 0.130151
Residual 0.0168238 0.129707
Number of obs: 120, groups: sex:line, 30; line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03955 289.72
sexM 0.52992 0.02951 17.96
Correlation of Fixed Effects:
(Intr)
sexM -0.373
m2
Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line (Intercept)
0.0152993 0.123691 sexM 0.0046474 0.068172
0.194 Residual 0.0168238 0.129707 Number of obs:
120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52992 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
m3
Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line sexF 0.015299
0.12369 sexM 0.023227 0.15240 0.899 Residual
0.016824 0.12971 Number of obs: 120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52991 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
Paolo Innocenti
Department of Animal Ecology, EBC
Uppsala University
Norbyv?gen 18D
75236 Uppsala, Sweden
Hello paolo,
sobj<-summary(result), where result is m1a, m1b or m1c.
Not that m1c is actually a fixed effects model, and can not be fit with
lmer, so you'll need to reconstruct the -2LL from another source. But
with a normal model that should not be a problem.
Fabian Scheipl answered in a post that the LRT chisquare(1) was wrong
and said that al least a mixture of chisquares should be used. I did not
explained myself very well, but I meant to recommend the mixture of
chisquare and pointmass when testing a single variance component.
Fabian's other method could be used, provided it will not take eons to
get the results for a high dimensional dataset.
But keep reading and you will find a good reason to hammer on me for
recomendations on df and computing p-values 8^D
Probably the whole point for having mcmcsamp was to actually not have to
answer that question on ddfs...
Having said that, my approach in a particular project we are working on
is to do ddf=n-p... (n=length(y), p=ncols(x), x= full rank incidence
matrix of fixed effects). I'll run and take cover now... 8^D
Seriously, in our particular design such approximation does not seem to
be way off, as I checked it with mcmcsamp for a handful of transcripts.
And finally, a more existential question:
Paolo, do you really need a p-value?
Sometimes in gene expression analysis we only need to rank genes for
evidence of differential expression.
In those cases, you may well rank then using the t-statistic or the LRT
(not their p-values), especially if you have the same basic design
structure across all genes or transcripts (that you probably do).
For example if you want to do enrichment of GO terms after the mm
analysis you could do a Kolmogorov-Smirnov type of GSE using the
t-statistics (or LRTs) to rank genes.
Hope this makes sense?
Thanks!
JP
Paolo Innocenti wrote:
Dear Douglas, Rolf, Juan and list,
thank you very much for your replies.
I now got a good working model, and the use of refit and VarCorr will
definitely help.
I had a go with mcmcsamp(), and I must confirm that this approach is
not feasible, both computationally and because if you get a "false
convergence" for, say, 1 gene out of 20, it becomes impossible to go
back and fix all the errors.
So, the alternative approach seems more promising. If I understand
correctly, you suggest to calculate a p-value for random effects out
of the LRT (Likelihood ratio test), and use approximated DFs to
calculate standard p-values for the fixed effects.
I neved used this approach, so I appreciate if you can point me in the
right direction.
Random effects:
I'd need to compare the this three model:
m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m1b <- lmer(Y1 ~ sex + (1|line))
m1c <- lmer(Y1 ~ sex)
and get the effect of the interaction from m1a-m1b,
and the effect of "line" from m1b-m1c.
It that correct?
Can anyone point me to any kind of documentation/examples to sort out
the details?
Fixed effects:
I don't really know where to get the approximated degrees of freedom.
Can you point me to an example?
Thanks again for all the help. Eventually, I'll be really happy to
share my experience/code when everything is sorted, even if I doubt I
can add anything helpful.
Best,
paolo
PS. I don't really understand what you mean by
sobj<-summary(result)
what object is your "result" here?
Juan Pedro Steibel wrote:
Hello Paolo,
We are also starting to use lmer for gene expression analysis
(genetical genomics too) so here are my thoughts.
Having two equivalent parameterizations, I would go with the
computationally fastest one (a couple more miliseconds per model,
easily add up to many hours when analyzing highly dimensional
datasets). You can time the analysis for, say, 100 transcripts and go
from there.
~note: other users commented on your models not being equivalent~
For p-values:
You could use LRT to test for variance components.
This is standard practice in genetic epidemiology: fit a model with
and without the random effect in question, then compare the log
(residual) likelihood ratio to a chi-square statistic. FDR can be
used on top of that to account for multiple tests. Of course, now you
have to fit three models (one null models for each VC), so your cpu
time has just multiplied by almost 3. Definitely using refit and
update will help with compute time when having so many models.
I use sobj<-summary(result) as an intermediate step to get the info
(although this may add to the computational burden and other
suggestions you got may be faster and more efficient),
then ask for slots:
@REmat
@coefs
...for getting estimates of variance components and fixed effects.
@coefs gives you a t-statistic for fixed effects too... you could
take a stab at an approximated df method and compute an
(approximated) p-value.
I know that doing so can attract a lot of criticism in this list, but
when you are fitting a several million models (10000s of transcripts
and 1000s of genomic positions as in my case), the mcmc approximation
is (unfortunately) not computationally feasible.
Hope this helps!
JP
Paolo Innocenti wrote:
Dear Douglas and list,
I am thinking about fitting a mixed model for a microarray
experiment using lme4, since other specific software seems not
suitable for my purposes. I'll briefly describe the model and kindly
ask for suggestions on the model and the workflow I can use to get
useful results.
My response variable Y is gene expression levels for a given gene
(say g_i) from 120 samples.
The factor I want to include are:
Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.
I am interested in:
- if the gene is differentially expressed in the 2 sexes (effect of
sex), in the 15 lines (effect of line) and the interaction of the two.
- the variance component of line = how much of the variance is due
to the genotype
- the variance component of the interaction = the genetic variation
for sexual dimorphism.
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the
same residuals) but different parametrization (see self-contained
example below).
Now, in the light of my needs (see above), which model makes it
easier to extract the components I need? Also, do they make
different assumptions - as different levels of independency among
levels of random factors?
I will need to be able to extract the variance component values in
an iterative process (i have 18.000 genes): is VarCorr() the way to go?
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Last two question: what is the easier way to assess, in an iterative
process, normality of residuals, and what is a sensible way to
assess significant differential expression of genes (since I guess I
can't get p-values and then apply FDR correction?)
Thanks a lot for reading so far and I'll be grateful for any kind of
help.
paolo
Self-contained example:
Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Output:
m1
Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1
| sex:line) AIC BIC logLik deviance REMLdev
-91.13 -77.2 50.57 -111.1 -101.1
Random effects:
Groups Name Variance Std.Dev.
sex:line (Intercept) 0.0023237 0.048205
line (Intercept) 0.0169393 0.130151
Residual 0.0168238 0.129707
Number of obs: 120, groups: sex:line, 30; line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03955 289.72
sexM 0.52992 0.02951 17.96
Correlation of Fixed Effects:
(Intr)
sexM -0.373
m2
Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line
(Intercept) 0.0152993 0.123691 sexM
0.0046474 0.068172 0.194 Residual 0.0168238
0.129707 Number of obs: 120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52992 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
m3
Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line sexF 0.015299
0.12369 sexM 0.023227 0.15240 0.899
Residual 0.016824 0.12971 Number of obs: 120, groups:
line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52991 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
=============================
Juan Pedro Steibel
Assistant Professor
Statistical Genetics and Genomics
Department of Animal Science &
Department of Fisheries and Wildlife
Michigan State University
1205-I Anthony Hall
East Lansing, MI
48824 USA
Phone: 1-517-353-5102
E-mail: steibelj at msu.edu
Dear Doug,
Two issues have arisen in respect of this thread that I would like to
comment upon.
(1) In my previous posting I included the link to Doug's explanation
of the
lmer() syntax:
http://www.nabble.com/lme-nesting-interaction-advice-
td17131600i20.html#a17213604
I found this explanation incredibly useful, and apparently others
have as well.
(I was thanked off-list for posting this link.)
Could, perhaps, the content of this link, in appropriately edited
form, be
included in the documentation for lmer(), or perhaps in a vignette
(to which
the documentation might direct the reader)? To save on frustrating
searching
for those who are coming in to this area fresh.
(2) In one of his posts Juan Pedro Steibel wrote:
Note that m1c is actually a fixed effects model, and can not be fit
with
lmer, so you'll need to reconstruct the -2LL from another source.
He goes on to say:
But with a normal model that should not be a problem.
Hah! Not a problem for some, maybe. But for those of us whose
initially
few brain cells have been decimated by alcohol ( :-) ) it is a problem.
Might I make a ``feature request'' that it be made possible to fit
models
with fixed effects only in lmer()? (Basically for convenience of
testing
whether random effects are ``useful''.) I personally would also be
comforted
by being able to do things like
fit1 <- lmer(y ~ x, REML=TRUE)
fit2 <- lmer(y ~ x, REML=FALSE)
and compare the resulting estimates of sigma^2 (which presumably
should differ
by a factor of n/(n-2) if I'm understanding things correctly). Being
able to
do this would give me the comforting illusion ( :-) ) that I ***am***
understanding
things correctly. Or perhaps disabuse me of this illusion.
Would it be a Herculean task to adapt lmer() to have this capacity?
Naively I would
have thought that if a model with fixed and random effects can be
fitted, then
surely a (simpler) model with only fixed effects can be fitted. OTOH
the fixed effects
only scenario is in some sense ``on the boundary'', and boundary
cases can be vexatious.
Not knowing how lmer() works, I have no idea how hard or easy making
such a change
would be.
I would appreciate hearing your thoughts on this.
cheers,
Rolf
######################################################################
Attention:\ This e-mail message is privileged and confid...{{dropped:9}}
Hi all,
thanks again to all of you for the replies.
I took Fabian's suggestion (thanks for the effort you made to make your
point! amazing) but unfortunately for my purpose it requires to much
computation time (in terms of model to be fitted and evaluation time).
At the moment I am happy with
1-(0.5 + 0.5*pchisq(tempobj$Chisq[2], df=1))
for the p-values for the random effects.
For the p-values for fixed effects: yes, i do really need them. They are
not really important for THE specific gene, but I need to be able to
say: around 30% of the transcriptome is differentially expressed).
In my case, the effect I am testing is pretty big, my sample size is
pretty large (120 observations, 60 per group - M/F), completely balanced
and I can afford to be a bit conservative. That said, I found this link:
http://wiki.r-project.org/rwiki/doku.php?id=guides:lmer-tests
where Doug discuss about using a function hatTrace() and use it to
calculate the trace and then the degrees of freedom (denDF = n - "trace"
i think) and feed it to
1 - pf(F.ratio, numDF, ***)
It seems to be what I am looking for, but I can't find the function
hatTrace...
Does anyone know how I could estimate my denDF? Any suggestion is most
welcome (JP i have some problem in understanding your lingo... =)
Pinheiro and Bates says:
denDF = m_i - (m_(i-1) + p_i), i = 1,...,Q+1
but I don't really understand what it means.
Thanks again to everyone!
Best,
paolo
Juan Pedro Steibel wrote:
Hello paolo,
sobj<-summary(result), where result is m1a, m1b or m1c.
Not that m1c is actually a fixed effects model, and can not be fit with
lmer, so you'll need to reconstruct the -2LL from another source. But
with a normal model that should not be a problem.
Fabian Scheipl answered in a post that the LRT chisquare(1) was wrong
and said that al least a mixture of chisquares should be used. I did not
explained myself very well, but I meant to recommend the mixture of
chisquare and pointmass when testing a single variance component.
Fabian's other method could be used, provided it will not take eons to
get the results for a high dimensional dataset.
But keep reading and you will find a good reason to hammer on me for
recomendations on df and computing p-values 8^D
Probably the whole point for having mcmcsamp was to actually not have to
answer that question on ddfs...
Having said that, my approach in a particular project we are working on
is to do ddf=n-p... (n=length(y), p=ncols(x), x= full rank incidence
matrix of fixed effects). I'll run and take cover now... 8^D
Seriously, in our particular design such approximation does not seem to
be way off, as I checked it with mcmcsamp for a handful of transcripts.
And finally, a more existential question:
Paolo, do you really need a p-value?
Sometimes in gene expression analysis we only need to rank genes for
evidence of differential expression.
In those cases, you may well rank then using the t-statistic or the LRT
(not their p-values), especially if you have the same basic design
structure across all genes or transcripts (that you probably do).
For example if you want to do enrichment of GO terms after the mm
analysis you could do a Kolmogorov-Smirnov type of GSE using the
t-statistics (or LRTs) to rank genes.
Hope this makes sense?
Thanks!
JP
Paolo Innocenti wrote:
Dear Douglas, Rolf, Juan and list,
thank you very much for your replies.
I now got a good working model, and the use of refit and VarCorr will
definitely help.
I had a go with mcmcsamp(), and I must confirm that this approach is
not feasible, both computationally and because if you get a "false
convergence" for, say, 1 gene out of 20, it becomes impossible to go
back and fix all the errors.
So, the alternative approach seems more promising. If I understand
correctly, you suggest to calculate a p-value for random effects out
of the LRT (Likelihood ratio test), and use approximated DFs to
calculate standard p-values for the fixed effects.
I neved used this approach, so I appreciate if you can point me in the
right direction.
Random effects:
I'd need to compare the this three model:
m1a <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m1b <- lmer(Y1 ~ sex + (1|line))
m1c <- lmer(Y1 ~ sex)
and get the effect of the interaction from m1a-m1b,
and the effect of "line" from m1b-m1c.
It that correct?
Can anyone point me to any kind of documentation/examples to sort out
the details?
Fixed effects:
I don't really know where to get the approximated degrees of freedom.
Can you point me to an example?
Thanks again for all the help. Eventually, I'll be really happy to
share my experience/code when everything is sorted, even if I doubt I
can add anything helpful.
Best,
paolo
PS. I don't really understand what you mean by
sobj<-summary(result)
what object is your "result" here?
Juan Pedro Steibel wrote:
Hello Paolo,
We are also starting to use lmer for gene expression analysis
(genetical genomics too) so here are my thoughts.
Having two equivalent parameterizations, I would go with the
computationally fastest one (a couple more miliseconds per model,
easily add up to many hours when analyzing highly dimensional
datasets). You can time the analysis for, say, 100 transcripts and go
from there.
~note: other users commented on your models not being equivalent~
For p-values:
You could use LRT to test for variance components.
This is standard practice in genetic epidemiology: fit a model with
and without the random effect in question, then compare the log
(residual) likelihood ratio to a chi-square statistic. FDR can be
used on top of that to account for multiple tests. Of course, now you
have to fit three models (one null models for each VC), so your cpu
time has just multiplied by almost 3. Definitely using refit and
update will help with compute time when having so many models.
I use sobj<-summary(result) as an intermediate step to get the info
(although this may add to the computational burden and other
suggestions you got may be faster and more efficient),
then ask for slots:
@REmat
@coefs
...for getting estimates of variance components and fixed effects.
@coefs gives you a t-statistic for fixed effects too... you could
take a stab at an approximated df method and compute an
(approximated) p-value.
I know that doing so can attract a lot of criticism in this list, but
when you are fitting a several million models (10000s of transcripts
and 1000s of genomic positions as in my case), the mcmc approximation
is (unfortunately) not computationally feasible.
Hope this helps!
JP
Paolo Innocenti wrote:
Dear Douglas and list,
I am thinking about fitting a mixed model for a microarray
experiment using lme4, since other specific software seems not
suitable for my purposes. I'll briefly describe the model and kindly
ask for suggestions on the model and the workflow I can use to get
useful results.
My response variable Y is gene expression levels for a given gene
(say g_i) from 120 samples.
The factor I want to include are:
Sex: fixed, two levels, M/F.
Line: 15 randomly picked genotypes from a large outbred population.
I am interested in:
- if the gene is differentially expressed in the 2 sexes (effect of
sex), in the 15 lines (effect of line) and the interaction of the two.
- the variance component of line = how much of the variance is due
to the genotype
- the variance component of the interaction = the genetic variation
for sexual dimorphism.
Reading a bit of this mailing list, I came up with these three models:
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
or
m2 <- lmer(Y1 ~ sex + (sex|line))
or
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
Which should all be the same model (and indeed they have all the
same residuals) but different parametrization (see self-contained
example below).
Now, in the light of my needs (see above), which model makes it
easier to extract the components I need? Also, do they make
different assumptions - as different levels of independency among
levels of random factors?
I will need to be able to extract the variance component values in
an iterative process (i have 18.000 genes): is VarCorr() the way to go?
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Last two question: what is the easier way to assess, in an iterative
process, normality of residuals, and what is a sensible way to
assess significant differential expression of genes (since I guess I
can't get p-values and then apply FDR correction?)
Thanks a lot for reading so far and I'll be grateful for any kind of
help.
paolo
Self-contained example:
Y1 <- as.numeric(
c("11.6625","11.3243","11.7819","11.5032","11.7578","11.9379","11.8491",
"11.9035","11.2042","11.0344","11.5137","11.1995","11.6327","11.7392",
"11.9869","11.6955","11.5631","11.7159","11.8435","11.5814","12.0756",
"12.3428","12.3342","11.9883","11.6067","11.6102","11.6517","11.4444",
"11.9567","12.0478","11.9683","11.8207","11.5860","11.6028","11.6522",
"11.6775","12.3570","12.2266","12.2444","12.1369","11.2573","11.4577",
"11.4432","11.2994","11.8486","11.9258","11.9864","11.9964","11.2806",
"11.2527","11.3672","11.0791","11.9501","11.7223","11.9825","11.8114",
"11.6116","11.4284","11.3731","11.6942","12.2153","12.0101","12.2150",
"12.1932","11.5617","11.3761","11.4813","11.7503","11.9889","12.1530",
"12.3299","12.4436","11.4566","11.4614","11.5527","11.3513","11.9764",
"11.8810","12.0999","11.9083","11.4870","11.6764","11.3973","11.4507",
"12.1141","11.9906","12.1118","11.9728","11.3382","11.4146","11.4590",
"11.2527","12.1101","12.0448","12.2191","11.8317","11.3982","11.3555",
"11.3897","11.7731","11.9749","11.8666","12.1984","12.0350","11.4642",
"11.4509","11.5552","11.4346","12.0714","11.7136","11.9019","11.8158",
"11.3132","11.3121","11.1612","11.2073","11.6658","11.7879","11.7847",
"11.5300"))
sex <- factor(rep(c("F","M"), 15, each=4))
line <- factor(rep(1:15, each=8))
m1 <- lmer(Y1 ~ sex + (1|line) + (1|sex:line))
m2 <- lmer(Y1 ~ sex + (sex|line))
m3 <- lmer(Y1 ~ sex + (0 + sex|line))
VarCorr(m1)$'sex:line'[1]
VarCorr(m1)$'line'[1]
Output:
m1
Linear mixed model fit by REML Formula: Y1 ~ sex + (1 | line) + (1
| sex:line) AIC BIC logLik deviance REMLdev
-91.13 -77.2 50.57 -111.1 -101.1
Random effects:
Groups Name Variance Std.Dev.
sex:line (Intercept) 0.0023237 0.048205
line (Intercept) 0.0169393 0.130151
Residual 0.0168238 0.129707
Number of obs: 120, groups: sex:line, 30; line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03955 289.72
sexM 0.52992 0.02951 17.96
Correlation of Fixed Effects:
(Intr)
sexM -0.373
m2
Linear mixed model fit by REML Formula: Y1 ~ sex + (sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line
(Intercept) 0.0152993 0.123691 sexM
0.0046474 0.068172 0.194 Residual 0.0168238
0.129707 Number of obs: 120, groups: line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52992 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
m3
Linear mixed model fit by REML Formula: Y1 ~ sex + (0 + sex | line)
AIC BIC logLik deviance REMLdev
-90 -73.27 51 -112.1 -102
Random effects:
Groups Name Variance Std.Dev. Corr line sexF 0.015299
0.12369 sexM 0.023227 0.15240 0.899
Residual 0.016824 0.12971 Number of obs: 120, groups:
line, 15
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.45977 0.03606 317.8
sexM 0.52991 0.02951 18.0
Correlation of Fixed Effects:
(Intr)
sexM -0.161
Paolo Innocenti
Department of Animal Ecology, EBC
Uppsala University
Norbyv?gen 18D
75236 Uppsala, Sweden