On Jan 2, 2015, at 12:16 PM, Vincenzo Lagani <vlagani at ics.forth.gr> wrote:
Dear Peter,
thanks for your suggestion. Unfortunately, even on my most powerful machine I could not fit the models you suggested: R returns a memory error, due to the inability of allocating ~60GB of memory (more than 18000 probesets in my data).
Moreover, I am not totally sure if the model you suggested is suitable for my application. While I fully agree with you that it would be interesting to observe what happens when the variability due to the probesets is explicitly modeled, I must say that I am not interested in the specific effect of each single gene. Moreover, the set of genes I am using can be considered as a (more or less) random sample from the set of all mouse genes. Thus, it seems to me that the term "probesets" should be definitely modeled as a random term.
Thanks again and kind regards,
Vincenzo
On 1/2/2015 4:34 PM, Peter Claussen wrote:
Vincenzo,
If I?m reading this correctly, I don?t think you?re comparing the same models, lm vs lmer
Using lm, you have two models
values ~ week + genotype
values ~ week * genotype
Using lmer, you have two different models.
week + genotype + (week | probesets)
week * genotype + (week | probesets)
If you were to include (week:probesets) in your lm models, and compare
values ~ week + genotype + week:probesets
values ~ week * genotype + week:probesets
would that clarify the confusion about the different results?
Peter
On Dec 26, 2014, at 1:11 PM, Vincenzo Lagani <vlagani at ics.forth.gr> wrote:
Dear all,
I am modelling gene expression data with mixed models using lme4. My
goal is to assess whether gene expression globally decreases or
increases with time.
Specifically, the data consist in whole expression profiles measured at
five different time points in two different strains of mice. For each
time point and each strain there are three replicates. The data are not
longitudinal, i.e., different mice are used at each time point. We had
to remove one profile from a time point because it was not matching our
quality criteria, so the data became not properly balanced.
On these data I am fitting the following model:
lme4Model.full <- lmer(values ~ week * genotype + (week | probesets),
data = dataset, REML = FALSE)
where 'values' stands for the gene expression, 'week' is a scaled
numeric reporting the age of the mice in weeks, and 'genotype' is a
factor with two levels representing the two different genetic
backgrounds. Each single gene is let having its own random intercept and
slope.
What puzzles me is that when I compare this model with its simpler,
non-mixed version:
lmModel.full <- lm(values ~ week * genotype, data = dataset)
I obtain the same coefficients but different standard errors (see
below). Furthermore, while the interaction coefficient is not
significant in the simple linear model, it becomes highly significant in
the mixed model, at least according to these ANOVA tests:
lmModel <- lm(values ~ week + genotype, data = dataset)
lme4Model <- lmer(values ~ week + genotype + (week | probesets), data =
dataset, REML = FALSE)
anova(lmModel.full, lmModel)
Analysis of Variance Table
Model 1: values ~ week * genotype
Model 2: values ~ week + genotype
Res.Df RSS Df Sum of Sq F Pr(>F)
1 414341 2162664
2 414342 2162666 -1 -2.4697 0.4732 0.4915
anova(lme4Model.full, lme4Model)
Data: dataset
Models:
lme4Model: values ~ week + genotype + (week | probesets)
lme4Model.full: values ~ week * genotype + (week | probesets)
Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
lme4Model 7 72376 72452 -36181 72362
lme4Model.full 8 72325 72413 -36155 72309 52.472 1 4.364e-13 ***
Assessing whether the interaction coefficient is significant is actually
the aim of my study, and having two totally different answers confuses
me. My understanding is that the mixed model better catches the variance
structure of the data and thus it is able to better estimate the
standard errors and p-values of the coefficients. Is this correct? In
other words, can I confidently claim that the p-values obtained from the
mixed models are "the correct ones" and that the interaction term is
actually significant?
My apologies if this issue has been already posted on this list. Despite
having seen multiple posts here on similar topics, I have not been able
to find an answer to these questions.
Thanks in advance for your help. Any suggestion is very welcome.
Regards,
Vincenzo
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: values ~ week * genotype + (week | probesets)
Data: dataset
AIC BIC logLik deviance df.resid
72325.3 72412.8 -36154.7 72309.3 414337
Scaled residuals:
Min 1Q Median 3Q Max
-13.2622 -0.5246 0.0128 0.5270 23.3456
Random effects:
Groups Name Variance Std.Dev. Corr
probesets (Intercept) 5.167338 2.27318
week 0.005029 0.07092 -0.23
Residual 0.047063 0.21694
Number of obs: 414345, groups: probesets, 18015
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.5727983 0.0169414 388.0
week -0.0151336 0.0006823 -22.2
genotypeCSB 0.2405300 0.0007121 337.8
week:genotypeCSB 0.0050430 0.0006962 7.2
Correlation of Fixed Effects:
(Intr) week gntCSB
week -0.177
genotypeCSB -0.015 -0.028
wk:gntypCSB -0.001 -0.392 -0.054
Call:
lm(formula = values ~ week * genotype, data = dataset)
Residuals:
Min 1Q Median 3Q Max
-5.7560 -1.9881 0.0083 1.6823 7.6407
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.572798 0.004407 1491.384 < 2e-16 ***
week -0.015134 0.004547 -3.329 0.000873 ***
genotypeCSB 0.240530 0.007500 32.072 < 2e-16 ***
week:genotypeCSB 0.005043 0.007331 0.688 0.491537
---
Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1
Residual standard error: 2.285 on 414341 degrees of freedom
Multiple R-squared: 0.002491, Adjusted R-squared: 0.002484
F-statistic: 344.9 on 3 and 414341 DF, p-value: < 2.2e-16
[[alternative HTML version deleted]]