standard error and statistical significance in lmer versus lm
Dear prof. Bolker, thanks a lot for your answer. I have not been able to observe any definitive pattern by following your suggestion: One thing that might be interesting would be plotting the residuals from a model of only probeset variation (i.e., values ~ (week|probeset)) and seeing how the week*genotype pattern was (hopefully) clarified. Could you please be a little bit more specific about how the residuals from the model you suggest should be plotted? Here the summary of the model, for yours and other readers perusal:
lme4Model.random <- lmer(values ~ (week|probesets), data = dataset, REML = FALSE);
summary(lme4Model.random)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: values ~ (week | probesets)
Data: dataset
AIC BIC logLik deviance df.resid
172535.9 172590.6 -86263.0 172525.9 414340
Scaled residuals:
Min 1Q Median 3Q Max
-11.2119 -0.5763 -0.0112 0.5812 21.3413
Random effects:
Groups Name Variance Std.Dev. Corr
probesets (Intercept) 5.166933 2.27309
week 0.004412 0.06642 -0.25
Residual 0.061336 0.24766
Number of obs: 414345, groups: probesets, 18015
Fixed effects:
Estimate Std. Error t value
(Intercept) 6.64697 0.01662 400
Thanks again,
Vincenzo
On 12/27/2014 4:33 AM, Ben Bolker wrote:
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On 14-12-26 02:11 PM, Vincenzo Lagani wrote:
Dear all, I am modelling gene expression data with mixed models using lme4. My goal is to assess whether gene expression globally decreases or increases with time. Specifically, the data consist in whole expression profiles measured at five different time points in two different strains of mice. For each time point and each strain there are three replicates. The data are not longitudinal, i.e., different mice are used at each time point. We had to remove one profile from a time point because it was not matching our quality criteria, so the data became not properly balanced. On these data I am fitting the following model: lme4Model.full <- lmer(values ~ week * genotype + (week | probesets), data = dataset, REML = FALSE) where 'values' stands for the gene expression, 'week' is a scaled numeric reporting the age of the mice in weeks, and 'genotype' is a factor with two levels representing the two different genetic backgrounds. Each single gene is let having its own random intercept and slope. What puzzles me is that when I compare this model with its simpler, non-mixed version: lmModel.full <- lm(values ~ week * genotype, data = dataset) I obtain the same coefficients but different standard errors (see below). Furthermore, while the interaction coefficient is not significant in the simple linear model, it becomes highly significant in the mixed model, at least according to these ANOVA tests: lmModel <- lm(values ~ week + genotype, data = dataset) lme4Model <- lmer(values ~ week + genotype + (week | probesets), data = dataset, REML = FALSE)
anova(lmModel.full, lmModel)
Analysis of Variance Table Model 1: values ~ week * genotype Model 2: values ~ week + genotype Res.Df RSS Df Sum of Sq F Pr(>F) 1 414341 2162664 2 414342 2162666 -1 -2.4697 0.4732 0.4915
anova(lme4Model.full, lme4Model)
Data: dataset Models: lme4Model: values ~ week + genotype + (week | probesets) lme4Model.full: values ~ week * genotype + (week | probesets) Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq) lme4Model 7 72376 72452 -36181 72362 lme4Model.full 8 72325 72413 -36155 72309 52.472 1 4.364e-13 *** Assessing whether the interaction coefficient is significant is actually the aim of my study, and having two totally different answers confuses me. My understanding is that the mixed model better catches the variance structure of the data and thus it is able to better estimate the standard errors and p-values of the coefficients. Is this correct? In other words, can I confidently claim that the p-values obtained from the mixed models are "the correct ones" and that the interaction term is actually significant?
As far as I can tell from what you've posted, the result given by lme4 is indeed (or certainly could be/I have no reason to believe it is not) correct. I'm not sure of the best way to explain the result to you, though. Adding the variation among probesets to the model does indeed explain a lot of variation that would otherwise end up being modeled as error and filtering into the standard errors. It would be nice to understand the differences by visualizing the different models, but with such a large data set it could be challenging ... One thing that might be interesting would be plotting the residuals from a model of only probeset variation (i.e., values ~ (week|probeset)) and seeing how the week*genotype pattern was (hopefully) clarified.
My apologies if this issue has been already posted on this list. Despite having seen multiple posts here on similar topics, I have not been able to find an answer to these questions. Thanks in advance for your help. Any suggestion is very welcome. Regards, Vincenzo
summary(lme4Model.full)
Linear mixed model fit by maximum likelihood ['lmerMod'] Formula: values ~ week * genotype + (week | probesets) Data: dataset AIC BIC logLik deviance df.resid 72325.3 72412.8 -36154.7 72309.3 414337 Scaled residuals: Min 1Q Median 3Q Max -13.2622 -0.5246 0.0128 0.5270 23.3456 Random effects: Groups Name Variance Std.Dev. Corr probesets (Intercept) 5.167338 2.27318 week 0.005029 0.07092 -0.23 Residual 0.047063 0.21694 Number of obs: 414345, groups: probesets, 18015 Fixed effects: Estimate Std. Error t value (Intercept) 6.5727983 0.0169414 388.0 week -0.0151336 0.0006823 -22.2 genotypeCSB 0.2405300 0.0007121 337.8 week:genotypeCSB 0.0050430 0.0006962 7.2 Correlation of Fixed Effects: (Intr) week gntCSB week -0.177 genotypeCSB -0.015 -0.028 wk:gntypCSB -0.001 -0.392 -0.054
summary(lmModel.full)
Call: lm(formula = values ~ week * genotype, data = dataset) Residuals: Min 1Q Median 3Q Max -5.7560 -1.9881 0.0083 1.6823 7.6407 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 6.572798 0.004407 1491.384 < 2e-16 *** week -0.015134 0.004547 -3.329 0.000873 *** genotypeCSB 0.240530 0.007500 32.072 < 2e-16 *** week:genotypeCSB 0.005043 0.007331 0.688 0.491537 --- Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1 Residual standard error: 2.285 on 414341 degrees of freedom Multiple R-squared: 0.002491, Adjusted R-squared: 0.002484 F-statistic: 344.9 on 3 and 414341 DF, p-value: < 2.2e-16
-----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) iQEcBAEBAgAGBQJUniiHAAoJEOCV5YRblxUH5GEH/R57rvnCi7OYj0HEV+1dQ9sv awXYPd9/q7t2ZPVyrJ8OdVL2+ntVe7KYKFy28D2uRa5eyyH6/jaoy9nSlGI4Gvd0 dslGQYlIpSw9LmOHY1BPcQYZuqEoJoHlbonbX+00AwgANdanP0CpSWNzNVRmbcUL ftPAErAaJecb7yu56+I2Yz5ugN3NYrqNdWvTV/HYxt5emjx45gQdQd4cQRTfKw3n JkcM8DMRrOjvN6w2H8Pgps/yE3W+nx5VsgBaSdmagwTIfke6aZ90+55jMhjbDXNJ xsbFzPc3CUA5SUO7qQHwoteMz8QlfRp3D8SgfmdfaPixskWCdMHjxAws+0ePAhw= =0jkf -----END PGP SIGNATURE-----
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models