Dear LMM experts:
I am fairly new to LMMs, and I found the following bewildering while
running diagnostics on my fitted model: my conditional residuals correlate
highly with the fitted values.
I have a dataset of multiple families, each with 1-4 siblings. I am trying
to regress Y on explanatory variables (EVs) including Drink, Gender, and
Age, with a random intercept for family. This is the model I used:
model <- lmer(Y ~ Drink * Gender + Age + (1 | Family_ID), data, REML = FALSE)
After fitting the model, I used
plot(model)
to see the relationship between the conditional residuals and the fitted
values. I expected them to be uncorrelated and homoscedastic. Yet to my
surprise there is a high correlation (~0.5) between the residuals and the
fitted values (see here <http://imgur.com/pPsG4aR>). I know from GLMs that
this usually suggests a nonlinear relationship between the EVs and the DV.
I read some online posts (post1
<http://stats.stackexchange.com/questions/43566/strange-pattern-in-residual-plot-from-mixed-effect-model>
post2
<http://stats.stackexchange.com/questions/168179/correlation-between-standardized-residuals-and-fitted-values-in-a-linear-mixed-e/168210#168210>)
that suggest this can result from a poor model fit. So I tried a few
different models, including: 1) log-transforming Drink, which is originally
positively skewed; 2) adding random slopes for Drink, Age, etc. None of
these changes made a substantial difference to the correlation between
the residuals and the fitted values.
Some other info:
1) my overall model fit is not poor, as indicated by the correlation
between the fitted values and Y, which is around 0.8;
2) most variables in my model have a normal, or at least symmetrical,
distribution;
3) the conditional residuals are normally distributed, as shown in Q-Q plots;
4) the conditional residuals are not correlated with any fixed-effect
covariate, such as Drink or Age.
I have two guesses as to what is going on:
1) maybe the fact that each family is a different size actually violates
assumptions of the model?
2) or maybe there is something wrong with estimation of the random effect
(family intercept)?
I'd really appreciate your insights as to what is going on here and whether
there are any problems with my model.
Thank you very much,
Cherry
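A minimal sketch of the diagnostics described in this post, assuming the lme4 package is installed. The real data are not available, so the data frame dat below is simulated: the family sizes, covariate distributions, age range and coefficients are all hypothetical placeholders, and only the model formula is taken from the post.

```r
library(lme4)

set.seed(42)  # assumed seed, for reproducibility only

# Hypothetical data mimicking the design: 189 families with 1-4 siblings each
n_fam <- 189
size  <- sample(1:4, n_fam, replace = TRUE)
n_obs <- sum(size)
dat <- data.frame(
  Family_ID = factor(rep(seq_len(n_fam), times = size)),
  Drink     = rnorm(n_obs),
  Gender    = factor(sample(c("F", "M"), n_obs, replace = TRUE)),
  Age       = runif(n_obs, 22, 36)
)
fam_effect <- rnorm(n_fam, sd = 0.6)  # hypothetical family intercepts
dat$Y <- 1 + 0.15 * dat$Drink - 0.03 * dat$Age +
  fam_effect[as.integer(dat$Family_ID)] + rnorm(n_obs, sd = 0.8)

# Same formula as in the post, fitted by maximum likelihood
model <- lmer(Y ~ Drink * Gender + Age + (1 | Family_ID),
              data = dat, REML = FALSE)

fit <- fitted(model)  # conditional fitted values (include predicted family intercepts)
res <- resid(model)   # conditional residuals: observed minus conditional fitted

cor(fit, res)         # correlation between residuals and fitted values
cor(dat$Drink, res)   # residuals against a covariate
cor(dat$Age, res)

qqnorm(res); qqline(res)   # normality check of the conditional residuals
plot(dat$Drink, res, xlab = "Drink", ylab = "Conditional residuals")
plot(model)                # lme4's default residuals-vs-fitted plot
```

With the real data, only the lines from the lmer() call onward would be needed, with dat replaced by the actual data frame.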
LMM diagnostics: conditional residuals correlated highly with fitted values
8 messages · Yizhou Ma, Thierry Onkelinx, Ulf Köther
Dear Cherry,
Please don't post in HTML. Have a look at the posting guide.
You'll need to provide more information. What is the class of each variable (continuous, count, presence/absence, factor, ...)? What is the output of summary(model)?
Best regards,
ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
(In reply to Yizhou Ma <maxxx848 at umn.edu>, 2015-10-06 17:15 GMT+02:00.)
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Hi Thierry,
Thank you for your reply and sorry for the HTML thing. Below is my
summary(model) output.
Y, Drink, and Age are continuous variables
Gender is F & M.
Family_ID is a factor.
Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: Y ~ Drink * Gender + Age + (1 | Family_ID)
   Data: data
     AIC      BIC   logLik deviance df.resid
  1046.4   1074.0   -516.2   1032.4      372
Scaled residuals:
     Min       1Q   Median       3Q      Max
-2.67228 -0.56085 -0.02968  0.66166  2.91452
Random effects:
 Groups    Name        Variance Std.Dev.
 Family_ID (Intercept) 0.3550   0.5958
 Residual              0.6162   0.7850
Number of obs: 379, groups:  Family_ID, 189
Fixed effects:
               Estimate Std. Error t value
(Intercept)     1.10309    0.43921   2.511
Drink           0.16425    0.08031   2.045
Gender.M       -0.19364    0.10874  -1.781
Age            -0.03377    0.01489  -2.268
Drink:Gender.M -0.13647    0.10681  -1.278
Correlation of Fixed Effects:
           (Intr) Drnk   Gndr.M Age
Drink      -0.098
Gender.M   -0.040 -0.249
Age        -0.985  0.158 -0.054
Drnk:G.M    0.042 -0.737 -0.021 -0.085
Thank you very much,
Cherry
(In reply to Thierry Onkelinx <thierry.onkelinx at inbo.be>, Wed, Oct 7, 2015 at 5:14 AM.)
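As a rough check on the variance components reported in the summary output, the sketch below simulates a correctly specified random-intercept model with no covariates and computes the conditional residuals by hand. The Family_ID and Residual variances and the number of families come from the output; the family-size probabilities, the true mean of zero, the seed, and the use of the true variances in place of estimates are simplifying assumptions, not details from the thread. The predicted family intercepts are shrunk toward zero, so part of each family's deviation stays in the residuals, and residuals and fitted values end up positively correlated even though the model is correct.

```r
set.seed(1)          # assumed seed, for reproducibility only
tau2   <- 0.3550     # Family_ID (Intercept) variance from the summary output
sigma2 <- 0.6162     # Residual variance from the summary output
n_fam  <- 189        # number of families from the summary output

# Assumed family sizes: 1-4 siblings, averaging about 2 per family
size <- sample(1:4, n_fam, replace = TRUE, prob = c(0.30, 0.45, 0.20, 0.05))
fam  <- rep(seq_len(n_fam), times = size)

# Simulate Y = family intercept + noise (true mean 0, no covariates)
u <- rnorm(n_fam, sd = sqrt(tau2))
y <- u[fam] + rnorm(length(fam), sd = sqrt(sigma2))

# Predicted family intercept: the family mean shrunk by
# lambda = n * tau2 / (n * tau2 + sigma2), where n is the family size
ybar   <- as.vector(tapply(y, fam, mean))
lambda <- size * tau2 / (size * tau2 + sigma2)
u_hat  <- lambda * ybar

fit <- u_hat[fam]    # conditional fitted values
res <- y - fit       # conditional residuals

r <- cor(fit, res)
r
plot(fit, res, xlab = "Fitted values", ylab = "Conditional residuals")
```

With these variance components the shrinkage factor for a two-sibling family is about 0.54, so a little under half of that family's deviation from the overall mean is left in the residuals. Smaller families are shrunk more strongly.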
Can you elaborate on what Y is? Does it have a lower boundary? And if so, do you have observations near that boundary? E.g. Y must be non-negative and the dataset contains observations close to 0. A density plot would be useful.
ir. Thierry Onkelinx
(In reply to Yizhou Ma <maxxx848 at umn.edu>, 2015-10-07 17:09 GMT+02:00.)
Y is a brain measure that has been standardized. A histogram of Y is here: http://imgur.com/Um8yyuu
I am confused about the "Y must be non-negative and the dataset contains observations close to 0" part. Is that a requirement for Y? If so, then my model could be wrong.
(In reply to Thierry Onkelinx <thierry.onkelinx at inbo.be>, Wed, Oct 7, 2015 at 10:15 AM.)
My example is not a requirement of an LMM but rather an example of a distribution of a variable that can cause trouble with an LMM. Think of an area. An area cannot be negative. This can cause artefacts in the residuals when you have lots of values near zero. Have a look at this example.
n <- 200
dataset <- data.frame(
  X = runif(n)
)
dataset$eta <- -.1 + 3 * dataset$X
dataset$Y <- rpois(n, lambda = exp(dataset$eta))
# wrong analysis for this kind of data, here just an illustration of the problem
model <- lm(Y ~ X, data = dataset)
plot(fitted(model), resid(model))
But this doesn't seem to be the problem in your case. I would recommend that you check whether there are patterns in the residuals when you plot them against the covariates. Maybe you are missing an interaction or even an important covariate.
Best regards,
ir. Thierry Onkelinx
(In reply to Yizhou Ma <maxxx848 at umn.edu>, 2015-10-07 17:29 GMT+02:00.)
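For comparison, here is a sketch of the same simulation analysed both with the linear model from the example and with a Poisson GLM, which matches how Y was generated. The seed, the object names fit_lm and fit_glm, and the split of the fitted values at their median are assumptions added for illustration; they are not from the thread.

```r
set.seed(123)  # assumed seed, for reproducibility only
n <- 200
dataset <- data.frame(X = runif(n))
dataset$eta <- -0.1 + 3 * dataset$X
dataset$Y <- rpois(n, lambda = exp(dataset$eta))

# Linear model: the deliberately wrong analysis from the example
fit_lm <- lm(Y ~ X, data = dataset)
# Poisson GLM with log link: matches the simulation
fit_glm <- glm(Y ~ X, family = poisson, data = dataset)

# Residual variance of the linear model in the lower (FALSE) and
# upper (TRUE) half of its fitted values
upper <- fitted(fit_lm) > median(fitted(fit_lm))
spread_lm <- tapply(resid(fit_lm), upper, var)

# Pearson residuals of the GLM are scaled by the fitted standard deviation
spread_glm <- tapply(resid(fit_glm, type = "pearson"), upper, var)

spread_lm
spread_glm
coef(fit_glm)  # to be compared with the simulated values -0.1 and 3

op <- par(mfrow = c(1, 2))
plot(fitted(fit_lm), resid(fit_lm), main = "lm")
plot(fitted(fit_glm), resid(fit_glm, type = "pearson"), main = "Poisson GLM")
par(op)
```

Comparing spread_lm between the two halves shows the funnel shape in the linear-model residuals as a number. The Pearson residuals of the GLM are expected to have a more even spread.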
Hi Thierry, Thank you for clarifying. I agree that high skewness can lead to nonlinear relationship which can not be properly modeled in linear models. I have plotted the residuals against all my fixed factors and I cannot find any nonlinear relationship. It is possible that I am missing an important covariate though. Thanks a lot, Cherry On Wed, Oct 7, 2015 at 10:54 AM, Thierry Onkelinx
<thierry.onkelinx at inbo.be> wrote:
My example is not a requirement of a LMM but rather an example of a distribution of a variable which can cause troubles with a LMM. Think of an area. An area cannot be negative. This can cause artefacts into the residuals when you have lots of values near zero. Have a look at this example. n <- 200 dataset <- data.frame( X = runif(n) ) dataset$eta <- -.1 + 3 * dataset$X dataset$Y <- rpois(n, lambda = exp(dataset$eta)) model <- lm(Y~ X, data = dataset) #wrong analysis for this kind of data, here just an illustration of the problem plot(fitted(model), resid(model)) But this doesn't seems to be the problem in your case. I would recommend that you see if there are patterns in the residuals when you plot them against the covariates. Maybe you are missing an interaction or even an important covariate. Best regards, ir. Thierry Onkelinx Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance Kliniekstraat 25 1070 Anderlecht Belgium To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey 2015-10-07 17:29 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:
Y is a brain measure that has been standardized. A histogram of Y is here: http://imgur.com/Um8yyuu I am confused about the "Y must be non-negative and the dataset contains observations close to 0" part. Is that the requirements for Y? Is so, then my model could be wrong. On Wed, Oct 7, 2015 at 10:15 AM, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
Can you elaborate on what Y is? Does it has a lower boundary? And if so, do you have observations near that boundary? E.g. Y must be non-negative and the dataset contains observations close to 0. A densityplot would be useful. ir. Thierry Onkelinx Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance Kliniekstraat 25 1070 Anderlecht Belgium To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey 2015-10-07 17:09 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:
Hi Thierry,
Thank you for your reply and sorry for the HTML thing. Below is my
summary(model) output.
Y, Drink, and Age are continuous variables
Gender is F & M.
Family_ID is a factor.
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: Y ~ Drink * Gender + Age + (1 | Family_ID)
   Data: data

     AIC      BIC   logLik deviance df.resid
  1046.4   1074.0   -516.2   1032.4      372

Scaled residuals:
     Min       1Q   Median       3Q      Max
-2.67228 -0.56085 -0.02968  0.66166  2.91452

Random effects:
 Groups    Name        Variance Std.Dev.
 Family_ID (Intercept) 0.3550   0.5958
 Residual              0.6162   0.7850
Number of obs: 379, groups: Family_ID, 189

Fixed effects:
               Estimate Std. Error t value
(Intercept)     1.10309    0.43921   2.511
Drink           0.16425    0.08031   2.045
Gender.M       -0.19364    0.10874  -1.781
Age            -0.03377    0.01489  -2.268
Drink:Gender.M -0.13647    0.10681  -1.278

Correlation of Fixed Effects:
         (Intr) Drnk   Gndr.M Age
Drink    -0.098
Gender.M -0.040 -0.249
Age      -0.985  0.158 -0.054
Drnk:G.M  0.042 -0.737 -0.021 -0.085
Thank you very much,
Cherry
On Wed, Oct 7, 2015 at 5:14 AM, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:

Dear Cherry,

Please don't post in HTML. Have a look at the posting guide.

You'll need to provide more information. What is the class of each variable (continuous, count, presence/absence, factor, ...)? What is the output of summary(model)?

Best regards,
ir. Thierry Onkelinx

2015-10-06 17:15 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:
Dear LMM experts:

I am pretty new to using LMM and I have found the following situation bewildering as I was trying to do diagnostics with my fitted model: my conditional residuals correlate highly with the fitted values.

I have a dataset with multiple families, each with 1-4 siblings. I am trying to regress Y onto EVs including Drink, Gender, & Age, while using a random intercept for family. This is the model I used:

model <- lmer(Y ~ Drink * Gender + Age + (1 | Family_ID), data, REML = FALSE)

After fitting the model, I used

plot(model)

to see the relationship between conditional residuals and fitted values. I expected them to be uncorrelated and I expected to see homoscedasticity. Yet to my surprise there is a high correlation (~0.5) between the residuals and the fitted values (see here <http://imgur.com/pPsG4aR>). I know from GLM that this usually suggests nonlinear relationships between the EVs and the DV.

I read some online posts (post1
<http://stats.stackexchange.com/questions/43566/strange-pattern-in-residual-plot-from-mixed-effect-model>
post2
<http://stats.stackexchange.com/questions/168179/correlation-between-standardized-residuals-and-fitted-values-in-a-linear-mixed-e/168210#168210>)
that suggest this can result from a poor model fit. So I tried a few different models, including: 1) log-transforming Drink, which is originally positively skewed; 2) adding random slopes for Drink, Age, etc. None of these changes has led to a substantial difference in the correlation between residuals and fitted values.

Some other info:
1) my overall model fit is not poor, as indicated by the correlation between fitted values & Y. It is around 0.8;
2) most variables in my model have a normal, or at least symmetrical, distribution.
3) conditional residuals are normally distributed as shown in qqplots.
4) conditional residuals are not correlated with any fixed effects, such as Drink or Age.

I have two guesses as to what is going on:
1) maybe the fact that each family is a different size actually violates assumptions of the model?
2) or maybe there is something wrong with estimation of the random effect (family intercept)?

I'd really appreciate your insights as to what is going on here and whether there are any problems with my model.

Thank you very much,
Cherry
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
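The point made in this thread about a response that is bounded below (counts near zero analysed with a linear model) can be made a little more concrete. The following is only a sketch on simulated data, not on the data discussed here: it reuses the simulated Poisson setup from the thread, fits both the deliberately wrong `lm` and a Poisson `glm`, and summarises the curvature left in the `lm` residuals. The seed and the object names `fit_lm`, `fit_glm`, and `curv_lm` are assumptions introduced for illustration.

```r
# Sketch on simulated data; seed and object names are illustrative assumptions.
set.seed(1)

n <- 200
dataset <- data.frame(X = runif(n))
dataset$eta <- -0.1 + 3 * dataset$X
dataset$Y <- rpois(n, lambda = exp(dataset$eta))  # counts, bounded below at 0

# Linear model on count data: deliberately the wrong analysis
fit_lm <- lm(Y ~ X, data = dataset)

# Poisson GLM with log link, matching how the data were simulated
fit_glm <- glm(Y ~ X, family = poisson, data = dataset)

# Residuals against fitted values, side by side
op <- par(mfrow = c(1, 2))
plot(fitted(fit_lm), resid(fit_lm), main = "lm",
     xlab = "fitted", ylab = "residual")
abline(h = 0)
plot(fitted(fit_glm), resid(fit_glm, type = "pearson"), main = "Poisson glm",
     xlab = "fitted", ylab = "Pearson residual")
abline(h = 0)
par(op)

# Numeric summary: quadratic trend of the lm residuals in the fitted values.
# A clearly non-zero coefficient means the straight line misses curvature.
r <- resid(fit_lm)
f <- fitted(fit_lm)
curv_lm <- unname(coef(lm(r ~ f + I(f^2)))["I(f^2)"])
curv_lm
```

In the left panel the residuals should fan out and bend upwards at both ends, while the Pearson residuals of the Poisson fit should scatter more evenly around zero. As noted in the thread, this mechanism does not seem to apply to a standardized, roughly symmetric Y.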
Dear Cherry,

maybe the correlation between your fitted values and the residuals is coming from something like a non-linear effect of Age or Drink on Y? (By the way, judging from the first plot you posted, the correlation did not seem that excessive to me, regardless of the r = 0.5 value, but I might be totally wrong about that.)

To test this in a kind of half-formal way, try this:

    library(mgcv)
    Res1 <- resid(model, scaled = TRUE)
    L1 <- gam(Res1 ~ s(Age), data = data)
    plot(L1, xlab = "Age")
    points(x = data$Age, y = Res1)
    abline(h = 0)

...and then the same for Drink. If there is no remaining non-linear Age effect in the residuals, then this smoother should stay around the horizontal line at 0 for all Age values, and the p-value of the smoother should indicate a non-significant Age effect.

Good luck,
Ulf

On 07.10.2015 at 18:05, Yizhou Ma wrote:
Hi Thierry,

Thank you for clarifying. I agree that high skewness can lead to a nonlinear relationship which cannot be properly modeled in linear models. I have plotted the residuals against all my fixed factors and I cannot find any nonlinear relationship. It is possible that I am missing an important covariate though.

Thanks a lot,
Cherry

On Wed, Oct 7, 2015 at 10:54 AM, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:
My example is not a requirement of a LMM but rather an example of a distribution of a variable which can cause trouble with a LMM. Think of an area. An area cannot be negative. This can cause artefacts in the residuals when you have lots of values near zero. Have a look at this example.

    n <- 200
    dataset <- data.frame(
      X = runif(n)
    )
    dataset$eta <- -.1 + 3 * dataset$X
    dataset$Y <- rpois(n, lambda = exp(dataset$eta))
    # wrong analysis for this kind of data, here just an illustration of the problem
    model <- lm(Y ~ X, data = dataset)
    plot(fitted(model), resid(model))

But this doesn't seem to be the problem in your case. I would recommend that you see if there are patterns in the residuals when you plot them against the covariates. Maybe you are missing an interaction or even an important covariate.

Best regards,
ir. Thierry Onkelinx
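Ulf's residual check can be tried out on data where the answer is known. The sketch below is an illustration under stated assumptions, not the analysis from this thread: the data are simulated with a curved Age effect, and a plain `lm` stands in for the `lmer` fit so that only base R and `mgcv` are needed. The data frame `sim`, the objects `fit`, `check_age`, `check_drink`, and all numeric settings are hypothetical.

```r
# Sketch on simulated data; lm() stands in for lmer() to keep dependencies small.
library(mgcv)

set.seed(42)
n <- 300
sim <- data.frame(Age = runif(n, 20, 40), Drink = rnorm(n))

# Assumed truth: Y is curved in Age and linear in Drink
sim$Y <- 0.05 * (sim$Age - 30)^2 + 0.2 * sim$Drink + rnorm(n)

# Fitted model is linear in Age, so it is misspecified on purpose
fit <- lm(Y ~ Drink + Age, data = sim)
sim$Res <- resid(fit)

# Smooth the residuals against each covariate in turn
check_age <- gam(Res ~ s(Age), data = sim)
check_drink <- gam(Res ~ s(Drink), data = sim)

# The smoother for Age should leave the horizontal line at 0
plot(check_age, xlab = "Age")
abline(h = 0)

# Approximate p-values of the smooth terms
p_age <- summary(check_age)$s.table[1, "p-value"]
p_drink <- summary(check_drink)$s.table[1, "p-value"]
c(Age = p_age, Drink = p_drink)
```

With these settings the smooth of the residuals against Age should be clearly U-shaped with a very small p-value, which is the pattern the check is designed to reveal. For a mixed model, `Res` would instead be taken from `resid(model, scaled = TRUE)` as in the message above.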