LMM diagnostics: conditional residuals correlated highly with fitted values

8 messages · Yizhou Ma, Thierry Onkelinx, Ulf Köther

#
Dear LMM experts:

I am pretty new to LMMs, and I found the following situation
bewildering while running diagnostics on my fitted model: my
conditional residuals are highly correlated with the fitted values.

I have a dataset with multiple families, each with 1-4 siblings. I am trying
to regress Y onto explanatory variables (EVs) including Drink, Gender, and Age,
with a random intercept for family. This is the model I used:
model <- lmer(Y ~ Drink * Gender + Age + (1 | Family_ID),
              data, REML = FALSE)

After fitting the model, I used
plot(model)
to see the relationship between conditional residuals and fitted values. I
expected them to be uncorrelated, and I expected to see homoscedasticity.

Yet to my surprise there is a high correlation (~0.5) between the residuals
and the fitted values (see here <http://imgur.com/pPsG4aR>). I know from
GLMs that this usually suggests nonlinear relationships between the EVs and
the DV.
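
For readers who want to reproduce this check, here is a minimal sketch of how the residual/fitted correlation could be computed. The original dataset is not available, so the data frame `sim`, its simulated values, and the seed are assumptions made purely for illustration; only the model formula follows the one posted above.

```r
library(lme4)

set.seed(42)
# Simulated stand-in for the real data: 189 families with 1-4 siblings each.
n_fam    <- 189
fam_size <- sample(1:4, n_fam, replace = TRUE)
sim <- data.frame(
  Family_ID = factor(rep(seq_len(n_fam), times = fam_size))
)
n <- nrow(sim)
sim$Drink  <- rnorm(n)
sim$Gender <- factor(sample(c("F", "M"), n, replace = TRUE))
sim$Age    <- runif(n, 22, 36)
fam_eff    <- rnorm(n_fam, sd = 0.6)  # hypothetical family intercepts
sim$Y <- 1 + 0.15 * sim$Drink - 0.03 * sim$Age +
  fam_eff[as.integer(sim$Family_ID)] + rnorm(n, sd = 0.8)

fit <- lmer(Y ~ Drink * Gender + Age + (1 | Family_ID),
            data = sim, REML = FALSE)

# Conditional residuals and conditional fitted values
# (both include the predicted family intercepts).
res <- resid(fit)
fv  <- fitted(fit)
r_res_fit <- cor(fv, res)
r_res_fit

plot(fv, res, xlab = "Fitted values", ylab = "Conditional residuals")
abline(h = 0, lty = 2)
```

With the real data, `sim` would be replaced by the actual data frame and `fit` by the fitted model.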

I read some online posts (post1
<http://stats.stackexchange.com/questions/43566/strange-pattern-in-residual-plot-from-mixed-effect-model>
post2
<http://stats.stackexchange.com/questions/168179/correlation-between-standardized-residuals-and-fitted-values-in-a-linear-mixed-e/168210#168210>)
that suggest this can result from a poor model fit. So I tried a few
different models, including: 1) log-transforming Drink, which is originally
positively skewed; 2) adding random slopes for Drink, Age, etc. None of these
changes made a substantial difference to the correlation between residuals
and fitted values.

Some other info:
1) my overall model fit is not poor, as indicated by the correlation between
fitted values & Y, which is around 0.8;
2) most variables in my model have a normal, or at least symmetrical,
distribution;
3) conditional residuals are normally distributed, as shown in Q-Q plots;
4) conditional residuals are not correlated with any fixed effects, such as
Drink or Age.
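
The checks in points 1, 3, and 4 could be run along the following lines. This is only a sketch on simulated data: `sim`, the effect sizes, and the seed are hypothetical, and the real analysis would use the original data frame and fitted model instead.

```r
library(lme4)

set.seed(1)
# Hypothetical data: 150 families of 1-4 siblings (not the original dataset).
fam <- factor(rep(1:150, times = sample(1:4, 150, replace = TRUE)))
n   <- length(fam)
sim <- data.frame(Family_ID = fam,
                  Drink  = rnorm(n),
                  Age    = runif(n, 22, 36),
                  Gender = factor(sample(c("F", "M"), n, replace = TRUE)))
sim$Y <- 0.2 * sim$Drink - 0.03 * sim$Age +
  rnorm(150, sd = 0.6)[as.integer(fam)] + rnorm(n, sd = 0.8)

fit <- lmer(Y ~ Drink * Gender + Age + (1 | Family_ID),
            data = sim, REML = FALSE)
res <- resid(fit)

# 1) Agreement between conditional fitted values and the response.
cor_fit_y <- cor(fitted(fit), sim$Y)

# 3) Normal Q-Q plot of the conditional residuals.
qqnorm(res)
qqline(res)

# 4) Correlation of the residuals with the continuous covariates.
cor_res_cov <- c(Drink = cor(res, sim$Drink), Age = cor(res, sim$Age))

cor_fit_y
cor_res_cov
```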

I have two guesses as to what is going on:
1) maybe the fact that each family is a different size actually violates
assumptions of the model?
2) or maybe there is something wrong with estimation of the random effect
(family intercept)?

I'd really appreciate your insights as to what is going on here and whether
there are any problems with my model.

Thank you very much,
Cherry
#
Dear Cherry,

Please don't post in HTML. Have a look at the posting guide.

You'll need to provide more information. What is the class of each variable
(continuous, count, presence/absence, factor, ...)? What is the output of
summary(model)?

Best regards,

ir. Thierry Onkelinx
Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
Forest
team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
Kliniekstraat 25
1070 Anderlecht
Belgium

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

2015-10-06 17:15 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:

  
  
#
Hi Thierry,

Thank you for your reply and sorry for the HTML thing. Below is my
summary(model) output.

Y, Drink, and Age are continuous variables.
Gender is a factor with levels F and M.
Family_ID is a factor.

Linear mixed model fit by maximum likelihood  ['lmerMod']
Formula: Y ~ Drink * Gender + Age + (1 | Family_ID)
   Data: data

     AIC      BIC   logLik deviance df.resid
  1046.4   1074.0   -516.2   1032.4      372

Scaled residuals:
     Min       1Q   Median       3Q      Max
-2.67228 -0.56085 -0.02968  0.66166  2.91452

Random effects:
 Groups    Name        Variance Std.Dev.
 Family_ID (Intercept) 0.3550   0.5958
 Residual              0.6162   0.7850
Number of obs: 379, groups:  Family_ID, 189

Fixed effects:
               Estimate Std. Error t value
(Intercept)     1.10309    0.43921   2.511
Drink           0.16425    0.08031   2.045
Gender.M       -0.19364    0.10874  -1.781
Age            -0.03377    0.01489  -2.268
Drink:Gender.M -0.13647    0.10681  -1.278

Correlation of Fixed Effects:
         (Intr) Drnk   Gndr.M Age
Drink    -0.098
Gender.M -0.040 -0.249
Age      -0.985  0.158 -0.054
Drnk:G.M  0.042 -0.737 -0.021 -0.085

Thank you very much,
Cherry

On Wed, Oct 7, 2015 at 5:14 AM, Thierry Onkelinx
<thierry.onkelinx at inbo.be> wrote:
#
Can you elaborate on what Y is? Does it have a lower boundary? And if so, do
you have observations near that boundary? E.g. Y must be non-negative and
the dataset contains observations close to 0. A densityplot would be useful.
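
A density plot of this kind could be sketched in base R as follows. The vector `y` is a hypothetical stand-in for the real response (e.g. `data$Y`), and the "near the minimum" threshold of 0.1 SD is an arbitrary choice for illustration only.

```r
set.seed(7)
# Hypothetical continuous outcome; replace `y` with the real response.
y <- rnorm(379)

d <- density(y)
plot(d, main = "Density of Y", xlab = "Y")
rug(y)  # tick marks for individual observations; pile-ups at a boundary show up here

# Rough boundary check: share of observations within 0.1 SD of the observed minimum.
near_min <- mean(y <= min(y) + 0.1 * sd(y))
near_min
```

`lattice::densityplot(~ y)` would give a similar picture for those who prefer lattice graphics.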

ir. Thierry Onkelinx

2015-10-07 17:09 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:

  
  
#
Y is a brain measure that has been standardized. A histogram of Y is here:
http://imgur.com/Um8yyuu

I am confused about the "Y must be non-negative and the dataset
contains observations close to 0" part. Is that a requirement for
Y? If so, then my model could be wrong.

On Wed, Oct 7, 2015 at 10:15 AM, Thierry Onkelinx
<thierry.onkelinx at inbo.be> wrote:
#
My example is not a requirement of an LMM but rather an example of a
distribution of a variable which can cause trouble with an LMM. Think of an
area. An area cannot be negative. This can cause artefacts in the
residuals when you have lots of values near zero. Have a look at this
example.

n <- 200
dataset <- data.frame(
  X = runif(n)
)
dataset$eta <- -.1 + 3 * dataset$X
dataset$Y <- rpois(n, lambda = exp(dataset$eta))
# Wrong analysis for this kind of data; here just an illustration of the problem.
model <- lm(Y ~ X, data = dataset)
plot(fitted(model), resid(model))

But this doesn't seem to be the problem in your case.

I would recommend that you see if there are patterns in the residuals when
you plot them against the covariates. Maybe you are missing an interaction
or even an important covariate.

Best regards,


ir. Thierry Onkelinx

2015-10-07 17:29 GMT+02:00 Yizhou Ma <maxxx848 at umn.edu>:

  
  
#
Hi Thierry,

Thank you for clarifying. I agree that high skewness can lead to
a nonlinear relationship which cannot be properly modeled in linear
models.

I have plotted the residuals against all my fixed-effect covariates and I
cannot find any nonlinear relationship. It is possible that I am missing an
important covariate though.

Thanks a lot,
Cherry


On Wed, Oct 7, 2015 at 10:54 AM, Thierry Onkelinx
<thierry.onkelinx at inbo.be> wrote:
#
Dear Cherry,

maybe the correlation between your fitted values and the residuals is
coming from something like a non-linear effect of Age or Drink on Y? (By
the way, judging from the first plot you posted, the correlation did not
seem that excessive to me, regardless of the r = 0.5 value, though I might
be totally wrong about that.) To test this in a kind of half-formal way,
try this:

library(mgcv)
# Scaled conditional residuals from the fitted lmer model
Res1 <- resid(model, scaled = TRUE)
# Smooth the residuals over Age (capitalised as in the model formula)
L1 <- gam(Res1 ~ s(Age), data = data)
plot(L1, xlab = "Age")
points(x = data$Age, y = Res1)
abline(h = 0)

...and then the same for Drink. If there is no remaining non-linear Age
effect in the residuals then this smoother should be around the
horizontal line at 0 for all Age values, and the p-value of the smoother
should then indicate a non-significant Age effect.
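
As a sketch of how the smoother's p-value could be read off for each covariate, the example below simulates data in which Y depends quadratically on Age while the fitted model includes only a linear Age term. The data frame `sim`, the helper `smooth_check()`, and all effect sizes are hypothetical and chosen only to make the pattern visible.

```r
library(lme4)
library(mgcv)

set.seed(3)
fam <- factor(rep(1:150, times = sample(1:4, 150, replace = TRUE)))
n   <- length(fam)
sim <- data.frame(Family_ID = fam, Age = runif(n, 22, 36), Drink = rnorm(n))
# Y has a quadratic Age effect that the linear model below does not capture.
sim$Y <- 0.05 * (sim$Age - 29)^2 + 0.2 * sim$Drink +
  rnorm(150, sd = 0.6)[as.integer(fam)] + rnorm(n, sd = 0.8)

fit <- lmer(Y ~ Drink + Age + (1 | Family_ID), data = sim, REML = FALSE)
sim$Res1 <- resid(fit, scaled = TRUE)

# Hypothetical helper: smooth the residuals over one covariate and return
# the effective degrees of freedom and p-value of the smooth term.
smooth_check <- function(var) {
  g <- gam(reformulate(sprintf("s(%s)", var), response = "Res1"), data = sim)
  summary(g)$s.table[1, c("edf", "p-value")]
}

checks <- sapply(c("Age", "Drink"), smooth_check)
checks
```

A small p-value for a covariate suggests leftover structure in the residuals along that covariate; in this simulation that is expected for Age by construction.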

Good luck,

Ulf




On 07.10.2015 at 18:05, Yizhou Ma wrote:
--

_____________________________________________________________________

Universitätsklinikum Hamburg-Eppendorf; public-law corporation (Körperschaft des öffentlichen Rechts); place of jurisdiction: Hamburg | www.uke.de
Board members: Prof. Dr. Burkhard Göke (Chair), Prof. Dr. Dr. Uwe Koch-Gromus, Joachim Prölß, Rainer Schoppik
_____________________________________________________________________
