
Including random effects creates structure in the residuals

5 messages · Paul Johnson, Peter Claussen, Pierre de Villemereuil

#
Dear all,

I have an issue that I can't get my head around. I am working on a human cohort dataset studying heart rate. We have repeated measures at several time points and a model with different slopes according to binned age categories (the variable called "broken" hereafter, for "broken lines").

My issue is that when I include an individual ID effect (to account for the repeated measures), I obtain structured residuals, whereas this is not the case for the model without this effect.

Here are my models:
mod_withID <- lmer(cardfreq ~ sex + 
								broken + 
								age:broken + 
								betabloq + 
								cafethe + 
								tabac + 
								alcool +
								(1|visite) +
								(1|id),
				   data = sub)
mod_noID <- lmer(cardfreq ~ sex + 
								broken + 
								age:broken + 
								betabloq + 
								cafethe + 
								tabac + 
								alcool +
								(1|visite),
				  data = sub)

The AIC (computed with a fit with REML = FALSE) clearly favours the model including the ID effect:
AIC(mod_withID)
75184.51
AIC(mod_noID)
76942.09

Yet the model including the ID effect shows a poor fit from the residuals' point of view (structured residuals), as the plots below show:
- The residuals with the ID effect:
https://ibb.co/b6WsFx
- The residuals without the ID effect:
https://ibb.co/fFVDNc
I'm a bit puzzled by this. Why would adding an individual effect create such a structure in the residual part? Why does this covariance between the individual BLUPs and the residuals arise?

I'd happily take anyone's input on this as I'm at a loss regarding what to do to solve this.

Cheers,
Pierre
#
Hi Pierre,

I don't think there is a problem with the residuals. Just to check: the problem you see is that there's a linear trend in the residuals-vs-fitted-values plot when the ID random effect is included (which in a standard OLS LM would be impossible).

The reason for the correlation is that the fitted values contain the ID random effects, and these are inevitably correlated with the residuals. My intuitive understanding of this is as follows. Say some students sit a test twice, on two separate days. A student's score on a given day will be a combination of their ability (ID random effect) and unmeasured (i.e. noise) factors, like how the student was feeling on that day. Assuming that both ability and luck contribute substantially to the scores, it's inevitable that the extreme upper end of the distribution will be populated by scores from students who are both able (high ID random effect) and lucky on that day (high error residual). The same goes in the negative direction for the lower end of the distribution. This is the basis of regression to the mean: if we pick a student with an extreme score and re-test them, we expect their score to be less extreme. If I remember correctly, it's fairly straightforward to predict the correlation between the residuals and fitted values for a given model.
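The test-score example can be sketched with a quick simulation (made-up data, my own sketch, not Pierre's cohort):

```r
# Minimal simulation of the test-score example: score = ability + daily noise
library(lme4)
set.seed(1)
n <- 200
ability <- rnorm(n)                           # ID random effect ("ability")
dat <- data.frame(
  id = factor(rep(1:n, each = 2)),            # each student tested twice
  y  = rep(ability, each = 2) + rnorm(2 * n)  # daily noise on top of ability
)
fit <- lmer(y ~ 1 + (1 | id), data = dat)
# the fitted values contain the BLUPs, so they correlate with the residuals:
cor(fitted(fit), resid(fit))
```

Running this should give a clearly positive correlation, even though the true abilities and the true noise were generated independently.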

On the broader topic of checking residuals from GLMMs…
I wrote a simple function to check residuals from lme4 fits by simulating residuals from the fitted model and plotting them on top of the real residuals. If they look similar on several simulated data sets then I'm reassured that the model fits well. This is particularly useful for non-normal GLMMs, where (despite popular belief) there's no assumption of normality of the Pearson residuals.

library(devtools)
install_github("pcdjohnson/GLMMmisc")  # install helper package from GitHub
library(GLMMmisc)
library(lme4)
# fit the sleepstudy example and overlay simulated residuals on the real ones
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
sim.residplot(fm1)
# note the correlation between the residuals and the fitted values

Florian Hartig has written a more sophisticated package that uses the same basic idea called DHARMa:
https://cran.r-project.org/web/packages/DHARMa/index.html
His blog post:
https://theoreticalecology.wordpress.com/2016/08/28/dharma-an-r-package-for-residual-diagnostics-of-glmms/

All the best,
Paul
#
Hi Paul,

Thank you for your response and interest in my question. While I agree that regression to the mean does exist in these settings, I don't see why it should yield such a correlation between the BLUPs and the residuals (after all, if the two were totally independent, you'd still get the phenomenon you're describing, wouldn't you?). Could you explain why this should be the case? Maybe I'm missing a big point in your explanation; if so, please forgive me.

It got me thinking however, that the correlation between the BLUPs and the residuals could arise from a fundamental constraint in the data as you suggested and I think I now understand what is going on (again, if this is what you suggested, please forgive me as I might have misunderstood your point). A short summary is that it arises from an unbalanced design in the repeated measures (as some individuals do not come back to complete the study).

This can be seen in the following graph, which shows the residuals (e) against the BLUPs (u, which also contain the effect of "visit", though that doesn't much affect the trend here), depending on whether we have 1, 2, 3 or 4 repeated measures for that individual:
https://ibb.co/dDgF3H

A perfect linear covariation should be expected with only 1 visit, because the BLUP and the residual are then basically non-identifiable, while this constraint fades as more repeated measures are added to the data. Does this interpretation make sense to you?
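For what it's worth, a toy simulation of an unbalanced design (my own sketch, not the cohort data) reproduces this pattern:

```r
# Unbalanced design: 50 individuals each with 1, 2, 3 or 4 visits
library(lme4)
set.seed(1)
nvisits <- rep(1:4, each = 50)                   # visits per individual
id  <- factor(rep(seq_along(nvisits), nvisits))  # one level per individual
y   <- rep(rnorm(length(nvisits)), nvisits) +    # individual effect
       rnorm(length(id))                         # residual noise
fit <- lmer(y ~ 1 + (1 | id))
d   <- data.frame(id, n = rep(nvisits, nvisits), e = resid(fit))
d$u <- ranef(fit)$id[as.character(d$id), 1]      # match BLUP to each obs
# correlation between BLUP (u) and residual (e) by number of visits:
sapply(split(d, d$n), function(x) cor(x$u, x$e))
```

With a single visit the correlation is exactly 1 (u and e are both proportional to that individual's deviation from the fixed-effect prediction, split by the shrinkage factor), and it weakens as more repeated measures pin down the individual effect.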

Thank you for your help! The bit about checking residuals in GLMMs is very interesting too; I'll think about DHARMa next time I have to do this for a GLMM!

Cheers,
Pierre

On Tuesday 27 February 2018 at 12:03:17 CET, Paul Johnson wrote:
#
Pierre,

I'm going to disagree: the residuals from the model without the ID effect also suffer from a bad fit.

There appear to be a series of bands in https://ibb.co/fFVDNc , on the left half of the graph. If you were to change the scale (say, focusing on the region at about 59), you might see that these bands have a bias as well. My suspicion is that the location of each band represents the location of an individual intercept, (1|id).

You might remove the bias by including an individual random slope, i.e. (age | id). The data might also be autocorrelated.
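Since I don't have your data, here is the idea sketched on lme4's built-in sleepstudy data as a stand-in, comparing a random-intercept model with a random-slope model:

```r
# Random intercept vs. random slope, on lme4's built-in sleepstudy data
library(lme4)
fm_int   <- lmer(Reaction ~ Days + (1 | Subject), sleepstudy)
fm_slope <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
anova(fm_int, fm_slope)  # refits with ML and compares the two models
```

In your case the analogue would be swapping (1|id) for (age|id), though with at most 4 visits per individual the slope variance may be hard to estimate.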

Cheers,
Peter

#
Hi Peter,

That series of bands is most likely due to the discrete nature of all of the predictors (only "age" is quantitative here, and even then it is a "count" predictor). This results in a "structure" in the fitted values but, as far as I can tell, not in the residuals for this model without the ID effect.

Anyhow, there are not enough individual replicates (up to 4) to explain these bands with the individual effect alone.

Cheers,
Pierre

On Tuesday 27 February 2018 at 15:40:27 CET, Peter Claussen wrote: