Implications of modeling residuals in multilevel models
On 2023-12-22 4:50 p.m., Simon Harmel wrote:
Dear?Ben, These are exactly what I was?looking for! Thank you so much. Please allow me to clarify my 2nd?question. When I compare predict(MODEL) with predict(model_with_no_resid_modeled), the former looks quite normal and unimodal in shape but the latter looks multi-model (see the illustration below). This made me think: should a good-fitting model produce normally distributed predicted values? And does such a model do a better?job of predicting?the value?of random draw?from the y population?
Ideally the *marginal* (overall) distribution of *predictions* should match the marginal distribution of the response variable (this is what the 'posterior predictive check' panel of performance::check_model() does). If the marginal distribution of your data is skewed, then the marginal distribution of your prediction should be skewed in the same way -- if it's not, then it's not doing a good job describing the data. m <- lm(mpg ~ cyl + disp, mtcars) performance::check_model(m, check = "pp_check") This is a different story from the *residuals* (which tell you something about the conditional distribution, not the marginal distribution ...)
? ? ? ? ? ? ? ?*
? ? ? ? ? ? * * *
? ? ? ?* * * * * * *
? ?* * * * * * * * * *
* * * * * * * * * * * *
-------------------------? vs:
? ? ? ? ? ? ? ? ? ? ? ? *
? ? ? ? ? ? ? ? ? ? ? * *
? ? ? *? ? ? ? ? ? ?* * *
? ?* *? ?* *? ? ?* * * *
* * * * * * * * * * * *
-------------------------
On Fri, Dec 22, 2023 at 12:30?PM Ben Bolker <bbolker at gmail.com
<mailto:bbolker at gmail.com>> wrote:
? ?This slipped through the cracks.
On 2023-12-22 1:10 p.m., Simon Harmel wrote:
> Hello All,
>
> Just a follow-up, can we say the model I sketched above is a
location-scale
> model?
>
> Thanks,
> Simon
>
> On Sat, Dec 16, 2023 at 9:30?PM Simon Harmel
<sim.harmel at gmail.com <mailto:sim.harmel at gmail.com>> wrote:
>
>> Hello All,
>>
>> I have a highly skewed dataset.
? ?As may have been said before, the *marginal* distribution of your
data set (e.g., what you get if you plot hist(y)) is irrelevant/doesn't
matter at all (the models make no assumptions about the marginal
distribution)
? But, my MODEL of choice below shows
>> drastically improved, normally distributed residuals (and
predicted values)
>> compared to other models whose residuals are not modeled.
>>
>> Three quick questions:
>>
>> 1- Is MODEL below assuming that my data come from a population
that looks
>> normal once "X1_categorical" and "X2_numeric" are taken into account
>> as modeled in MODEL?
? ?Yes, the *conditional distribution* of y is Gaussian (but not IID
because you have specified correlation and variance (weights)
structures).
>> 2- Do these normally distributed predictions work better for a
subject
>> randomly drawn from a similarly skewed population with a
>> known "X1_categorical" and "X2_numeric"?
? ?Don't think I understand what this means.
>> 3- I think the distribution of residuals mirrors that of the
data. If so,
>> is it correct to say MODEL below is actually **trimming** my
highly skewed
>> data as if it was distributed as:
? ?That seems an odd way to put it.? I would say that, if the
residuals
look Normally distributed, that means that the covariates in the model
are accounting for the skew.
Here's an example of a dataset where the marginal distribution is
heavily skewed but the conditional distribution is (exactly) Normal:
set.seed(101)
x <- rgamma(1000, shape = 1)
y <- rnorm(1000, mean = x, sd = 0.1)
hist(y)
qqnorm(residuals(lm(y~x)))
? ?Heteroscedasticity (but not autocorrelation) will also change the
distribution of the residuals, but I don't think it's necessary to
illustrate the point here.
? ?I think it would be fair to call this a location-scale model since
you are modeling both the location (mean) and the scale (SD/variance),
although often people use the term for models where there is a random
effect in both the location and the scale model.
>> ```
>> hist(resid(MODEL, type = "normalized"))
>> ```
>>
>>? ?MODEL <- nlme::lme(y ~ X1_categorical + X2_numeric ...,
>>? ? ? ? ? ?random = ~1| subject,
>>? ? ? ? ? ?data = data,
>>? ? ? ? ? ?correlation = corSymm(~1|subject),
>>? ? ? ? ? ?weights = varComb(varIdent(form = ~ 1 |? X1_categorical ),
>>? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? varPower(form = ~
X2_numeric )))
>>
>> Thanks,
>> Simon
>>
>
>? ? ? ?[[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-mixed-models at r-project.org
<mailto:R-sig-mixed-models at r-project.org> mailing list
<https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models> -- Dr. Benjamin Bolker Professor, Mathematics & Statistics and Biology, McMaster University Director, School of Computational Science and Engineering (Acting) Graduate chair, Mathematics & Statistics ?> E-mail is sent at my convenience; I don't expect replies outside of working hours.
_______________________________________________
R-sig-mixed-models at r-project.org
<mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
<https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
Dr. Benjamin Bolker Professor, Mathematics & Statistics and Biology, McMaster University Director, School of Computational Science and Engineering (Acting) Graduate chair, Mathematics & Statistics > E-mail is sent at my convenience; I don't expect replies outside of working hours.