Implications of modeling residuals in multilevel models

Ben Bolker 
    on Fri, 22 Dec 2023 17:07:13 -0500 writes:

>> Dear?Ben,
    >> 
    >> These are exactly what I was?looking for! Thank you so much. Please 
    >> allow me to clarify my 2nd?question.
    >> 
    >> When I compare predict(MODEL) with predict(model_with_no_resid_modeled), 
    >> the former looks quite normal and unimodal in shape but the latter looks 
    >> multi-model (see the illustration below).
    >> 
    >> This made me think: should a good-fitting model produce normally 
    >> distributed predicted values? And does such a model do a better?job of 
    >> predicting?the value?of random draw?from the y population?

    > Ideally the *marginal* (overall) distribution of *predictions* 
    > should match the marginal distribution of the response variable (this is 
    > what the 'posterior predictive check' panel of 
    > performance::check_model() does).  

Well, only approximately and only if you have enough degrees of freedom.
Remember the simple (homo schedastic) LS case where

    Cov(Y^) = \sigma^2 * (I - P)

P = H is the projection / hat matrix  X' (X'X)^{-1} X
So, the variances of Y^_i are *smaller* than those of the Y_i

    > If the marginal distribution of your 
    > data is skewed, then the marginal distribution of your prediction should 
    > be skewed in the same way -- if it's not, then it's not doing a good job 
    > describing the data.

    > m <- lm(mpg ~ cyl + disp, mtcars)
    > performance::check_model(m, check = "pp_check")

    > This is a different story from the *residuals* (which tell you 
    > something about the conditional distribution, not the marginal 
    > distribution ...)

    >> ? ? ? ? ? ? ? ?*
    >> ? ? ? ? ? ? * * *
    >> ? ? ? ?* * * * * * *
    >> ? ?* * * * * * * * * *
    >> * * * * * * * * * * * *
    >> -------------------------? vs:
    >> ? ? ? ? ? ? ? ? ? ? ? ? *
    >> ? ? ? ? ? ? ? ? ? ? ? * *
    >> ? ? ? *? ? ? ? ? ? ?* * *
    >> ? ?* *? ?* *? ? ?* * * *
    >> * * * * * * * * * * * *
    >> -------------------------
    >> 
    >> On Fri, Dec 22, 2023 at 12:30?PM Ben Bolker <bbolker at gmail.com
>> 
    >> ? ?This slipped through the cracks.
    >>
>> > Hello All,
    >> >
    >> > Just a follow-up, can we say the model I sketched above is a
    >> location-scale
    >> > model?
    >> >
    >> > Thanks,
    >> > Simon
    >> >
    >> > On Sat, Dec 16, 2023 at 9:30?PM Simon Harmel
>> >
    >> >> Hello All,
    >> >>
    >> >> I have a highly skewed dataset.
    >> 
    >> ? ?As may have been said before, the *marginal* distribution of your
    >> data set (e.g., what you get if you plot hist(y)) is irrelevant/doesn't
    >> matter at all (the models make no assumptions about the marginal
    >> distribution)
    >> 
    >> ? But, my MODEL of choice below shows
    >> >> drastically improved, normally distributed residuals (and
    >> predicted values)
    >> >> compared to other models whose residuals are not modeled.
    >> >>
    >> >> Three quick questions:
    >> >>
    >> >> 1- Is MODEL below assuming that my data come from a population
    >> that looks
    >> >> normal once "X1_categorical" and "X2_numeric" are taken into account
    >> >> as modeled in MODEL?
    >> 
    >> ? ?Yes, the *conditional distribution* of y is Gaussian (but not IID
    >> because you have specified correlation and variance (weights)
    >> structures).
    >> 
    >> 
    >> >> 2- Do these normally distributed predictions work better for a
    >> subject
    >> >> randomly drawn from a similarly skewed population with a
    >> >> known "X1_categorical" and "X2_numeric"?
    >> 
    >> ? ?Don't think I understand what this means.
    >> 
    >> >> 3- I think the distribution of residuals mirrors that of the
    >> data. If so,
    >> >> is it correct to say MODEL below is actually **trimming** my
    >> highly skewed
    >> >> data as if it was distributed as:
    >> 
    >> ? ?That seems an odd way to put it.? I would say that, if the
    >> residuals
    >> look Normally distributed, that means that the covariates in the model
    >> are accounting for the skew.
    >> 
    >> Here's an example of a dataset where the marginal distribution is
    >> heavily skewed but the conditional distribution is (exactly) Normal:
    >> 
    >> set.seed(101)
    >> x <- rgamma(1000, shape = 1)
    >> y <- rnorm(1000, mean = x, sd = 0.1)
    >> hist(y)
    >> 
    >> qqnorm(residuals(lm(y~x)))
    >> 
    >> ? ?Heteroscedasticity (but not autocorrelation) will also change the
    >> distribution of the residuals, but I don't think it's necessary to
    >> illustrate the point here.
    >> 
    >> ? ?I think it would be fair to call this a location-scale model since
    >> you are modeling both the location (mean) and the scale (SD/variance),
    >> although often people use the term for models where there is a random
    >> effect in both the location and the scale model.
    >> 
    >> 
    >> 
    >> 
    >> 
    >> >> ```
    >> >> hist(resid(MODEL, type = "normalized"))
    >> >> ```
    >> >>
    >> >>? ?MODEL <- nlme::lme(y ~ X1_categorical + X2_numeric ...,
    >> >>? ? ? ? ? ?random = ~1| subject,
    >> >>? ? ? ? ? ?data = data,
    >> >>? ? ? ? ? ?correlation = corSymm(~1|subject),
    >> >>? ? ? ? ? ?weights = varComb(varIdent(form = ~ 1 |? X1_categorical ),
    >> >>? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? varPower(form = ~ 
    >> X2_numeric )))
    >> >>
    >> >> Thanks,
    >> >> Simon
    >> >>
    >> >
    >> >? ? ? ?[[alternative HTML version deleted]]
    >> >
    >> > _______________________________________________
    >> > R-sig-mixed-models at r-project.org
    >> <mailto:R-sig-mixed-models at r-project.org> mailing list
    >> > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
    >> <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
    >> 
    >> -- 
    >> Dr. Benjamin Bolker
    >> Professor, Mathematics & Statistics and Biology, McMaster University
    >> Director, School of Computational Science and Engineering
    >> (Acting) Graduate chair, Mathematics & Statistics
    >> ?> E-mail is sent at my convenience; I don't expect replies outside of
    >> working hours.
    >> 
    >> _______________________________________________
    >> R-sig-mixed-models at r-project.org
    >> <mailto:R-sig-mixed-models at r-project.org> mailing list
    >> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
    >> <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>
    >> 

    > -- 
    > Dr. Benjamin Bolker
    > Professor, Mathematics & Statistics and Biology, McMaster University
    > Director, School of Computational Science and Engineering
    > (Acting) Graduate chair, Mathematics & Statistics
    >> E-mail is sent at my convenience; I don't expect replies outside of 
    > working hours.

    > _______________________________________________
    > R-sig-mixed-models at r-project.org mailing list
    > https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models