I am looking to model quality of life (QOL) as a DV over time. The DV shows strong negative skew, and I am wondering about the best way to handle this (more detail below). The frequency distribution of QOL and example code are at the end of this message.

Many participants simply report that their quality of life is great, so there is a ceiling effect with many values clustered at the highest value. While the distribution resembles y=e^x, I have not been able to fit a distribution via GLMM that results in normally distributed and homoskedastic residuals (including gamma and inverse Gaussian). A number of DV transformations have not worked either (e.g., log, exponential, Box-Cox), in large part because of the large proportion of values at the maximum level of QOL, which creates a spike at the end of the distribution. I could try zero-inflated models by transforming the DV (multiply by -1 and shift so that the starting value is 0), but even then there would still be a disproportionate number of values clustered at one end.

My question: I am particularly interested in the fixed-effects parameters from a longitudinal model, and was thinking of testing these parameters using percentile bootstrap CIs via confint(). However, the residuals from an lmer model are both non-normal and heteroskedastic. Will a percentile bootstrap of the beta coefficients address this, or can only the wild bootstrap address these issues (as it is targeted at the residuals)? I have a basic understanding of the bootstrap but am not an expert regarding its use in linear models.

Many thanks!
# Example lmer code
model <- lmer(QOL ~ poly(time, 2) + (time | ID), data = dataset, REML = FALSE)

# Frequency distribution
QOL  valid_percent
25   0.000308261
30   0.000308261
32   0.000308261
34   0.000616523
38   0.000308261
41   0.000308261
45   0.000308261
46   0.000308261
47   0.000308261
48   0.000616523
49   0.000616523
50   0.000616523
51   0.000308261
52   0.000308261
53   0.001541307
54   0.000616523
55   0.001233046
56   0.000616523
57   0.000924784
58   0.000308261
59   0.000924784
60   0.000924784
61   0.001849568
62   0.001541307
63   0.003082614
64   0.001849568
65   0.00215783
66   0.002466091
67   0.004007398
68   0.002466091
69   0.004007398
70   0.002466091
71   0.003699137
72   0.006781751
73   0.004932183
74   0.006781751
75   0.006165228
76   0.007090012
77   0.007706535
78   0.008631319
79   0.010789149
80   0.015104809
81   0.014488286
82   0.01541307
83   0.020345253
84   0.025893958
85   0.03298397
86   0.036066585
87   0.053020962
88   0.064426634
89   0.080147966
90   0.088779285
91   0.452219482
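On the bootstrap question above, one point may be worth checking before choosing between percentile and wild variants. As far as the lme4 documentation indicates, confint() with method = "boot" runs bootMer(), whose default is a parametric bootstrap that simulates new responses from the fitted Gaussian model; boot.type = "perc" only changes how the interval is read off the replicates. A common alternative that does not lean on the residual assumptions is a case (cluster) bootstrap that resamples whole participants. The following is a minimal sketch of that idea, not the poster's method. The data are simulated stand-ins (60 hypothetical participants, 8 timepoints, scores capped at 91), and the helper names fit_fixef and case_boot are made up for illustration.

```r
library(lme4)

set.seed(1)

# Hypothetical stand-in data: 60 participants x 8 timepoints, capped at 91
n_id <- 60
n_time <- 8
dataset <- expand.grid(time = 0:(n_time - 1), ID = factor(seq_len(n_id)))
id_idx <- as.integer(dataset$ID)
u0 <- rnorm(n_id, sd = 6)   # random intercepts
u1 <- rnorm(n_id, sd = 1)   # random slopes
latent <- 90 + u0[id_idx] + (-0.5 + u1[id_idx]) * dataset$time +
  rnorm(nrow(dataset), sd = 5)
dataset$QOL <- pmin(round(latent), 91)   # impose the ceiling

# Fit the model from the post and return only the fixed effects
fit_fixef <- function(d) {
  fixef(lmer(QOL ~ poly(time, 2) + (time | ID), data = d, REML = FALSE))
}

# Case bootstrap: resample participants with replacement, keep each
# participant's rows together, refit, and collect the fixed effects
case_boot <- function(d, R = 100) {
  ids <- unique(d$ID)
  rows_by_id <- split(seq_len(nrow(d)), d$ID)
  t(replicate(R, {
    drawn <- sample(ids, length(ids), replace = TRUE)
    pieces <- lapply(seq_along(drawn), function(k) {
      piece <- d[rows_by_id[[as.character(drawn[k])]], ]
      piece$ID <- k   # relabel so a participant drawn twice is two clusters
      piece
    })
    boot_d <- do.call(rbind, pieces)
    boot_d$ID <- factor(boot_d$ID)
    fit_fixef(boot_d)
  }))
}

boot_est <- case_boot(dataset, R = 100)   # some refits may report singular fits

# Percentile 95% intervals, one column per fixed effect
ci <- apply(boot_est, 2, quantile, probs = c(0.025, 0.975))
ci
```

With unbalanced data, note that poly(time, 2) builds its basis from whichever rows are in each resample, so raw polynomial terms (poly(time, 2, raw = TRUE)) may be easier to compare across replicates. R = 100 is only to keep the sketch quick; real use would need many more resamples.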
Non-Normal and Heteroskedastic Residuals in Longitudinal Model Due to Non-Normal DV - Percentile Bootstrap Sufficient, or Wild Bootstrap Needed?
3 messages · Philippi, Tom, David Jones
3 days later
David--

I apologize in advance for not answering your precise question, but no one else has responded, and this response might be more helpful than nothing.

If I understand your frequency data, nearly half of your observations are tied at the extreme value of 91. No transform is going to make that distribution approximately normal. Without rather large sample sizes, most forms of bootstrapping will not produce confidence intervals with nominal and symmetric coverage. Further, modeling changes in the _mean_ of such values can muddle or mislead about changes over time.

If you are primarily interested in the fixed effects, would quantile regression perhaps address your questions of interest? I don't know "quality of life", but in my field, when I have oddly-distributed response variables, I'm almost always interested in more than the mean, as the temporal changes are more than a simple shift of the entire distribution. For your example data, if 45% of the responses were 91, then longitudinal trends in a mean are driven by a mixture of changes in that fraction plus shifts in the length or width of the tail of lower values. Quantile regression on the lower quantiles (the median in the above data is 90) might be more informative, as well as more applicable to such data. If subjects either converge on high scores over time, or start out with high scores but then diverge as some fraction of subjects accumulate health problems and have their scores decline over time, quantile regression might better characterize such changes.

I have used lqmm with longitudinal data on limpet sizes with fixed plots as random effects, and am exploring it for temporal trends in water quality. The vignette for lqmm uses the Orthodont data from nlme, and includes the equivalent of (1 + time | subject) as a random effect. lqmm includes a bootstrap function for objects of class lqm or lqmm.
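A minimal sketch of the lqmm workflow described above, using the Orthodont data rather than the QOL data. The choice of the 0.25 quantile, the centring of age at 11, the "pdSymm" covariance, and R = 50 bootstrap replicates are illustrative assumptions, not settings taken from this thread; please check the argument names against the lqmm documentation for your installed version.

```r
library(lqmm)

# Orthodont ships with nlme; convert to a plain data frame and centre age
data(Orthodont, package = "nlme")
ortho <- as.data.frame(Orthodont)
ortho$age_c <- ortho$age - 11

# Linear quantile mixed model for the lower quartile of distance, with a
# random intercept and random slope for age within Subject
# (roughly the lme4-style term (1 + age_c | Subject))
fit_q25 <- lqmm(fixed = distance ~ age_c, random = ~ age_c, group = Subject,
                covariance = "pdSymm", tau = 0.25, data = ortho)
coef(fit_q25)   # fixed effects at tau = 0.25

# Bootstrap the fitted model; the lqmm package provides a boot() method
# for lqm and lqmm objects
boot_q25 <- boot(fit_q25, R = 50, seed = 1)
summary(boot_q25)
```

For the QOL data the analogous call would presumably use QOL as the response, time as the covariate, and ID as the grouping factor, fitted at one or more quantiles below the median, since the upper quantiles sit at the ceiling of 91.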
I have yet to simulate highly skewed or mixture model WQ data to see if (when) bootstrapped confidence intervals have reasonable coverage, but that is in the queue for this fall.

Also, perhaps the real experts on this list can chime in on the form of your model. While I understand mixed models with linear terms for time as a fixed effect and within-subject random effect, I'm not clear on what linear and quadratic fixed effect terms but only linear within-subject terms means, especially if subjects differ in starting or drop-out times.

My apologies for not directly answering your question. And certainly your mileage will vary.

Tom

"To do science is to search for repeated patterns, not simply to accumulate facts..." --Robert MacArthur 1972, Geographical Ecology

"Statistical methods of analysis are intended to aid the interpretation of data that are subject to appreciable haphazard variability" --Cox & Hinkley 1974; Theoretical Statistics

On Mon, Jun 11, 2018 at 6:59 AM, David Jones <david.tn.jones at gmail.com> wrote:
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Hi Tom,

Thank you for your detailed follow-up. You are correct that many of the observations are at the extreme value. I am fortunate to have a fairly large sample (~700 participants with roughly 8 timepoints each), and I am hopeful that bootstrapping could come to the rescue. That being said, it's a tricky situation, as you suggest.

I had not considered quantile regression, and a mixed quantile approach might be a great way to get at this. I am very grateful for this overall suggestion as well as the specifics to look for in the lqmm vignette (and how it corresponds to nlme). It is a difficult analytic situation and your input has been very helpful.

David
On Thu, Jun 14, 2018 at 12:54 PM, Philippi, Tom <tom_philippi at nps.gov> wrote: