Dealing with heteroscedasticity in a GLM/M
4 messages · Leila Brook, Alan Haynes, Markus Jäntti +1 more
This is, strictly speaking, the wrong approach, but in order to explore the presence of heteroscedasticity, you could try the linear mixed-effects functions and the variance objects they provide. What I suggest by way of exploration is the following.

If you regress a binomial response as if it were a continuous variable in a standard OLS regression setting, many problems arise: predictions fall outside the unit interval, and the error term is heteroscedastic. That heteroscedasticity is of a known form, however, the variance being p*(1-p), where the probability p is given by the linear predictor x*b.

I would suggest you compare two models, both estimated using lme in the nlme package: one that models the response and includes a variance function accounting for the heteroscedasticity induced by having a binary rather than continuous dependent variable, and a second that adds, using the varComb() function, the heteroscedasticity you are worried about.

Markus
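Markus's comparison might look something like the sketch below in nlme. The data frame and variable names (d, prop, pair, season, position) are hypothetical stand-ins for the poster's data, and varPower on the fitted values is only a crude proxy for the binomial mean-variance relation p*(1-p):

```r
library(nlme)

## Model 1: linear mixed model for the proportion, with a variance
## function standing in for the binomial mean-variance relation.
m1 <- lme(prop ~ season * position, random = ~ 1 | pair, data = d,
          weights = varPower(form = ~ fitted(.)))

## Model 2: the same model, but combining that variance function
## (via varComb) with a per-stratum variance (varIdent) for the
## suspected group-level heteroscedasticity.
m2 <- lme(prop ~ season * position, random = ~ 1 | pair, data = d,
          weights = varComb(varPower(form = ~ fitted(.)),
                            varIdent(form = ~ 1 | season)))

## Likelihood-ratio comparison of the two variance structures
## (same fixed effects, so the REML comparison is legitimate).
anova(m1, m2)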
On 08/23/2012 07:58 AM, Leila Brook wrote:
I am hoping to find a way to account for heterogeneity of variance between categories of explanatory variables in a generalised model. I have searched books and this forum, and haven't found any advice on how to handle this assumption in a generalised-model context, as I can't fit a variance structure in lme4.

As background to my study: I used camera stations set up in pairs (one positioned on a track and one off the track) to record my study species, and used the same pairs in each of two seasons. As my surveys were repeated, I have specified camera pair as a random effect. I am using a binomial model in lme4 to model the proportion of nights an animal was recorded, as a function of the fixed effects of season (2 categories), position (categorical: on or off the track), area (categorical: one of two areas) and continuous habitat variables, plus interactions between them.

I validated the GLM form of the model, including plotting the deviance residuals against my explanatory variables, and have noted that the variance of the residuals appears to differ across the levels of the categorical variables. Differences in the response variable between these categories were part of my research question and are evident in the raw data, so I don't want to remove them from the analysis. I have tried converting my response variable to be continuous, but when I do, it is not normal (too many zeros), nor is the log-transformed variable.

I have read that nlme can fit a variance structure to an LME, but I haven't heard of a way of dealing with heterogeneity in a generalised mixed-effects model, nor found an R package that can fit a variance structure to a GLMM. Can anyone provide advice on how to overcome this problem, or on whether I can continue, with some form of caveat, with the GLMM?

Thanks in advance,
Leila
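For what it's worth, one device sometimes suggested for extra-binomial variation in lme4 is an observation-level random effect. It targets overdispersion rather than a group-specific variance structure, so it is at best a partial answer to the question above; all variable names in this sketch are hypothetical:

```r
library(lme4)

## One row per camera-station-by-season observation; the
## observation-level factor absorbs extra-binomial variation.
d$obs <- factor(seq_len(nrow(d)))

fit <- glmer(cbind(nights_seen, nights_absent) ~
               season * position + area + habitat +
               (1 | pair) + (1 | obs),
             family = binomial, data = d)
summary(fit)
```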
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
Markus Jantti Professor of Economics Swedish Institute for Social Research Stockholm University
4 days later
On Thu, Aug 23, 2012 at 12:58 AM, Leila Brook <leila.brook at my.jcu.edu.au> wrote:
> I am hoping to find a way to account for heterogeneity of variance between categories of explanatory variables in a generalised model. [...] I validated the GLM form of the model, including plotting the deviance residuals against my explanatory variables, and have noted that the variance of residuals for the categorical variables appears to differ.
Stop there. In logit/probit frameworks, the variance is assumed equal for all groups. It is never estimated; the model is not identified otherwise. The effect of heteroskedasticity is not just inefficiency but parameter bias. This makes logit models much more suspect than previously believed, and it means that all of the work you have done so far to "validate" your model is dubious and you need to take a step back.

We are in a bind with logit models. Either we estimate separate models for the separate groups (to avoid heteroskedasticity), but then we are not able to compare coefficients across models because of that differing, but unestimated, variance; or we fit one model that combines the groups, make the wrong assumption, and end up with wrong parameter estimates. I don't mean just a little off, I mean wrong. It's discouraging.

As far as I know, this problem was first popularized by Paul Allison, Scott Long, and Richard Williams, but it is nicely surveyed in this review essay:

Mood, Carina. 2010. "Logistic Regression: Why We Cannot Do What We Think We Can Do, and What We Can Do About It." European Sociological Review 26(1): 67-82.

That has cites to the earlier Allison paper and some of Williams's work. In my opinion, there are no completely safe approaches to dealing with the heteroskedastic group-level error. Richard Williams at Notre Dame gave an excellent presentation about it. He told me he has a paper forthcoming in the Stata Journal about it, but I don't feel free to pass it along to you; I bet his website has more information.

It seems to me that if you try to "pin" one group as the "baseline variance" group and then add properly structured random effects for the other ones, you might get a handle on it. The R package dglm has suggestions along those lines. Good luck. If you get an answer, I'd really like to know what the state of the art is now (this minute)...
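The dglm idea amounts to fitting a dispersion submodel alongside the mean model. A hypothetical sketch follows (data and variable names are invented, and whether a double GLM is a sound treatment of binomial group-level heteroskedasticity is itself part of the debate above):

```r
library(dglm)

## Mean model for the detection proportion, plus a dispersion
## submodel (dformula) letting dispersion differ by group.
fit <- dglm(cbind(nights_seen, nights_absent) ~ season * position,
            dformula = ~ position,
            family = binomial, data = d)
summary(fit)
```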
Paul E. Johnson
Professor, Political Science; Assoc. Director, Center for Research Methods
1541 Lilac Lane, Room 504, University of Kansas
http://pj.freefaculty.org http://quant.ku.edu