Dear R-sig-mixed:

I was struck today by the way the Internet has accelerated research. At one time, it might have taken a month or two to track down the articles on this problem and conclude that I need to ask for advice. Now, however, I realize the need within hours.

Recall that the question that started us debating a few days ago was a logistic regression in which the OP noticed the mismatch between the predicted probability of success and the observed fraction. We were debating that, and it had completely slipped my mind that there is a separate literature on exactly that kind of problem. Yesterday, somebody else asked me to estimate a logit model in which there were more than 40000 cases but only a few hundred "successes". That's what reminded me of the "rare events" problem and logistic regression parameter estimate bias. And I think that's the issue that we need to clear up with glmer.

What do you think? Since a multilevel model can be seen as a penalized ML estimation (à la Pinheiro and Bates, or as explained in Simon Wood, Generalized Additive Models), are we able to get a bias-corrected variant?

Furthermore, could lme4's predict method be made to produce "good" confidence intervals? And that leads down a separate path to a huge hassle about competing ways to estimate CIs in glm and the possible need to apply extra corrections in some special cases. I'll write down that problem to ask you about it later if you help me understand this one.

Here's my brief novel on what I've been Googling about for the past 10 hours or so. If it helps you, let me know. If you think I'm wrong, especially urgently let me know.

To the political science audience, that's a "rare events" logistic regression problem. Our most heavily cited methods paper on that is:

King, G., & Zeng, L. (2001). Logistic Regression in Rare Events Data. Political Analysis, 9(2), 137-163.
http://pan.oxfordjournals.org/content/9/2/137.abstract

Logistic parameter estimates (mainly the intercept) are wrong and estimated probabilities are wrong. King & Zeng provided Stata code for a function "relogit" and later adapted the same for R (package: Zelig). Zelig tries to re-organize the whole regression experience for the R user, and I didn't want that, so I started looking into the various corrections to see if I could write an adapter to take a glm or a glmer output and "bias correct" it. It appears, superficially at least, that I only need to adjust the intercept estimate by a weighting factor, which would be super easy to do.

Quite by chance, I found this blog post by Paul Allison, and it's really interesting!

Logistic Regression for Rare Events (2012-02-13)
http://www.statisticalhorizons.com/logistic-regression-for-rare-events

And, wow, is it subtle. Read that over a few times, see if you agree with me. In a kind way, he says the "rare events" business is a red herring, and instead we need bias-corrected logistic regression estimates. Use David Firth's method. The "prior correction of the intercept" discussed in King and Zeng is not the best approach. Instead, we should see this as a symptom of the more general problem that ML estimates are biased and the bias is greatest when there are not many "successes". Allison suggests an estimator proposed by David Firth, which uses penalized ML.

Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27-38.

I don't think King and Zeng disagree; they also propose an option to bias-correct the whole vector of coefficients. That bias correction ends up addressing the more general problem. In the Stata module for relogit (the version I found was dated 1999-10-28), it says "Relogit for Stata does not yet support the FIRTH option", but it does have an alternative weighting correction.
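As a rough illustration of the intercept adjustment mentioned above, here is a minimal Python sketch of the prior correction described by King and Zeng (2001): the fitted intercept is shifted by ln[((1 - tau)/tau) * (ybar/(1 - ybar))], where ybar is the fraction of ones in the sample and tau is the fraction of ones in the population. It assumes tau is known from outside the data. The function names and the numbers are hypothetical, chosen only for illustration, and this is not the relogit or Zelig code.

```python
import math


def inv_logit(eta):
    """Logistic function: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def prior_corrected_intercept(b0_hat, ybar, tau):
    """Shift a fitted logit intercept using the King & Zeng prior correction.

    b0_hat: intercept from an ordinary logit fit
    ybar:   fraction of ones in the sample
    tau:    assumed (externally known) fraction of ones in the population
    """
    if not (0.0 < ybar < 1.0 and 0.0 < tau < 1.0):
        raise ValueError("ybar and tau must lie strictly between 0 and 1")
    return b0_hat - math.log(((1.0 - tau) / tau) * (ybar / (1.0 - ybar)))


# Hypothetical case: the sample was balanced (half ones) by design,
# but events make up only 1% of the population.
b0_hat = 0.0
ybar = 0.5
tau = 0.01
b0 = prior_corrected_intercept(b0_hat, ybar, tau)
baseline_probability = inv_logit(b0)
```

Note that the shift is zero whenever ybar equals tau, so on a simple random sample this particular adjustment changes nothing; it matters when the ones were over-sampled.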
While fiddling around to see if I could implement that, I learned it has been done in R:

logistf: Firth's bias reduced logistic regression
http://cran.r-project.org/web/packages/logistf/index.html

That is often discussed as a solution to the problem of separation, as on the UCLA stats website (http://www.ats.ucla.edu/stat/mult_pkg/faq/general/complete_separation_logit_models.htm):

Heinze, G., & Schemper, M. (2002). A solution to the problem of separation in logistic regression. Statistics in Medicine, 21, 2409-2419.

But it is a two-fer, so far as I can tell. We get bias correction and separation-proofness.

Heinze, G., & Puhr, R. (2010). Bias-reduced and separation-proof conditional logistic regression with small or sparse data sets. Statistics in Medicine, 29(7-8), 770-777. doi:10.1002/sim.3794

The part I don't understand (yet) is how the bias correction links to mixed models. And that's why I'm asking you. OK?

--
Paul E. Johnson
Professor, Political Science    Assoc. Director
1541 Lilac Lane, Room 504       Center for Research Methods
University of Kansas            University of Kansas
http://pj.freefaculty.org       http://quant.ku.edu
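To make the "two-fer" concrete, the following is a minimal pure-Python sketch of Firth-type penalized logistic regression, using the modified score described by Heinze and Schemper (2002): each residual y_i - pi_i is augmented by h_i * (0.5 - pi_i), where h_i is the i-th diagonal of the weighted hat matrix. This is an illustration under simplifying assumptions (plain Fisher-scoring steps, no step-halving, tiny dense matrices); it is not the code of the logistf package, and the names `solve` and `firth_logit` are made up for this example.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x


def firth_logit(X, y, max_iter=200, tol=1e-10):
    """Firth-penalized logit estimates via Fisher scoring on the modified score."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(max_iter):
        pi = [1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))
              for row in X]
        w = [q * (1.0 - q) for q in pi]
        # Fisher information X'WX
        info = [[sum(wi * row[j] * row[k] for wi, row in zip(w, X))
                 for k in range(p)] for j in range(p)]
        # Hat-matrix diagonals: h_i = w_i * x_i' (X'WX)^{-1} x_i
        h = [wi * sum(u * v for u, v in zip(row, solve(info, row)))
             for wi, row in zip(w, X)]
        # Modified score: sum_i (y_i - pi_i + h_i * (0.5 - pi_i)) * x_i
        score = [sum((yi - q + hi * (0.5 - q)) * row[j]
                     for yi, q, hi, row in zip(y, pi, h, X))
                 for j in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


# Perfectly separated toy data: x = 0 always fails, x = 1 always succeeds.
# Ordinary ML estimates diverge here; the penalized estimates stay finite.
X_sep = [[1.0, 0.0]] * 3 + [[1.0, 1.0]] * 3
y_sep = [0, 0, 0, 1, 1, 1]
beta_sep = firth_logit(X_sep, y_sep)
```

For a saturated model like this toy example, the penalized fit coincides with adding half a success and half a failure to each group, which gives a convenient check on the sketch: the fitted probabilities should be 0.5/4 and 3.5/4.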
New Variant of Same Question: bias corrected logit estimates
3 messages · Paul Johnson, Ben Bolker, David Atkins
Paul Johnson <pauljohn32 at ...> writes:
Dear R-sig-mixed: I was struck today by the way the Internet has accelerated research. At one time, it might have taken a month or two to track down the articles on this problem and conclude that I need to ask for advice. Now, however, I realize the need within hours. Recall that the question that started us debating a few days ago was a logistic regression in which the OP noticed the mismatch between the predicted probability of success and the observed fraction. We were debating that, and it had completely slipped my mind that there is a separate literature on exactly that kind of problem. Yesterday, somebody else asked me to estimate a logit model in which there were more than 40000 cases but only a few hundred "successes". That's what reminded me of the "rare events" problem and logistic regression parameter estimate bias. And I think that's the issue that we need to clear up with glmer. What do you think? Since a multilevel model can be seen as a penalized ML estimation (à la Pinheiro and Bates, or as explained in Simon Wood, Generalized Additive Models), are we able to get a bias-corrected variant?
I don't really know the answer to the full question, but I would venture this:

* There is no explicit bias-reduction capacity built into the fixed-effects estimation component of glmer.
* I'm aware of Firth's algorithm and have used the R implementations, but haven't read the paper and don't know the details.
* glmer does handle some of the typical problems with 'rare events' by doing shrinkage across random effects, but if the events are rare in the *entire* data set (and not just in individual/small/undersampled regions), I don't think that will help.
* Vince Dorie and Andrew Gelman's blme package, or Jarrod Hadfield's MCMCglmm package, could be used with more or less informative priors to achieve a degree of shrinkage.

I don't know whether there's a clever way to adapt glmer itself to do shrinkage/bias correction on a single sample. Hopefully others with more knowledge will chime in.
Paul--

I should state upfront that I didn't read the previous thread closely, but I *thought* that the primary issue related to conditional vs. marginal effects -- where GLMMs (with non-identity link functions) yield conditional fixed effects (i.e., they do not 'average over' the random effects, but are conditional on particular values of the random effects). This shows up periodically on the listserv, e.g.,

https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015736.html

Though perhaps your point below was in the later traffic in that thread (and if so, please disregard!).

cheers, Dave
Dave Atkins, PhD
Department of Psychiatry and Behavioral Science
University of Washington
datkins at u.washington.edu
206-616-3879
http://depts.washington.edu/cshrb/

"We are drowning in information and starving for knowledge."
Rutherford Roger
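To illustrate the conditional-versus-marginal distinction raised in Dave's reply, here is a minimal Python sketch for a random-intercept logit model. It compares the probability implied by the fixed intercept alone (random effect set to zero) with the probability averaged over a Normal(0, sigma^2) random intercept, computed by a simple trapezoid rule. The values of `beta0` and `sigma` are hypothetical, and the function names are invented for this example; nothing here is lme4 code.

```python
import math


def inv_logit(eta):
    """Logistic function: map a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def marginal_probability(beta0, sigma, n_grid=2001, width=8.0):
    """Average inv_logit(beta0 + b) over b ~ Normal(0, sigma^2).

    Uses the trapezoid rule on a grid spanning +/- width standard deviations.
    """
    if sigma == 0.0:
        return inv_logit(beta0)
    lo, hi = -width * sigma, width * sigma
    step = (hi - lo) / (n_grid - 1)
    total = 0.0
    for i in range(n_grid):
        b = lo + i * step
        dens = math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        weight = 0.5 if i in (0, n_grid - 1) else 1.0  # trapezoid end points
        total += weight * inv_logit(beta0 + b) * dens * step
    return total


# Hypothetical rare-event setting: low baseline, sizeable group variation.
beta0, sigma = -3.0, 1.5
conditional = inv_logit(beta0)                 # probability for a "typical" group (b = 0)
marginal = marginal_probability(beta0, sigma)  # probability averaged over groups
```

With a negative intercept and a nonzero random-effect variance, the averaged probability sits closer to 0.5 than the conditional one, so comparing `inv_logit(beta0)` with an observed overall fraction can show a mismatch even when nothing is wrong with the fit.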