Efficient mixed logistic regression with 500k individuals
The problem with random forests is that they don't respect the hierarchical structure of the data, which may or may not matter depending on the OP's goals. That's on top of the differences between random forests and logistic regression even in a non-hierarchical/multilevel context.

Also, I think the spurious/unstable-relationships bit requires some qualification. Yes, if you're looking at p-values, then with that much data you'll typically be able to detect trivially small effects. But the solution is then not to focus on p-values. (I'm not saying random forests and the like aren't useful -- quite the contrary. But the motivations given here are a bit of a red herring.)

Phillip
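[To make the p-value point concrete, here is a toy simulation; all numbers are invented for illustration. With N in the millions, an effect that is practically negligible still gets a vanishingly small p-value, so the estimated effect size, not significance, is what should guide interpretation.]

    # With very large N, even a trivial effect is "statistically significant".
    set.seed(1)
    n <- 2e6
    x <- rnorm(n)
    p <- plogis(-1 + 0.01 * x)   # true log-odds effect of 0.01: negligible
    y <- rbinom(n, 1, p)
    fit <- glm(y ~ x, family = binomial)
    summary(fit)$coefficients    # tiny p-value, but odds ratio exp(0.01) ~ 1.01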
On 26/12/20 7:14 am, sree datta wrote:
With such a large dataset, I would recommend exploring interactions among variables using ensemble methods such as Random Forests or Extreme Gradient Boosting (since you have a binary dependent variable). These models also help protect against the bias of finding a lot of spurious and unstable relationships (in both main effects and interaction effects) that can come with such a large N.

In terms of processing efficiency, have you tried the *parallel* package in R? I would also suggest the *foreach* and *doParallel* packages to improve processing speed. For a more detailed description of parallelism implemented in R, see this article (a good summary of packages): https://www.jigsawacademy.com/handling-big-data-using-r/
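[For what it's worth, the basic foreach/doParallel pattern looks like this: a minimal sketch with toy data, where the bootstrap task inside the loop is just a placeholder for whatever you want to parallelise.]

    library(doParallel)                # also attaches foreach and parallel

    # toy data standing in for the real data set
    dat <- data.frame(x = rnorm(1e4))
    dat$y <- rbinom(1e4, 1, plogis(dat$x))

    cl <- makeCluster(parallel::detectCores() - 1)
    registerDoParallel(cl)

    # e.g. refit a model on bootstrap resamples, one resample per task
    boot_coefs <- foreach(i = 1:100, .combine = rbind) %dopar% {
      idx <- sample(nrow(dat), replace = TRUE)
      coef(glm(y ~ x, family = binomial, data = dat[idx, ]))
    }

    stopCluster(cl)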
On Wed, Dec 23, 2020 at 8:20 PM Mitchell Maltenfort <mmalten at gmail.com> wrote:

Here's a fun one for you (I hope). I'm mucking about with a logistic regression that may have about 30 million records for half a million individuals. Yes, I have a large-RAM machine: 64 GB. And I've used nAGQ = 0 and the other recommendations from http://angrystatistician.blogspot.com/2015/10/mixed-models-in-r-bigger-faster-stronger.html?m=1 which should be reasonable for data this large. It works, but I'd still be interested in tweaks to improve speed or accuracy. Any ideas?

--
Sent from Gmail Mobile
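[For reference, the kind of glmer() call those recommendations amount to looks roughly like this. A sketch only: the formula and data are placeholders, and the nloptwrap/calc.derivs settings are common lme4 speed tweaks rather than anything from the OP's actual code.]

    library(lme4)

    # nAGQ = 0 uses a cheaper approximation (fixed effects are estimated
    # inside the penalized least-squares step): faster, though slightly
    # less accurate than the default nAGQ = 1.
    fit <- glmer(
      outcome ~ predictor + (1 | individual),    # placeholder formula
      data = dat, family = binomial, nAGQ = 0,   # 'dat': your 30M-row data
      control = glmerControl(optimizer = "nloptwrap",  # lme4's nloptr wrapper
                             calc.derivs = FALSE)      # skip slow derivative check
    )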
_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models