I am trying to analyse some data I have on the presence/absence of parasite infestation on small mammals using a GLMM. However, I have a severely unbalanced data set, in that I have a large number of 0's compared to 1's (i.e. 1333 0's and 86 1's). The response variable (presence/absence) is at the individual level, whereas all the explanatory variables (apart from sex) are at the site level. This means that a lot of the individuals have exactly the same combination of all explanatory variables, and when there are so many individuals with 0's it leaves very little power. When I reduce the model I find that I can remove a number of interaction terms without really affecting the AIC, which leads me to be slightly concerned. One option would be to analyse the data at the site level, i.e. parasite prevalence, rather than the probability of being infested. Any advice as to how to deal with this unbalanced data set would be very much appreciated.

Anna Renwick
Institute of Biological & Environment Sciences
University of Aberdeen
Zoology Building
Tillydrone Avenue
Aberdeen AB24 2TZ

The University of Aberdeen is a charity registered in Scotland, No SC013683.
Unbalanced presence/absence data
3 messages · Renwick, A. R., Andrew J Tyre, Ken Beath
Hi Anna,

if your covariates are at the site level, then I suggest reducing your sample to a pure binomial case: counts of individuals with and without parasites at each site. This is exactly the case where you will run into large amounts of overdispersion, because between-individual differences in susceptibility and exposure within sites lead to larger-than-binomial variation between sites. However, you can at least partially account for this by including a random effect of site in the model. This leads to the "normal-binomial" model discussed in earlier posts (how do you all find those earlier posts?).

hth,
Drew Tyre
School of Natural Resources
University of Nebraska-Lincoln
416 Hardin Hall, East Campus
3310 Holdrege Street
Lincoln, NE 68583-0974
phone: +1 402 472 4054
fax: +1 402 472 2946
email: atyre2 at unl.edu
http://snr.unl.edu/tyre

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
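As a rough illustration of the site-level reduction Drew describes, here is a minimal Python sketch that collapses individual 0/1 records into per-site counts and computes a crude Pearson dispersion statistic against a single pooled prevalence. The site labels and counts are made up purely for illustration (they are not Anna's data), and the function names are hypothetical. A value well above 1 is the kind of extra-binomial variation between sites that a site random effect is meant to absorb.

```python
from collections import defaultdict


def aggregate_by_site(records):
    """Collapse (site, infested) rows into {site: (n_infested, n_total)}."""
    counts = defaultdict(lambda: [0, 0])
    for site, infested in records:
        counts[site][0] += infested
        counts[site][1] += 1
    return {site: (k, n) for site, (k, n) in counts.items()}


def pearson_dispersion(site_counts):
    """Pearson chi-square / (sites - 1) under one pooled prevalence.

    This ignores covariates, so it is only a quick screen, not a model fit.
    """
    total_k = sum(k for k, _ in site_counts.values())
    total_n = sum(n for _, n in site_counts.values())
    p = total_k / total_n
    chi2 = sum((k - n * p) ** 2 / (n * p * (1 - p))
               for k, n in site_counts.values())
    return chi2 / (len(site_counts) - 1)


# Hypothetical individual-level records: (site label, 1 = infested, 0 = not).
records = (
    [("A", 0)] * 48 + [("A", 1)] * 2
    + [("B", 0)] * 30 + [("B", 1)] * 10
    + [("C", 0)] * 60
    + [("D", 0)] * 39 + [("D", 1)] * 1
)

site_counts = aggregate_by_site(records)
dispersion = pearson_dispersion(site_counts)

for site, (k, n) in sorted(site_counts.items()):
    print(f"site {site}: {k} infested of {n}")
print(f"Pearson dispersion: {dispersion:.2f}")
```

In R, the aggregated counts would typically go into lme4 along the lines of `glmer(cbind(infested, total - infested) ~ ... + (1 | site), family = binomial)`, where the column names are assumptions and the fixed effects are whatever site-level covariates are in the model.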
5 days later
On 04/02/2009, at 1:22 AM, Renwick, A. R. wrote:
> I am trying to analyse some data I have on the presence/absence of parasite infestation on small mammals using a GLMM. However, I have a severely unbalanced data set, in that I have a large number of 0's compared to 1's (i.e. 1333 0's and 86 1's). The response variable (presence/absence) is at the individual level, whereas all the explanatory variables (apart from sex) are at the site level. This means that a lot of the individuals have exactly the same combination of all explanatory variables, and when there are so many individuals with 0's it leaves very little power.
This shouldn't be a problem. What you may need is to use the nAGQ parameter to increase the number of quadrature points and so avoid any numerical problems. This is especially important if there is high correlation between individuals within a site. Also, "unbalanced" means something different from what you have.
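To see why the number of quadrature points can matter, here is a small pure-Python sketch of the integral that a random-intercept logistic model has to approximate for each site. It is only an illustration under stated assumptions: it uses plain (non-adaptive) Gauss-Hermite quadrature, whereas nAGQ in lme4's glmer controls adaptive Gauss-Hermite quadrature; the site size (20 animals, 1 infested) and the random-effect standard deviation (1.0) are hypothetical; only the intercept, the logit of the overall prevalence 86/1419, comes from the numbers in the original post. All function names are made up for this sketch.

```python
import math


def hermite_orthonormal(n, x):
    """Return (h_n(x), h_{n-1}(x)), Hermite polynomials orthonormal under exp(-x^2)."""
    prev, cur = 0.0, math.pi ** -0.25
    for j in range(n):
        prev, cur = cur, (x * math.sqrt(2.0 / (j + 1)) * cur
                          - math.sqrt(j / (j + 1)) * prev)
    return cur, prev


def gauss_hermite(n, step=1e-3):
    """Nodes/weights so that sum(w * f(x)) approximates the integral of exp(-x^2) f(x)."""
    def h(x):
        return hermite_orthonormal(n, x)[0]

    # Scan the positive axis for sign changes, then refine each root by bisection.
    positive_roots = []
    a, upper = step / 2, math.sqrt(2 * n + 1)
    fa = h(a)
    while a < upper:
        b = a + step
        fb = h(b)
        if fa * fb < 0:
            lo, hi, flo = a, b, fa
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                fmid = h(mid)
                if flo * fmid <= 0:
                    hi = mid
                else:
                    lo, flo = mid, fmid
            positive_roots.append(0.5 * (lo + hi))
        a, fa = b, fb

    # Roots are symmetric about zero; odd n has an extra root at exactly zero.
    nodes = [-r for r in reversed(positive_roots)]
    if n % 2 == 1:
        nodes.append(0.0)
    nodes += positive_roots
    weights = [1.0 / (n * hermite_orthonormal(n, x)[1] ** 2) for x in nodes]
    return nodes, weights


def marginal_likelihood(k, n, beta, sigma, n_points):
    """P(k infested of n) with logit(p) = beta + b and b ~ Normal(0, sigma^2)."""
    nodes, weights = gauss_hermite(n_points)
    total = 0.0
    for x, w in zip(nodes, weights):
        b = math.sqrt(2.0) * sigma * x  # change of variable for the normal density
        p = 1.0 / (1.0 + math.exp(-(beta + b)))
        total += w * math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
    return total / math.sqrt(math.pi)


beta = math.log(86 / 1333)  # logit of the overall prevalence 86/1419
for n_points in (1, 3, 5, 10, 25):
    value = marginal_likelihood(k=1, n=20, beta=beta, sigma=1.0, n_points=n_points)
    print(f"{n_points:2d} points: {value:.6f}")
```

With a single point the random effect is effectively evaluated only at zero, so the approximation can be noticeably off when the between-site variance is large; the values should settle as points are added, which is the behaviour Ken's suggestion relies on.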
> When I reduce the model I find that I can remove a number of interaction terms without really affecting the AIC, which leads me to be slightly concerned.
This most likely means the interactions are not significant.
> One option would be to analyse the data at the site level, i.e. parasite prevalence, rather than the probability of being infested.
While you can do this, it is throwing away information, possibly a lot of information.

Ken