proportion data with many zeros
Liz writes:
Hi Valerie, The best advice I was ever given with regard to distributions was to choose the one with the best fit, i.e. no pattern in the residuals. There are two things to think about when fitting a GLM. The first is the type of data you've collected (binomial, counts, etc.), so that you can get an idea of which link will linearise your model correctly and return realistic results (non-negative, etc.). The second is the mean-variance relationship. This is what will generally show up in the residuals. Gaussian assumes no relationship (constant variance), but most proportion/abundance measures will have a variance which varies in some way with the mean. Try plotting your means against your variances and have a look at the shape of the distribution of your raw data. Then experiment with some suitable exponential family distributions and see which residuals have no pattern.
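A minimal R sketch of the mean-variance check described above. The thread shares no data, so the counts here are simulated and every name and value (`time`, `mu`, `counts`, the group means) is an assumption made for illustration only.

```r
# Hypothetical data: 4 sampling sessions, 15 samples each, counts of one
# pollen type. All values are simulated, not taken from the thread.
set.seed(1)
time   <- factor(rep(paste0("t", 1:4), each = 15))
mu     <- c(t1 = 2, t2 = 8, t3 = 30, t4 = 90)
counts <- rnbinom(length(time), mu = mu[as.character(time)], size = 1.5)

# Mean and variance of the counts within each sampling session
grp_mean <- tapply(counts, time, mean)
grp_var  <- tapply(counts, time, var)

# On the log-log scale the slope is a rough guide to the variance power:
# about 0 = constant variance (Gaussian), about 1 = Poisson-like,
# about 2 = quadratic (gamma-like). Groups with zero mean or zero variance
# would have to be dropped before taking logs.
mv_fit <- lm(log(grp_var) ~ log(grp_mean))
slope  <- unname(coef(mv_fit)[2])

plot(log(grp_mean), log(grp_var),
     xlab = "log(group mean)", ylab = "log(group variance)")
abline(mv_fit, lty = 2)
print(round(slope, 2))
```

With only four groups the slope is a coarse indication at best, but it can help narrow down which families are worth trying.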
That's a good point. My variance is obviously not constant, because for some plants no pollen was collected at some time points and I get zero variance, whereas at other time points the variance is quite large.
I think you're correct in not modelling the zeroes as a hurdle, as they are not 'unknowns'. Proportion data is very tricky - I've been grappling with percent cover data for a while. Tweedie worked well for me for measures where cover values were mid to low, but not well when they were close to 100%. If I were you, I'd consider changing the way you use the data to make it simpler. Perhaps just analyse each type of pollen individually over the time periods. I assume each time period is the same for the samples, and I think n=300 for each of the samples taken?
Yes, I may be better off using the raw counts. I analyse each type of pollen individually anyway. For some pollen types which are regularly sampled, the quasipoisson model works well, but I run into problems with pollen types that are rarely sampled, or not sampled at all at some points in time. Is there a way to account for differences in mean-variance relationships in quasipoisson or negative binomial data? When I run my models with quasipoisson, the summary shows no significant effects at all, but when I apply the F test (as suggested in Zuur), I get a highly significant outcome.
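One situation that can produce this kind of disagreement, offered as a possibility and not as a diagnosis of the actual data: if the reference level of `time` contains only zeros, the intercept drifts towards minus infinity on the log scale, the standard errors of all coefficients become huge, and every per-coefficient test in `summary()` looks non-significant, while the F test on the deviance still detects the difference between sessions. The sketch below reproduces that with simulated counts; all names and values are hypothetical.

```r
# Hypothetical counts: the first sampling session (the reference level)
# contains zeros only; the other sessions are overdispersed counts.
set.seed(2)
time   <- factor(rep(paste0("t", 1:4), each = 12))
counts <- c(rep(0, 12),
            rnbinom(36, mu = rep(c(5, 20, 60), each = 12), size = 2))

# R may warn about convergence here, because the intercept has no finite
# estimate when the reference group is all zeros.
fit <- glm(counts ~ time, family = quasipoisson)

# Per-coefficient Wald tests: each contrast is against the all-zero session
wald_p <- coef(summary(fit))[, "Pr(>|t|)"]

# F test for the time factor as a whole, based on the change in deviance
f_p <- anova(fit, test = "F")[2, "Pr(>F)"]

print(round(wald_p, 3))
print(signif(f_p, 3))
```

If this is what is going on, choosing a session with non-zero counts as the reference (for example with `relevel()`) makes the remaining contrasts interpretable, although the contrast involving the all-zero session stays unstable.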
So why not just try a quasi-Poisson (or negative binomial) and a Tweedie GLM for each type of pollen separately vs time and see which has better residuals?
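A rough sketch of such a comparison for the quasi-Poisson and negative binomial fits, on simulated counts (the data, the object names and the choice of Pearson residuals are assumptions for illustration). `glm.nb` comes from the MASS package, which ships with standard R installations.

```r
library(MASS)  # provides glm.nb

# Hypothetical overdispersed counts for one pollen type over 4 sessions
set.seed(3)
time   <- factor(rep(paste0("t", 1:4), each = 15))
mu     <- c(t1 = 2, t2 = 5, t3 = 15, t4 = 40)
counts <- rnbinom(length(time), mu = mu[as.character(time)], size = 1.5)

fit_qp <- glm(counts ~ time, family = quasipoisson)
fit_nb <- glm.nb(counts ~ time)

# Pearson residuals are not divided by the dispersion estimate, so for the
# quasi-Poisson fit look at the pattern across sessions, not the absolute size.
res_qp <- residuals(fit_qp, type = "pearson")
res_nb <- residuals(fit_nb, type = "pearson")

# Residual spread per session: roughly equal spreads suggest the assumed
# mean-variance relationship is adequate; a trend suggests it is not.
spread_qp <- tapply(res_qp, time, sd)
spread_nb <- tapply(res_nb, time, sd)
print(round(rbind(quasipoisson = spread_qp, negbin = spread_nb), 2))

op <- par(mfrow = c(1, 2))
boxplot(res_qp ~ time, main = "quasi-Poisson", ylab = "Pearson residual")
boxplot(res_nb ~ time, main = "negative binomial", ylab = "Pearson residual")
par(op)
```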
It's much easier to treat these as counts - and no need to do proportions if the n is the same for all.
Then you can get a significance value for the abundance of each pollen type with each bee at each time period. It is really the same as finding out the relative proportions.
The tweedie package in R works pretty much the same as any GLM. You just need a little bit of code (in the help files) to estimate the power (index) parameter for each set of values. It should lie between 1 and 2. If not, your data is probably not suited.
I will try it. Thank you very much. Best wishes, Valérie
Let me know if you need any more help. Liz
On 04/02/2013, at 2:10 AM, v_coudrain at voila.fr wrote:
Thank you Liz, I don't know Tweedie, I'll have a look at it, but I do indeed have some high values. I know about the problems linked to the arcsine transformation. I won't consider it anyway. I'd like to use either the raw values of pollen grain counts or a logistic quasibinomial model. Best, Valérie
Message of 02/02/13 at 20:47
From: "Liz Pryde"
To: "v_coudrain at voila.fr"
Cc: "Cade Brian", "r-sig-ecology at r-project.org"
Subject: Re: [R-sig-eco] proportion data with many zeros

Have you plotted the raw data to have a look at the distribution? You could try another exponential family distribution like Tweedie that has a mass at zero but is otherwise similar to Poisson/gamma, so you're directly modelling the zeroes. It won't work if you have a lot of high values though.
Proportions are tricky. Have a read of the Warton and Hui paper (2011), "The arcsine is asinine".
Liz

On 02/02/2013, at 6:34 PM, v_coudrain at voila.fr wrote:
Thank you very much for this suggestion. In fact I reconsidered my question and I am not sure that a zero-inflated model is what I need. If I understood it properly, a zero-inflated model is best suited when we don't know if zero values are true or false absences (right?). In my case all zero values are assumed to be real absences and are therefore informative. However, fitting quasipoisson on raw counts or quasibinomial on proportions gives me awful distributions of residuals and meaningless results. Valérie
Message of 01/02/13 at 17:22
From: "Cade, Brian"
To: v_coudrain at voila.fr
Cc: r-sig-ecology at r-project.org
Subject: Re: [R-sig-eco] proportion data with many zeros

For a fully parametric approach, you might want to use a zero-inflated beta distribution (e.g., as available in the gamlss package), which is designed for zero-inflated proportions. Or, for a semi-parametric approach, you could estimate a sequence of quantile regression estimates (e.g., in package quantreg), where some interval (hopefully not too large) of the quantiles will be uninformative because they are massed at the zero values.

Brian

Brian S. Cade, PhD
U. S. Geological Survey
Fort Collins Science Center
2150 Centre Ave., Bldg. C
Fort Collins, CO 80526-8818
email: brian_cade at usgs.gov
tel: 970 226-9326

On Fri, Feb 1, 2013 at 1:30 AM, wrote:
Dear all,

I am trying to test how the proportion of pollen of different plants found in the brood cells of a wild bee changes over time. I conducted 4 sampling sessions (thus time is a factor with 4 levels) and collected several pollen samples for each time point (300 pollen grains counted for each sample). I thought about applying a quasi-binomial glm:

    y = cbind(total pollen - pollen of plant X, pollen of plant X)
    glm(y~time, family=quasibinomial)

The problem is that I have a lot of zero values, because the pollen of some plants only occurred rarely or very clumped in time. I thought about applying a zero-inflated model, but I have never used one and I am not sure if it is suitable for proportion data. Additionally, I wondered if I have to consider the fact that I don't have the same number of pollen samples for each date, which makes my design unbalanced.

Thank you in advance for advice.
Best wishes
Valérie
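A runnable sketch of the quasi-binomial model described in this message, using simulated counts (`n_grains`, `p_true` and `x` are hypothetical names and values). One detail worth knowing: with a two-column response, R's `glm` treats the first column as successes and the second as failures, so `cbind(total - X, X)` models the proportion of pollen that is not from plant X. With the logit link this only flips the sign of the coefficients, as the sketch shows.

```r
# Hypothetical data: 300 grains counted per sample, 10 samples per session
set.seed(4)
n_grains <- 300
time   <- factor(rep(paste0("t", 1:4), each = 10))
p_true <- c(t1 = 0.02, t2 = 0.05, t3 = 0.20, t4 = 0.40)
x <- rbinom(length(time), size = n_grains, prob = p_true[as.character(time)])

# First column = grains of plant X (successes), second = all other grains
fit_x <- glm(cbind(x, n_grains - x) ~ time, family = quasibinomial)

# Column order as written in the message: models the proportion of non-X pollen
fit_notx <- glm(cbind(n_grains - x, x) ~ time, family = quasibinomial)

# Same fit, mirrored: coefficients differ only in sign
print(round(cbind(plant_X = coef(fit_x), not_X = coef(fit_notx)), 3))

# F test for the overall effect of time
print(anova(fit_x, test = "F"))
```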
_______________________________________________
R-sig-ecology mailing list
R-sig-ecology at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology