Model specification/family for a continuous/proportional, response with many zeros
Dear Alain,
Thank you for the suggestion. I tried simulating the 0's as you suggested
in the following way. The output is reassuring - with the actual number of
zeros in the middle of the distribution.
sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
length(which(x==0))/length(x)}), use.names = FALSE)
hist(sim_zeros, breaks = c(100))
abline(v = plogis(-1.1804), col = "red")
All the best,
Mike
On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via
R-sig-mixed-models <r-sig-mixed-models at r-project.org> wrote:
Today's Topics:
1. Re: Model specification/family for a continuous/proportional
response with many zeros (Michael Lawson)
----------------------------------------------------------------------
Message: 1
Date: Wed, 19 May 2021 09:31:46 +0100
From: Michael Lawson <mrml500 at york.ac.uk>
To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME] Model specification/family for a
continuous/proportional response with many zeros
Message-ID:
<
CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw at mail.gmail.com>
Content-Type: text/plain; charset="utf-8" Hi Thierry, I understood after your previous message that a high dispersion parameter in beta models does not signify overdispersion. Both answers addressed
this
misconception in those two links I provided. I suppose the real answer I am now looking for is - how do I assess the validity and fit of the model? Are there specifics or assumptions I must take into account with zero-inflated beta glmms?
Just simulate 1000 data sets from your model, and see whether the simulated data sets are comparable to your original data. One of the things you can look at is whether the simulated data sets contain similar number of zeros as in the observed data. I'm not sure whether the DHARMA package can do this for you. If not..it is easy to program. See also Figure 8 in (sorry for self-citing): https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577 Alain
Thanks for your help, Mike On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <thierry.onkelinx at inbo.be wrote:
Dear Mike, I think you misread my reply. I never stated that there's something
wrong
with the model. The only "problem" I highlighted was your misconception about the "high overdispersion". In this case, a high parameter value indicates a low variance, which is what we mostly want to see. Best regards, ir. Thierry Onkelinx Statisticus / Statistician Vlaamse Overheid / Government of Flanders INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND
FOREST Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance thierry.onkelinx at inbo.be Havenlaan 88 bus 73, 1000 Brussel www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to
say
what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////
<https://www.inbo.be> Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 at york.ac.uk>:
Dear Thierry, Thanks for the help. So if the dispersion parameter in this model
doesn't
fit with the beta distribution, are there any alternative approaches I
can
use? I can't seem to find much information on this elsewhere other than
these
two threads: https://stats.stackexchange.com/a/451453/233414 https://stats.stackexchange.com/a/466951/233414 All the best, Mike On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <
thierry.onkelinx at inbo.be>
wrote:
Dear Mike, The zero-inflation is specified on the logit scale. plogis(-1.18) = 0.235 23.5% zero seems reasonable when reading your story. (Didn't
look at
the data). You need to look at the definition for the "over"dispersion parameter. For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) / (\phi
+ 1)
(see ?glmmTMB::beta_family) Hence a large value of \phi implies a low variance. Best regards, ir. Thierry Onkelinx Statisticus / Statistician Vlaamse Overheid / Government of Flanders INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND FOREST Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance thierry.onkelinx at inbo.be Havenlaan 88 bus 73, 1000 Brussel www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no
more
than asking him to perform a post-mortem examination: he may be able
to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does
not
ensure that a reasonable answer can be extracted from a given body of
data.
~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////
<https://www.inbo.be> Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <mrml500 at york.ac.uk
:
Hi Thierry, Thank you for your advice and speedy response. Most of the data is closer to the lower bound (0). e.g. the mean time for group A in zone A = 15.1 seconds and group A in zone B = 3.8
seconds.
However there are a very small number of outliers near the upper
bound, the
largest being 294 out of the 300 seconds (see the attached file if
you want
to look at the data). I have taken a stab at running a Zero-inflated Beta GLMM using
glmmTMB
in R like so:
betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
family = beta_family(),
ziformula=~1,
data = glmm_zone_data)
summary(beta_mod)
*Family: beta ( logit )*
*Formula: prop_time ~ group * zone + (1 | id)Zero inflation:
~1Data: glmm_zone_data AIC BIC logLik deviance
df.resid -763.6 -736.3 388.8 -777.6 359Random
effects:Conditional model: Groups Name Variance Std.Dev. id
(Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups: id,
14Overdispersion parameter for beta family (): 13.1Conditional model:
Estimate Std. Error z value Pr(>|z|) (Intercept)
-2.7685 0.1031 -26.844 < 2e-16 ***groupB -0.4455
0.1498 -2.975 0.002932 **zonezone_B -0.4179 0.1524
-2.741
0.006124 **groupB:zonezone_B 0.8443 0.2190 3.855 0.000116 ***---Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1Zero-inflation model: Estimate Std. Error z value
Pr(>|z|)
(Intercept) -1.1804 0.1233 -9.575 <2e-16 ***---Signif.
codes: 0
?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1* Does this look like the correct way of specifying the model? I am a little confused about specifying and interpreting the zero-inflation component - I have only just begun reading about this. I noticed that the dispersion parameter is quite high at 13.1. I'm
not
sure if this matters for beta models?. I tried running DHARMa simulateResiduals on the model output and got significant deviations
in the
dispersion (<2.2e-16) and KS tests. e.g. DHARMa::testDispersion(beta_mod) *DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated* *data: simulationOutput* *ratioObsSim = 1.3612, p-value < 2.2e-16* *alternative hypothesis: two.sided* Many thanks, Mike On Mon, 17 May 2021 at 13:22, Thierry Onkelinx < thierry.onkelinx at inbo.be> wrote:
Dear Michael, Your data has bounds (lower bound at 0 and upper bound at 300) and
you
have a lot of data close to a boundary. In such a case, a continuous distribution which ignores those bound is not a good idea. If the
time
spent outside of both zones is limited, then a long time in zone A
excludes
a long time in zone B by definition. Then I'd look towards a
multinomial
distribution. If the time spent outside both zones is dominant,
then you
can use a zero-inflated beta as you suggested. A zero-inflated
gamma might
be OK if the data is not too close to the upper boundary. If you are considering zero-inflated beta vs zero-inflated gamma, then you
should
choose zero-inflated beta IMHO. Best regards, ir. Thierry Onkelinx Statisticus / Statistician Vlaamse Overheid / Government of Flanders INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
NATURE
AND FOREST Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality
Assurance
thierry.onkelinx at inbo.be Havenlaan 88 bus 73, 1000 Brussel www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may
be able
to say what the experiment died of. ~ Sir Ronald Aylmer Fisher The plural of anecdote is not data. ~ Roger Brinner The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given
body of
data. ~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////
<https://www.inbo.be> Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via R-sig-mixed-models <r-sig-mixed-models at r-project.org>:
Hello, I am new to GLMMs and have a dataset where I have two distinct
groups
(A and B) of 7 individuals each. The data consists of repeated measurements of each individual where the amount of time spent at either zone_A or zone_B is recorded (out of a total time of 300s/observation period). For most of the time period the individuals are in neither zone. I want to test if group A and group B spend more time in zone A compared to zone B (and vice versa). Speaking to someone else, they said I should use a Binomial GLMM
using
cbind. i.e. cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id). However, the response variable is continuous (albeit with an upper bound of 300 seconds per observation period), so I'm not sure if this is appropriate? Should I convert the response into a proportion and use something like a Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something
like:
prop_time ~ zone*group + (1|id) The data is quite heavily right-skewed and contains a lot of 0's,
so
reading around it also looks like I may need to convert these into
a
zero-inflated/hurdle model?
Thank you for any suggestions,
Mike
[[alternative HTML version deleted]]
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
[[alternative HTML version deleted]] ------------------------------ Subject: Digest Footer
_______________________________________________ R-sig-mixed-models mailing list R-sig-mixed-models at r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models ------------------------------ End of R-sig-mixed-models Digest, Vol 173, Issue 19 ***************************************************
-- Dr. Alain F. Zuur Highland Statistics Ltd. 9 St Clair Wynd AB41 6DZ Newburgh, UK Email: highstat at highstat.com URL: www.highstat.com
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models