Hi again,
I wanted to look a little more at the rest of the simulated values
(non-zeros), so I took the mean proportion of simulated values in set bins
of 0.01 and compared this to the real values.
*bins <- seq(from = 0, to = 1, by = 0.01)*
*sim_seq <- c()for(i in bins){ a <- mean(unlist(lapply(sim_beta_glmm,
function(x){ length(which(x>i & x <= i+0.01))/length(x)}), use.names =
FALSE)) sim_seq <- c(sim_seq,a)}real_seq <- c()for(i in bins){ a <-
print(mean(unlist(lapply(glmm_zone_data$prop_time, function(x){
length(which(x>i & x <= i+0.01))/length(x)}), use.names = FALSE)))
real_seq <- c(real_seq,a)}*
*barplot(rbind(sim_seq,real_seq),col=c("green","red"),beside = TRUE,
legend.text = c("simulated","real"))*
*cbind(bins,sim_seq,real_seq)*
*bins sim_seq real_seq [1,] 0.00 0.2199795081967 0.232240437
[2,] 0.01 0.1021612021858 0.166666667 [3,] 0.02 0.0750964480874
0.103825137 [4,] 0.03 0.0592128415301 0.073770492 [5,] 0.04
0.0478316939891 0.038251366 [6,] 0.05 0.0397267759563 0.030054645 [7,]
0.06 0.0331961748634 0.016393443 [8,] 0.07 0.0279322404372 0.013661202
[9,] 0.08 0.0238013661202 0.024590164 [10,] 0.09 0.0201101092896
0.010928962*
As you can see in the output above (and the plot attached), the real data
is quite a bit more right-skewed compared to the simulated values. Does
this look like a good enough fit or will I have to try a different model?
Many thanks,
Mike
On Wed, 19 May 2021 at 12:38, Michael Lawson <mrml500 at york.ac.uk> wrote:
Dear Alain,
Thank you for the suggestion. I tried simulating the 0's as you suggested
in the following way. The output is reassuring - with the actual number
zeros in the middle of the distribution.
sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
length(which(x==0))/length(x)}), use.names = FALSE)
hist(sim_zeros, breaks = c(100))
abline(v = plogis(-1.1804), col = "red")
All the best,
Mike
On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via
R-sig-mixed-models <r-sig-mixed-models at r-project.org> wrote:
Today's Topics:
1. Re: Model specification/family for a continuous/proportional
response with many zeros (Michael Lawson)
----------------------------------------------------------------------
Message: 1
Date: Wed, 19 May 2021 09:31:46 +0100
From: Michael Lawson <mrml500 at york.ac.uk>
To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME] Model specification/family for a
continuous/proportional response with many zeros
Message-ID:
<
CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"
Hi Thierry,
I understood after your previous message that a high dispersion
in beta models does not signify overdispersion. Both answers addressed
misconception in those two links I provided.
I suppose the real answer I am now looking for is - how do I assess
validity and fit of the model? Are there specifics or assumptions I
take into account with zero-inflated beta glmms?
Just simulate 1000 data sets from your model, and see whether the
simulated data sets are comparable to your original data. One of the
things you can look at is whether the simulated data sets contain
similar number of zeros as in the observed data. I'm not sure whether
the DHARMA package can do this for you. If not..it is easy to program.
See also Figure 8 in (sorry for self-citing):
Thanks for your help,
Mike
On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <
thierry.onkelinx at inbo.be>
Dear Mike,
I think you misread my reply. I never stated that there's something
with the model. The only "problem" I highlighted was your
about the "high overdispersion". In this case, a high parameter value
indicates a low variance, which is what we mostly want to see.
Best regards,
ir. Thierry Onkelinx
Statisticus / Statistician
Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no
than asking him to perform a post-mortem examination: he may be able
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
ensure that a reasonable answer can be extracted from a given body of
///////////////////////////////////////////////////////////////////////////////////////////
Dear Thierry,
Thanks for the help. So if the dispersion parameter in this model
fit with the beta distribution, are there any alternative approaches
use?
I can't seem to find much information on this elsewhere other than
thierry.onkelinx at inbo.be>
Dear Mike,
The zero-inflation is specified on the logit scale. plogis(-1.18) =
0.235 23.5% zero seems reasonable when reading your story. (Didn't
the data).
You need to look at the definition for the "over"dispersion
For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) /
(see ?glmmTMB::beta_family) Hence a large value of \phi implies a
variance.
Best regards,
ir. Thierry Onkelinx
Statisticus / Statistician
Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no
than asking him to perform a post-mortem examination: he may be
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer
ensure that a reasonable answer can be extracted from a given body
///////////////////////////////////////////////////////////////////////////////////////////
Hi Thierry,
Thank you for your advice and speedy response.
Most of the data is closer to the lower bound (0). e.g. the mean
for group A in zone A = 15.1 seconds and group A in zone B = 3.8
However there are a very small number of outliers near the upper
largest being 294 out of the 300 seconds (see the attached file if
to look at the data).
I have taken a stab at running a Zero-inflated Beta GLMM using
in R like so:
betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
family = beta_family(),
ziformula=~1,
data = glmm_zone_data)
summary(beta_mod)
*Family: beta ( logit )*
*Formula: prop_time ~ group * zone + (1 | id)Zero
~1Data: glmm_zone_data AIC BIC logLik
df.resid -763.6 -736.3 388.8 -777.6 359Random
effects:Conditional model: Groups Name Variance Std.Dev.
(Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups: id,
14Overdispersion parameter for beta family (): 13.1Conditional
Estimate Std. Error z value Pr(>|z|) (Intercept)
-2.7685 0.1031 -26.844 < 2e-16 ***groupB
0.1498 -2.975 0.002932 **zonezone_B -0.4179 0.1524
0.006124 **groupB:zonezone_B 0.8443 0.2190 3.855 0.000116
***---Signif. codes: 0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ?
1Zero-inflation model: Estimate Std. Error z value
(Intercept) -1.1804 0.1233 -9.575 <2e-16 ***---Signif.
?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1*
Does this look like the correct way of specifying the model? I am
little confused about specifying and interpreting the
component - I have only just begun reading about this.
I noticed that the dispersion parameter is quite high at 13.1. I'm
sure if this matters for beta models?. I tried running DHARMa
simulateResiduals on the model output and got significant
dispersion (<2.2e-16) and KS tests. e.g.
DHARMa::testDispersion(beta_mod)
*DHARMa nonparametric dispersion test via sd of residuals fitted
simulated*
*data: simulationOutput*
*ratioObsSim = 1.3612, p-value < 2.2e-16*
*alternative hypothesis: two.sided*
Many thanks,
Mike
On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
thierry.onkelinx at inbo.be> wrote:
Dear Michael,
Your data has bounds (lower bound at 0 and upper bound at 300)
have a lot of data close to a boundary. In such a case, a
distribution which ignores those bound is not a good idea. If the
spent outside of both zones is limited, then a long time in zone
a long time in zone B by definition. Then I'd look towards a
distribution. If the time spent outside both zones is dominant,
can use a zero-inflated beta as you suggested. A zero-inflated
be OK if the data is not too close to the upper boundary. If you
considering zero-inflated beta vs zero-inflated gamma, then you
choose zero-inflated beta IMHO.
Best regards,
ir. Thierry Onkelinx
Statisticus / Statistician
Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be
///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be
more than asking him to perform a post-mortem examination: he may
to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer
not ensure that a reasonable answer can be extracted from a given
///////////////////////////////////////////////////////////////////////////////////////////
<https://www.inbo.be>
Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
R-sig-mixed-models <r-sig-mixed-models at r-project.org>:
Hello,
I am new to GLMMs and have a dataset where I have two distinct
(A
and B) of 7 individuals each. The data consists of repeated
measurements of
each individual where the amount of time spent at either zone_A
zone_B
is recorded (out of a total time of 300s/observation period).
most of
the time period the individuals are in neither zone.
I want to test if group A and group B spend more time in zone A
compared to
zone B (and vice versa).
Speaking to someone else, they said I should use a Binomial GLMM
cbind. i.e.
cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).
However, the response variable is continuous (albeit with an
bound of
300 seconds per observation period), so I'm not sure if this is
appropriate?
Should I convert the response into a proportion and use
like a
Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something
prop_time ~ zone*group + (1|id)
The data is quite heavily right-skewed and contains a lot of
reading around it also looks like I may need to convert these
zero-inflated/hurdle model?
Thank you for any suggestions,
Mike
[[alternative HTML version deleted]]