Model specification/family for a continuous/proportional, response with many zeros

Wed, May 19, 2021 4:38 AM

Dear Alain,

Thank you for the suggestion. I tried simulating the 0's as you suggested
in the following way. The output is reassuring - with the actual number of
zeros in the middle of the distribution.

sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
length(which(x==0))/length(x)}), use.names = FALSE)
hist(sim_zeros, breaks = c(100))
abline(v = plogis(-1.1804), col = "red")

All the best,
Mike


On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via

R-sig-mixed-models <r-sig-mixed-models at r-project.org> wrote:

Today's Topics:

    1. Re:  Model specification/family for a continuous/proportional
       response with many zeros (Michael Lawson)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 May 2021 09:31:46 +0100
From: Michael Lawson <mrml500 at york.ac.uk>
To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME]  Model specification/family for a
      continuous/proportional response with many zeros
Message-ID:
      <

CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw at mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

Hi Thierry,

I understood after your previous message that a high dispersion parameter
in beta models does not signify overdispersion. Both answers addressed

this

misconception in those two links I provided.

I suppose the real answer I am now looking for is -  how do I assess the
validity and fit of the model? Are there specifics or assumptions I must
take into account with zero-inflated beta glmms?

Just simulate 1000 data sets from your model, and see whether the
simulated data sets are comparable to your original data. One of the
things you can look at is whether the simulated data sets contain
similar number of zeros as in the observed data. I'm not sure whether
the DHARMA package can do this for you. If not..it is easy to program.

See also Figure 8 in (sorry for self-citing):


https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577

Alain

Thanks for your help,
Mike

On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <thierry.onkelinx at inbo.be

wrote:

Dear Mike,

I think you misread my reply. I never stated that there's something

wrong

with the model. The only "problem" I highlighted was your misconception
about the "high overdispersion". In this case, a high parameter value
indicates a low variance, which is what we mostly want to see.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE

AND

FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to

say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of

data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 at york.ac.uk>:

Dear Thierry,

Thanks for the help. So if the dispersion parameter in this model

doesn't

fit with the beta distribution, are there any alternative approaches I

can

use?

I can't seem to find much information on this elsewhere other than

these

two threads:
https://stats.stackexchange.com/a/451453/233414
https://stats.stackexchange.com/a/466951/233414

All the best,
Mike

On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <

thierry.onkelinx at inbo.be>

wrote:

Dear Mike,

The zero-inflation is specified on the logit scale. plogis(-1.18) =
0.235 23.5% zero seems reasonable when reading your story. (Didn't

look at

the data).

You need to look at the definition for the "over"dispersion parameter.
For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) / (\phi

+ 1)

(see ?glmmTMB::beta_family) Hence a large value of \phi implies a low
variance.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no

more

than asking him to perform a post-mortem examination: he may be able

to say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does

not

ensure that a reasonable answer can be extracted from a given body of

data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <mrml500 at york.ac.uk

Hi Thierry,

Thank you for your advice and speedy response.

Most of the data is closer to the lower bound (0). e.g. the mean time
for group A in zone A = 15.1 seconds and group A in zone B = 3.8

seconds.

However there are a very small number of outliers near the upper

bound, the

largest being 294 out of the 300 seconds (see the attached file if

you want

to look at the data).

I have taken a stab at running a Zero-inflated Beta GLMM using

glmmTMB

in R like so:

betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
                              family = beta_family(),
                              ziformula=~1,
                              data = glmm_zone_data)

summary(beta_mod)

*Family: beta  ( logit )*






*Formula:          prop_time ~ group * zone + (1 | id)Zero inflation:
           ~1Data: glmm_zone_data     AIC      BIC   logLik deviance
df.resid  -763.6   -736.3    388.8   -777.6      359Random
effects:Conditional model: Groups Name        Variance  Std.Dev. id
(Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
14Overdispersion parameter for beta family (): 13.1Conditional model:
             Estimate Std. Error z value Pr(>|z|)    (Intercept)
  -2.7685     0.1031 -26.844  < 2e-16 ***groupB             -0.4455
0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524

-2.741

0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
***---Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ?
1Zero-inflation model:            Estimate Std. Error z value

Pr(>|z|)

   (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif.

codes:  0

?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1*

Does this look like the correct way of specifying the model? I am a
little confused about specifying and interpreting the zero-inflation
component - I have only just begun reading about this.

I noticed that the dispersion parameter is quite high at 13.1. I'm

not

sure if this matters for beta models?. I tried running DHARMa
simulateResiduals on the model output and got significant deviations

in the

dispersion (<2.2e-16) and KS tests. e.g.

DHARMa::testDispersion(beta_mod)

*DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated*

*data:  simulationOutput*
*ratioObsSim = 1.3612, p-value < 2.2e-16*
*alternative hypothesis: two.sided*



Many thanks,
Mike

On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
thierry.onkelinx at inbo.be> wrote:

Dear Michael,

Your data has bounds (lower bound at 0 and upper bound at 300) and

you

have a lot of data close to a boundary. In such a case, a continuous
distribution which ignores those bound is not a good idea. If the

time

spent outside of both zones is limited, then a long time in zone A

excludes

a long time in zone B by definition. Then I'd look towards a

multinomial

distribution. If the time spent outside both zones is dominant,

then you

can use a zero-inflated beta as you suggested. A zero-inflated

gamma might

be OK if the data is not too close to the upper boundary. If you are
considering zero-inflated beta vs zero-inflated gamma, then you

should

choose zero-inflated beta IMHO.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality

Assurance

thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may

be able

to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given

body of

data. ~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
R-sig-mixed-models <r-sig-mixed-models at r-project.org>:

Hello,

I am new to GLMMs and have a dataset where I have two distinct

groups

(A
and B) of 7 individuals each. The data consists of repeated
measurements of
each individual where the amount of time spent at either zone_A or
zone_B
is recorded (out of a total time of 300s/observation period). For
most of
the time period the individuals are in neither zone.

I want to test if group A and group B spend more time in zone A
compared to
zone B (and vice versa).

Speaking to someone else, they said I should use a Binomial GLMM

using

cbind. i.e.
cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).

However, the response variable is continuous (albeit with an upper
bound of
300 seconds per observation period), so I'm not sure if this is
appropriate?

Should I convert the response into a proportion and use something
like a
Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something

like:

prop_time ~ zone*group + (1|id)

The data is quite heavily right-skewed and contains a lot of 0's,

so

reading around it also looks like I may need to convert these into

zero-inflated/hurdle model?

Thank you for any suggestions,
Mike

         [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models mailing list
R-sig-mixed-models at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


------------------------------

End of R-sig-mixed-models Digest, Vol 173, Issue 19
***************************************************

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Model specification/family for a continuous/proportional, response with many zeros

Thread (11 messages)