Model specification/family for a continuous/proportional, response with many zeros - R-SIG-mixed-models

Wed, May 19, 2021 3:18 AM #

Just simulate 1000 data sets from your model, and see whether the 
simulated data sets are comparable to your original data. One of the 
things you can look at is whether the simulated data sets contain 
similar number of zeros as in the observed data. I'm not sure whether 
the DHARMA package can do this for you. If not..it is easy to program.

See also Figure 8 in (sorry for self-citing):

https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577

Alain

Thanks for your help,
Mike

On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <thierry.onkelinx at inbo.be>
wrote:

Dear Mike,

I think you misread my reply. I never stated that there's something wrong
with the model. The only "problem" I highlighted was your misconception
about the "high overdispersion". In this case, a high parameter value
indicates a low variance, which is what we mostly want to see.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be


///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 at york.ac.uk>:

Dear Thierry,

Thanks for the help. So if the dispersion parameter in this model doesn't
fit with the beta distribution, are there any alternative approaches I can
use?

I can't seem to find much information on this elsewhere other than these
two threads:
https://stats.stackexchange.com/a/451453/233414
https://stats.stackexchange.com/a/466951/233414

All the best,
Mike

On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <thierry.onkelinx at inbo.be>
wrote:

Dear Mike,

The zero-inflation is specified on the logit scale. plogis(-1.18) =
0.235 23.5% zero seems reasonable when reading your story. (Didn't look at
the data).

You need to look at the definition for the "over"dispersion parameter.
For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) / (\phi + 1)
(see ?glmmTMB::beta_family) Hence a large value of \phi implies a low
variance.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be


///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <mrml500 at york.ac.uk>:

Hi Thierry,

Thank you for your advice and speedy response.

Most of the data is closer to the lower bound (0). e.g. the mean time
for group A in zone A = 15.1 seconds and group A in zone B = 3.8 seconds.
However there are a very small number of outliers near the upper bound, the
largest being 294 out of the 300 seconds (see the attached file if you want
to look at the data).

I have taken a stab at running a Zero-inflated Beta GLMM using glmmTMB
in R like so:

betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
                              family = beta_family(),
                              ziformula=~1,
                              data = glmm_zone_data)

summary(beta_mod)

*Family: beta  ( logit )*






*Formula:          prop_time ~ group * zone + (1 | id)Zero inflation:
           ~1Data: glmm_zone_data     AIC      BIC   logLik deviance
df.resid  -763.6   -736.3    388.8   -777.6      359Random
effects:Conditional model: Groups Name        Variance  Std.Dev. id
(Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
14Overdispersion parameter for beta family (): 13.1Conditional model:
             Estimate Std. Error z value Pr(>|z|)    (Intercept)
  -2.7685     0.1031 -26.844  < 2e-16 ***groupB             -0.4455
0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524  -2.741
0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
***---Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ?
1Zero-inflation model:            Estimate Std. Error z value Pr(>|z|)
   (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif. codes:  0
?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1*

Does this look like the correct way of specifying the model? I am a
little confused about specifying and interpreting the zero-inflation
component - I have only just begun reading about this.

I noticed that the dispersion parameter is quite high at 13.1. I'm not
sure if this matters for beta models?. I tried running DHARMa
simulateResiduals on the model output and got significant deviations in the
dispersion (<2.2e-16) and KS tests. e.g.

DHARMa::testDispersion(beta_mod)

*DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated*

*data:  simulationOutput*
*ratioObsSim = 1.3612, p-value < 2.2e-16*
*alternative hypothesis: two.sided*



Many thanks,
Mike

On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
thierry.onkelinx at inbo.be> wrote:

Dear Michael,

Your data has bounds (lower bound at 0 and upper bound at 300) and you
have a lot of data close to a boundary. In such a case, a continuous
distribution which ignores those bound is not a good idea. If the time
spent outside of both zones is limited, then a long time in zone A excludes
a long time in zone B by definition. Then I'd look towards a multinomial
distribution. If the time spent outside both zones is dominant, then you
can use a zero-inflated beta as you suggested. A zero-inflated gamma might
be OK if the data is not too close to the upper boundary. If you are
considering zero-inflated beta vs zero-inflated gamma, then you should
choose zero-inflated beta IMHO.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may be able
to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body of
data. ~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>

Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
R-sig-mixed-models <r-sig-mixed-models at r-project.org>:

Hello,

I am new to GLMMs and have a dataset where I have two distinct groups
(A
and B) of 7 individuals each. The data consists of repeated
measurements of
each individual where the amount of time spent at either zone_A or
zone_B
is recorded (out of a total time of 300s/observation period). For
most of
the time period the individuals are in neither zone.

I want to test if group A and group B spend more time in zone A
compared to
zone B (and vice versa).

Speaking to someone else, they said I should use a Binomial GLMM using
cbind. i.e.
cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).

However, the response variable is continuous (albeit with an upper
bound of
300 seconds per observation period), so I'm not sure if this is
appropriate?

Should I convert the response into a proportion and use something
like a
Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something like:
prop_time ~ zone*group + (1|id)

The data is quite heavily right-skewed and contains a lot of 0's, so
reading around it also looks like I may need to convert these into a
zero-inflated/hurdle model?

Thank you for any suggestions,
Mike

         [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models mailing list
R-sig-mixed-models at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


------------------------------

End of R-sig-mixed-models Digest, Vol 173, Issue 19
***************************************************

Dr. Alain F. Zuur
Highland Statistics Ltd.
9 St Clair Wynd
AB41 6DZ Newburgh, UK
Email: highstat at highstat.com
URL:   www.highstat.com

Michael Lawson

Wed, May 19, 2021 4:38 AM #

Dear Alain,

Thank you for the suggestion. I tried simulating the 0's as you suggested
in the following way. The output is reassuring - with the actual number of
zeros in the middle of the distribution.

sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
length(which(x==0))/length(x)}), use.names = FALSE)
hist(sim_zeros, breaks = c(100))
abline(v = plogis(-1.1804), col = "red")

All the best,
Mike


On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via

R-sig-mixed-models <r-sig-mixed-models at r-project.org> wrote:

Today's Topics:

    1. Re:  Model specification/family for a continuous/proportional
       response with many zeros (Michael Lawson)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 May 2021 09:31:46 +0100
From: Michael Lawson <mrml500 at york.ac.uk>
To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME]  Model specification/family for a
      continuous/proportional response with many zeros
Message-ID:
      <

CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw at mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

Hi Thierry,

I understood after your previous message that a high dispersion parameter
in beta models does not signify overdispersion. Both answers addressed

this

misconception in those two links I provided.

I suppose the real answer I am now looking for is -  how do I assess the
validity and fit of the model? Are there specifics or assumptions I must
take into account with zero-inflated beta glmms?

Just simulate 1000 data sets from your model, and see whether the
simulated data sets are comparable to your original data. One of the
things you can look at is whether the simulated data sets contain
similar number of zeros as in the observed data. I'm not sure whether
the DHARMA package can do this for you. If not..it is easy to program.

See also Figure 8 in (sorry for self-citing):


https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577

Alain

Thanks for your help,
Mike

On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <thierry.onkelinx at inbo.be

wrote:

Dear Mike,

I think you misread my reply. I never stated that there's something

wrong

with the model. The only "problem" I highlighted was your misconception
about the "high overdispersion". In this case, a high parameter value
indicates a low variance, which is what we mostly want to see.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE

AND

FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to

say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of

data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 at york.ac.uk>:

Dear Thierry,

Thanks for the help. So if the dispersion parameter in this model

doesn't

fit with the beta distribution, are there any alternative approaches I

can

use?

I can't seem to find much information on this elsewhere other than

these

two threads:
https://stats.stackexchange.com/a/451453/233414
https://stats.stackexchange.com/a/466951/233414

All the best,
Mike

On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <

thierry.onkelinx at inbo.be>

wrote:

Dear Mike,

The zero-inflation is specified on the logit scale. plogis(-1.18) =
0.235 23.5% zero seems reasonable when reading your story. (Didn't

look at

the data).

You need to look at the definition for the "over"dispersion parameter.
For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) / (\phi

+ 1)

(see ?glmmTMB::beta_family) Hence a large value of \phi implies a low
variance.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no

more

than asking him to perform a post-mortem examination: he may be able

to say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does

not

ensure that a reasonable answer can be extracted from a given body of

data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <mrml500 at york.ac.uk

Hi Thierry,

Thank you for your advice and speedy response.

Most of the data is closer to the lower bound (0). e.g. the mean time
for group A in zone A = 15.1 seconds and group A in zone B = 3.8

seconds.

However there are a very small number of outliers near the upper

bound, the

largest being 294 out of the 300 seconds (see the attached file if

you want

to look at the data).

I have taken a stab at running a Zero-inflated Beta GLMM using

glmmTMB

in R like so:

betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
                              family = beta_family(),
                              ziformula=~1,
                              data = glmm_zone_data)

summary(beta_mod)

*Family: beta  ( logit )*






*Formula:          prop_time ~ group * zone + (1 | id)Zero inflation:
           ~1Data: glmm_zone_data     AIC      BIC   logLik deviance
df.resid  -763.6   -736.3    388.8   -777.6      359Random
effects:Conditional model: Groups Name        Variance  Std.Dev. id
(Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
14Overdispersion parameter for beta family (): 13.1Conditional model:
             Estimate Std. Error z value Pr(>|z|)    (Intercept)
  -2.7685     0.1031 -26.844  < 2e-16 ***groupB             -0.4455
0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524

-2.741

0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
***---Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ?
1Zero-inflation model:            Estimate Std. Error z value

Pr(>|z|)

   (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif.

codes:  0

?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1*

Does this look like the correct way of specifying the model? I am a
little confused about specifying and interpreting the zero-inflation
component - I have only just begun reading about this.

I noticed that the dispersion parameter is quite high at 13.1. I'm

not

sure if this matters for beta models?. I tried running DHARMa
simulateResiduals on the model output and got significant deviations

in the

dispersion (<2.2e-16) and KS tests. e.g.

DHARMa::testDispersion(beta_mod)

*DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated*

*data:  simulationOutput*
*ratioObsSim = 1.3612, p-value < 2.2e-16*
*alternative hypothesis: two.sided*



Many thanks,
Mike

On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
thierry.onkelinx at inbo.be> wrote:

Dear Michael,

Your data has bounds (lower bound at 0 and upper bound at 300) and

you

have a lot of data close to a boundary. In such a case, a continuous
distribution which ignores those bound is not a good idea. If the

time

spent outside of both zones is limited, then a long time in zone A

excludes

a long time in zone B by definition. Then I'd look towards a

multinomial

distribution. If the time spent outside both zones is dominant,

then you

can use a zero-inflated beta as you suggested. A zero-inflated

gamma might

be OK if the data is not too close to the upper boundary. If you are
considering zero-inflated beta vs zero-inflated gamma, then you

should

choose zero-inflated beta IMHO.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality

Assurance

thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may

be able

to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given

body of

data. ~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
R-sig-mixed-models <r-sig-mixed-models at r-project.org>:

Hello,

I am new to GLMMs and have a dataset where I have two distinct

groups

(A
and B) of 7 individuals each. The data consists of repeated
measurements of
each individual where the amount of time spent at either zone_A or
zone_B
is recorded (out of a total time of 300s/observation period). For
most of
the time period the individuals are in neither zone.

I want to test if group A and group B spend more time in zone A
compared to
zone B (and vice versa).

Speaking to someone else, they said I should use a Binomial GLMM

using

cbind. i.e.
cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).

However, the response variable is continuous (albeit with an upper
bound of
300 seconds per observation period), so I'm not sure if this is
appropriate?

Should I convert the response into a proportion and use something
like a
Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something

like:

prop_time ~ zone*group + (1|id)

The data is quite heavily right-skewed and contains a lot of 0's,

so

reading around it also looks like I may need to convert these into

zero-inflated/hurdle model?

Thank you for any suggestions,
Mike

         [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models mailing list
R-sig-mixed-models at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


------------------------------

End of R-sig-mixed-models Digest, Vol 173, Issue 19
***************************************************

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Michael Lawson

Thu, May 20, 2021 1:46 AM #

Hi again,

I wanted to look a little more at the rest of the simulated values
(non-zeros), so I took the mean proportion of simulated values in set bins
of 0.01 and compared this to the real values.


*bins <- seq(from = 0, to = 1, by = 0.01)*












*sim_seq <- c()for(i in bins){  a <- mean(unlist(lapply(sim_beta_glmm,
function(x){ length(which(x>i & x <= i+0.01))/length(x)}), use.names =
FALSE))  sim_seq <- c(sim_seq,a)}real_seq <- c()for(i in bins){  a <-
print(mean(unlist(lapply(glmm_zone_data$prop_time, function(x){
length(which(x>i & x <= i+0.01))/length(x)}), use.names = FALSE)))
real_seq <- c(real_seq,a)}*


*barplot(rbind(sim_seq,real_seq),col=c("green","red"),beside = TRUE,
legend.text = c("simulated","real"))*


*cbind(bins,sim_seq,real_seq)*











*bins         sim_seq    real_seq  [1,] 0.00 0.2199795081967 0.232240437
[2,] 0.01 0.1021612021858 0.166666667  [3,] 0.02 0.0750964480874
0.103825137  [4,] 0.03 0.0592128415301 0.073770492  [5,] 0.04
0.0478316939891 0.038251366  [6,] 0.05 0.0397267759563 0.030054645  [7,]
0.06 0.0331961748634 0.016393443  [8,] 0.07 0.0279322404372 0.013661202
[9,] 0.08 0.0238013661202 0.024590164 [10,] 0.09 0.0201101092896
0.010928962*

As you can see in the output above (and the plot attached), the real data
is quite a bit more right-skewed compared to the simulated values. Does
this look like a good enough fit or will I have to try a different model?

Many thanks,
Mike

On Wed, 19 May 2021 at 12:38, Michael Lawson <mrml500 at york.ac.uk> wrote:

Dear Alain,

Thank you for the suggestion. I tried simulating the 0's as you suggested
in the following way. The output is reassuring - with the actual number of
zeros in the middle of the distribution.

sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
length(which(x==0))/length(x)}), use.names = FALSE)
hist(sim_zeros, breaks = c(100))
abline(v = plogis(-1.1804), col = "red")

All the best,
Mike


On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via
R-sig-mixed-models <r-sig-mixed-models at r-project.org> wrote:

Today's Topics:

    1. Re:  Model specification/family for a continuous/proportional
       response with many zeros (Michael Lawson)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 May 2021 09:31:46 +0100
From: Michael Lawson <mrml500 at york.ac.uk>
To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME]  Model specification/family for a
      continuous/proportional response with many zeros
Message-ID:
      <

CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw at mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

Hi Thierry,

I understood after your previous message that a high dispersion

parameter

in beta models does not signify overdispersion. Both answers addressed

this

misconception in those two links I provided.

I suppose the real answer I am now looking for is -  how do I assess the
validity and fit of the model? Are there specifics or assumptions I must
take into account with zero-inflated beta glmms?

Just simulate 1000 data sets from your model, and see whether the
simulated data sets are comparable to your original data. One of the
things you can look at is whether the simulated data sets contain
similar number of zeros as in the observed data. I'm not sure whether
the DHARMA package can do this for you. If not..it is easy to program.

See also Figure 8 in (sorry for self-citing):


https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577

Alain

Thanks for your help,
Mike

On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <

thierry.onkelinx at inbo.be>

wrote:

Dear Mike,

I think you misread my reply. I never stated that there's something

wrong

with the model. The only "problem" I highlighted was your misconception
about the "high overdispersion". In this case, a high parameter value
indicates a low variance, which is what we mostly want to see.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE

AND

FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able

to say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does

not

ensure that a reasonable answer can be extracted from a given body of

data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 at york.ac.uk

Dear Thierry,

Thanks for the help. So if the dispersion parameter in this model

doesn't

fit with the beta distribution, are there any alternative approaches

I can

use?

I can't seem to find much information on this elsewhere other than

these

two threads:
https://stats.stackexchange.com/a/451453/233414
https://stats.stackexchange.com/a/466951/233414

All the best,
Mike

On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <

thierry.onkelinx at inbo.be>

wrote:

Dear Mike,

The zero-inflation is specified on the logit scale. plogis(-1.18) =
0.235 23.5% zero seems reasonable when reading your story. (Didn't

look at

the data).

You need to look at the definition for the "over"dispersion

parameter.

For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) /

(\phi + 1)

(see ?glmmTMB::beta_family) Hence a large value of \phi implies a low
variance.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no

more

than asking him to perform a post-mortem examination: he may be able

to say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does

not

ensure that a reasonable answer can be extracted from a given body

of data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <

mrml500 at york.ac.uk>:

Hi Thierry,

Thank you for your advice and speedy response.

Most of the data is closer to the lower bound (0). e.g. the mean

time

for group A in zone A = 15.1 seconds and group A in zone B = 3.8

seconds.

However there are a very small number of outliers near the upper

bound, the

largest being 294 out of the 300 seconds (see the attached file if

you want

to look at the data).

I have taken a stab at running a Zero-inflated Beta GLMM using

glmmTMB

in R like so:

betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
                              family = beta_family(),
                              ziformula=~1,
                              data = glmm_zone_data)

summary(beta_mod)

*Family: beta  ( logit )*






*Formula:          prop_time ~ group * zone + (1 | id)Zero

inflation:

           ~1Data: glmm_zone_data     AIC      BIC   logLik deviance
df.resid  -763.6   -736.3    388.8   -777.6      359Random
effects:Conditional model: Groups Name        Variance  Std.Dev. id
(Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
14Overdispersion parameter for beta family (): 13.1Conditional

model:

             Estimate Std. Error z value Pr(>|z|)    (Intercept)
  -2.7685     0.1031 -26.844  < 2e-16 ***groupB             -0.4455
0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524

-2.741

0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
***---Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ?
1Zero-inflation model:            Estimate Std. Error z value

Pr(>|z|)

   (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif.

codes:  0

?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1*

Does this look like the correct way of specifying the model? I am a
little confused about specifying and interpreting the zero-inflation
component - I have only just begun reading about this.

I noticed that the dispersion parameter is quite high at 13.1. I'm

not

sure if this matters for beta models?. I tried running DHARMa
simulateResiduals on the model output and got significant

deviations in the

dispersion (<2.2e-16) and KS tests. e.g.

DHARMa::testDispersion(beta_mod)

*DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated*

*data:  simulationOutput*
*ratioObsSim = 1.3612, p-value < 2.2e-16*
*alternative hypothesis: two.sided*



Many thanks,
Mike

On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
thierry.onkelinx at inbo.be> wrote:

Dear Michael,

Your data has bounds (lower bound at 0 and upper bound at 300) and

you

have a lot of data close to a boundary. In such a case, a

continuous

distribution which ignores those bound is not a good idea. If the

time

spent outside of both zones is limited, then a long time in zone A

excludes

a long time in zone B by definition. Then I'd look towards a

multinomial

distribution. If the time spent outside both zones is dominant,

then you

can use a zero-inflated beta as you suggested. A zero-inflated

gamma might

be OK if the data is not too close to the upper boundary. If you

are

considering zero-inflated beta vs zero-inflated gamma, then you

should

choose zero-inflated beta IMHO.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality

Assurance

thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may

be able

to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer

does

not ensure that a reasonable answer can be extracted from a given

body of

data. ~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
R-sig-mixed-models <r-sig-mixed-models at r-project.org>:

Hello,

I am new to GLMMs and have a dataset where I have two distinct

groups

(A
and B) of 7 individuals each. The data consists of repeated
measurements of
each individual where the amount of time spent at either zone_A or
zone_B
is recorded (out of a total time of 300s/observation period). For
most of
the time period the individuals are in neither zone.

I want to test if group A and group B spend more time in zone A
compared to
zone B (and vice versa).

Speaking to someone else, they said I should use a Binomial GLMM

using

cbind. i.e.
cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).

However, the response variable is continuous (albeit with an upper
bound of
300 seconds per observation period), so I'm not sure if this is
appropriate?

Should I convert the response into a proportion and use something
like a
Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something

like:

prop_time ~ zone*group + (1|id)

The data is quite heavily right-skewed and contains a lot of 0's,

so

reading around it also looks like I may need to convert these

into a

zero-inflated/hurdle model?

Thank you for any suggestions,
Mike

         [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models mailing list
R-sig-mixed-models at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


------------------------------

End of R-sig-mixed-models Digest, Vol 173, Issue 19
***************************************************

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sim_vs_real_zero_beta_glmm.png
Type: image/png
Size: 3157 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-sig-mixed-models/attachments/20210520/0645037f/attachment.png>

Thierry Onkelinx

Thu, May 20, 2021 2:08 AM #

Dear Mike,

Your code is completely unreadable. Did you send it in plain text? Sending
HTML email tends to mangle code.

I check the distribution by running a large number of simulations to get
the range of potential values. Then I divide the observed density (from the
data) by the expected density (from the simulations). See
https://inlatools.netlify.app/articles/distribution.html#distribution-checks-1
for an example. This is based on INLA models and doesn't handle beta
distributions (yet). But you can get an idea of how you can do this
yourself.

Best regards,

Thierry

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND
FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able to say
what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not
ensure that a reasonable answer can be extracted from a given body of data.
~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>

Op do 20 mei 2021 om 10:46 schreef Michael Lawson via R-sig-mixed-models <
r-sig-mixed-models at r-project.org>:

Hi again,

I wanted to look a little more at the rest of the simulated values
(non-zeros), so I took the mean proportion of simulated values in set bins
of 0.01 and compared this to the real values.


*bins <- seq(from = 0, to = 1, by = 0.01)*












*sim_seq <- c()for(i in bins){  a <- mean(unlist(lapply(sim_beta_glmm,
function(x){ length(which(x>i & x <= i+0.01))/length(x)}), use.names =
FALSE))  sim_seq <- c(sim_seq,a)}real_seq <- c()for(i in bins){  a <-
print(mean(unlist(lapply(glmm_zone_data$prop_time, function(x){
length(which(x>i & x <= i+0.01))/length(x)}), use.names = FALSE)))
real_seq <- c(real_seq,a)}*


*barplot(rbind(sim_seq,real_seq),col=c("green","red"),beside = TRUE,
legend.text = c("simulated","real"))*


*cbind(bins,sim_seq,real_seq)*











*bins         sim_seq    real_seq  [1,] 0.00 0.2199795081967 0.232240437
[2,] 0.01 0.1021612021858 0.166666667  [3,] 0.02 0.0750964480874
0.103825137  [4,] 0.03 0.0592128415301 0.073770492  [5,] 0.04
0.0478316939891 0.038251366  [6,] 0.05 0.0397267759563 0.030054645  [7,]
0.06 0.0331961748634 0.016393443  [8,] 0.07 0.0279322404372 0.013661202
[9,] 0.08 0.0238013661202 0.024590164 [10,] 0.09 0.0201101092896
0.010928962*

As you can see in the output above (and the plot attached), the real data
is quite a bit more right-skewed compared to the simulated values. Does
this look like a good enough fit or will I have to try a different model?

Many thanks,
Mike

On Wed, 19 May 2021 at 12:38, Michael Lawson <mrml500 at york.ac.uk> wrote:

Dear Alain,

Thank you for the suggestion. I tried simulating the 0's as you suggested
in the following way. The output is reassuring - with the actual number

of

zeros in the middle of the distribution.

sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
length(which(x==0))/length(x)}), use.names = FALSE)
hist(sim_zeros, breaks = c(100))
abline(v = plogis(-1.1804), col = "red")

All the best,
Mike


On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via
R-sig-mixed-models <r-sig-mixed-models at r-project.org> wrote:

Today's Topics:

    1. Re:  Model specification/family for a continuous/proportional
       response with many zeros (Michael Lawson)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 May 2021 09:31:46 +0100
From: Michael Lawson <mrml500 at york.ac.uk>
To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME]  Model specification/family for a
      continuous/proportional response with many zeros
Message-ID:
      <

CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw at mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

Hi Thierry,

I understood after your previous message that a high dispersion

parameter

in beta models does not signify overdispersion. Both answers addressed

this

misconception in those two links I provided.

I suppose the real answer I am now looking for is -  how do I assess

the

validity and fit of the model? Are there specifics or assumptions I

must

take into account with zero-inflated beta glmms?

Just simulate 1000 data sets from your model, and see whether the
simulated data sets are comparable to your original data. One of the
things you can look at is whether the simulated data sets contain
similar number of zeros as in the observed data. I'm not sure whether
the DHARMA package can do this for you. If not..it is easy to program.

See also Figure 8 in (sorry for self-citing):

https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577

Alain

Thanks for your help,
Mike

On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <

thierry.onkelinx at inbo.be>

wrote:

Dear Mike,

I think you misread my reply. I never stated that there's something

wrong

with the model. The only "problem" I highlighted was your

misconception

about the "high overdispersion". In this case, a high parameter value
indicates a low variance, which is what we mostly want to see.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND

FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no

more

than asking him to perform a post-mortem examination: he may be able

to say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does

not

ensure that a reasonable answer can be extracted from a given body of

data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op di 18 mei 2021 om 12:49 schreef Michael Lawson <

mrml500 at york.ac.uk

Dear Thierry,

Thanks for the help. So if the dispersion parameter in this model

doesn't

fit with the beta distribution, are there any alternative approaches

I can

use?

I can't seem to find much information on this elsewhere other than

these

two threads:
https://stats.stackexchange.com/a/451453/233414
https://stats.stackexchange.com/a/466951/233414

All the best,
Mike

On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <

thierry.onkelinx at inbo.be>

wrote:

Dear Mike,

The zero-inflation is specified on the logit scale. plogis(-1.18) =
0.235 23.5% zero seems reasonable when reading your story. (Didn't

look at

the data).

You need to look at the definition for the "over"dispersion

parameter.

For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) /

(\phi + 1)

(see ?glmmTMB::beta_family) Hence a large value of \phi implies a

low

variance.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality

Assurance

thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no

more

than asking him to perform a post-mortem examination: he may be

able

to say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer

does

not

ensure that a reasonable answer can be extracted from a given body

of data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <

mrml500 at york.ac.uk>:

Hi Thierry,

Thank you for your advice and speedy response.

Most of the data is closer to the lower bound (0). e.g. the mean

time

for group A in zone A = 15.1 seconds and group A in zone B = 3.8

seconds.

However there are a very small number of outliers near the upper

bound, the

largest being 294 out of the 300 seconds (see the attached file if

you want

to look at the data).

I have taken a stab at running a Zero-inflated Beta GLMM using

glmmTMB

in R like so:

betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
                              family = beta_family(),
                              ziformula=~1,
                              data = glmm_zone_data)

summary(beta_mod)

*Family: beta  ( logit )*






*Formula:          prop_time ~ group * zone + (1 | id)Zero

inflation:

           ~1Data: glmm_zone_data     AIC      BIC   logLik

deviance

df.resid  -763.6   -736.3    388.8   -777.6      359Random
effects:Conditional model: Groups Name        Variance  Std.Dev.

id

(Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
14Overdispersion parameter for beta family (): 13.1Conditional

model:

             Estimate Std. Error z value Pr(>|z|)    (Intercept)
  -2.7685     0.1031 -26.844  < 2e-16 ***groupB

 -0.4455

0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524

-2.741

0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
***---Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ?
1Zero-inflation model:            Estimate Std. Error z value

Pr(>|z|)

   (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif.

codes:  0

?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1*

Does this look like the correct way of specifying the model? I am

little confused about specifying and interpreting the

zero-inflation

component - I have only just begun reading about this.

I noticed that the dispersion parameter is quite high at 13.1. I'm

not

sure if this matters for beta models?. I tried running DHARMa
simulateResiduals on the model output and got significant

deviations in the

dispersion (<2.2e-16) and KS tests. e.g.

DHARMa::testDispersion(beta_mod)

*DHARMa nonparametric dispersion test via sd of residuals fitted

vs.

simulated*

*data:  simulationOutput*
*ratioObsSim = 1.3612, p-value < 2.2e-16*
*alternative hypothesis: two.sided*



Many thanks,
Mike

On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
thierry.onkelinx at inbo.be> wrote:

Dear Michael,

Your data has bounds (lower bound at 0 and upper bound at 300)

and

you

have a lot of data close to a boundary. In such a case, a

continuous

distribution which ignores those bound is not a good idea. If the

time

spent outside of both zones is limited, then a long time in zone

excludes

a long time in zone B by definition. Then I'd look towards a

multinomial

distribution. If the time spent outside both zones is dominant,

then you

can use a zero-inflated beta as you suggested. A zero-inflated

gamma might

be OK if the data is not too close to the upper boundary. If you

are

considering zero-inflated beta vs zero-inflated gamma, then you

should

choose zero-inflated beta IMHO.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality

Assurance

thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be

no

more than asking him to perform a post-mortem examination: he may

be able

to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer

does

not ensure that a reasonable answer can be extracted from a given

body of

data. ~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
R-sig-mixed-models <r-sig-mixed-models at r-project.org>:

Hello,

I am new to GLMMs and have a dataset where I have two distinct

groups

(A
and B) of 7 individuals each. The data consists of repeated
measurements of
each individual where the amount of time spent at either zone_A

or

zone_B
is recorded (out of a total time of 300s/observation period).

For

most of
the time period the individuals are in neither zone.

I want to test if group A and group B spend more time in zone A
compared to
zone B (and vice versa).

Speaking to someone else, they said I should use a Binomial GLMM

using

cbind. i.e.
cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).

However, the response variable is continuous (albeit with an

upper

bound of
300 seconds per observation period), so I'm not sure if this is
appropriate?

Should I convert the response into a proportion and use

something

like a
Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something

like:

prop_time ~ zone*group + (1|id)

The data is quite heavily right-skewed and contains a lot of

0's,

so

reading around it also looks like I may need to convert these

into a

zero-inflated/hurdle model?

Thank you for any suggestions,
Mike

         [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models mailing list
R-sig-mixed-models at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


------------------------------

End of R-sig-mixed-models Digest, Vol 173, Issue 19
***************************************************

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Michael Lawson

Thu, May 20, 2021 2:34 AM #

Dear Thierry,

Sorry, I didn't realise I was sending everything in HTML mode. I have
enabled plain text mode so hopefully it has worked this time (code
below). Thanks for the suggestion and link - comparing the
observed-expected density looks a lot better than just comparing the
means like I did. I'll have a go!

Thanks,
Mike

bins <- seq(from = 0, to = 1, by = 0.01)

sim_seq <- c()
for(i in bins){
  a <- mean(unlist(lapply(sim_beta_glmm, function(x){ length(which(x>i
& x <= i+0.01))/length(x)}), use.names = FALSE))
  sim_seq <- c(sim_seq,a)
}

real_seq <- c()
for(i in bins){
  a <- print(mean(unlist(lapply(glmm_zone_data$prop_time, function(x){
length(which(x>i & x <= i+0.01))/length(x)}), use.names = FALSE)))
  real_seq <- c(real_seq,a)
}

barplot(rbind(sim_seq,real_seq),col=c("green","red"),beside = TRUE,
legend.text = c("simulated","real"))

cbind(bins,sim_seq,real_seq)

bins         sim_seq    real_seq
  [1,] 0.00 0.2199795081967 0.232240437
  [2,] 0.01 0.1021612021858 0.166666667
  [3,] 0.02 0.0750964480874 0.103825137
  [4,] 0.03 0.0592128415301 0.073770492
  [5,] 0.04 0.0478316939891 0.038251366
  [6,] 0.05 0.0397267759563 0.030054645
  [7,] 0.06 0.0331961748634 0.016393443
  [8,] 0.07 0.0279322404372 0.013661202
  [9,] 0.08 0.0238013661202 0.024590164
 [10,] 0.09 0.0201101092896 0.010928962

On Thu, 20 May 2021 at 10:08, Thierry Onkelinx <thierry.onkelinx at inbo.be> wrote:

Dear Mike,

Your code is completely unreadable. Did you send it in plain text? Sending HTML email tends to mangle code.

I check the distribution by running a large number of simulations to get the range of potential values. Then I divide the observed density (from the data) by the expected density (from the simulations). See https://inlatools.netlify.app/articles/distribution.html#distribution-checks-1 for an example. This is based on INLA models and doesn't handle beta distributions (yet). But you can get an idea of how you can do this yourself.

Best regards,

Thierry

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////

Op do 20 mei 2021 om 10:46 schreef Michael Lawson via R-sig-mixed-models <r-sig-mixed-models at r-project.org>:

Hi again,

I wanted to look a little more at the rest of the simulated values
(non-zeros), so I took the mean proportion of simulated values in set bins
of 0.01 and compared this to the real values.


*bins <- seq(from = 0, to = 1, by = 0.01)*












*sim_seq <- c()for(i in bins){  a <- mean(unlist(lapply(sim_beta_glmm,
function(x){ length(which(x>i & x <= i+0.01))/length(x)}), use.names =
FALSE))  sim_seq <- c(sim_seq,a)}real_seq <- c()for(i in bins){  a <-
print(mean(unlist(lapply(glmm_zone_data$prop_time, function(x){
length(which(x>i & x <= i+0.01))/length(x)}), use.names = FALSE)))
real_seq <- c(real_seq,a)}*


*barplot(rbind(sim_seq,real_seq),col=c("green","red"),beside = TRUE,
legend.text = c("simulated","real"))*


*cbind(bins,sim_seq,real_seq)*











*bins         sim_seq    real_seq  [1,] 0.00 0.2199795081967 0.232240437
[2,] 0.01 0.1021612021858 0.166666667  [3,] 0.02 0.0750964480874
0.103825137  [4,] 0.03 0.0592128415301 0.073770492  [5,] 0.04
0.0478316939891 0.038251366  [6,] 0.05 0.0397267759563 0.030054645  [7,]
0.06 0.0331961748634 0.016393443  [8,] 0.07 0.0279322404372 0.013661202
[9,] 0.08 0.0238013661202 0.024590164 [10,] 0.09 0.0201101092896
0.010928962*

As you can see in the output above (and the plot attached), the real data
is quite a bit more right-skewed compared to the simulated values. Does
this look like a good enough fit or will I have to try a different model?

Many thanks,
Mike

On Wed, 19 May 2021 at 12:38, Michael Lawson <mrml500 at york.ac.uk> wrote:

Dear Alain,

Thank you for the suggestion. I tried simulating the 0's as you suggested
in the following way. The output is reassuring - with the actual number of
zeros in the middle of the distribution.

sim_beta_glmm <- simulate(beta_mod, nsim = 10000)
sim_zeros <- unlist(lapply(sim_beta_glmm, function(x){
length(which(x==0))/length(x)}), use.names = FALSE)
hist(sim_zeros, breaks = c(100))
abline(v = plogis(-1.1804), col = "red")

All the best,
Mike


On Wed, 19 May 2021 at 11:19, Highland Statistics Ltd via
R-sig-mixed-models <r-sig-mixed-models at r-project.org> wrote:

Today's Topics:

    1. Re:  Model specification/family for a continuous/proportional
       response with many zeros (Michael Lawson)

----------------------------------------------------------------------

Message: 1
Date: Wed, 19 May 2021 09:31:46 +0100
From: Michael Lawson <mrml500 at york.ac.uk>
To: Thierry Onkelinx <thierry.onkelinx at inbo.be>
Cc: r-sig-mixed-models <r-sig-mixed-models at r-project.org>
Subject: Re: [R-sig-ME]  Model specification/family for a
      continuous/proportional response with many zeros
Message-ID:
      <

CACtWw1GAnDJ1d8CWh_DODP6jPqfsyptc-RV0gw0_5z9bWEwubw at mail.gmail.com>

Content-Type: text/plain; charset="utf-8"

Hi Thierry,

I understood after your previous message that a high dispersion

parameter

in beta models does not signify overdispersion. Both answers addressed

this

misconception in those two links I provided.

I suppose the real answer I am now looking for is -  how do I assess the
validity and fit of the model? Are there specifics or assumptions I must
take into account with zero-inflated beta glmms?

Just simulate 1000 data sets from your model, and see whether the
simulated data sets are comparable to your original data. One of the
things you can look at is whether the simulated data sets contain
similar number of zeros as in the observed data. I'm not sure whether
the DHARMA package can do this for you. If not..it is easy to program.

See also Figure 8 in (sorry for self-citing):


https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12577

Alain

Thanks for your help,
Mike

On Tue, 18 May 2021 at 13:57, Thierry Onkelinx <

thierry.onkelinx at inbo.be>

wrote:

Dear Mike,

I think you misread my reply. I never stated that there's something

wrong

with the model. The only "problem" I highlighted was your misconception
about the "high overdispersion". In this case, a high parameter value
indicates a low variance, which is what we mostly want to see.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE

AND

FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no more
than asking him to perform a post-mortem examination: he may be able

to say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does

not

ensure that a reasonable answer can be extracted from a given body of

data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op di 18 mei 2021 om 12:49 schreef Michael Lawson <mrml500 at york.ac.uk

Dear Thierry,

Thanks for the help. So if the dispersion parameter in this model

doesn't

fit with the beta distribution, are there any alternative approaches

I can

use?

I can't seem to find much information on this elsewhere other than

these

two threads:
https://stats.stackexchange.com/a/451453/233414
https://stats.stackexchange.com/a/466951/233414

All the best,
Mike

On Tue, 18 May 2021 at 08:12, Thierry Onkelinx <

thierry.onkelinx at inbo.be>

wrote:

Dear Mike,

The zero-inflation is specified on the logit scale. plogis(-1.18) =
0.235 23.5% zero seems reasonable when reading your story. (Didn't

look at

the data).

You need to look at the definition for the "over"dispersion

parameter.

For a beta distribution is \phi with Var(y) = \mu * (1 - \mu) /

(\phi + 1)

(see ?glmmTMB::beta_family) Hence a large value of \phi implies a low
variance.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no

more

than asking him to perform a post-mortem examination: he may be able

to say

what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does

not

ensure that a reasonable answer can be extracted from a given body

of data.

~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 15:45 schreef Michael Lawson <

mrml500 at york.ac.uk>:

Hi Thierry,

Thank you for your advice and speedy response.

Most of the data is closer to the lower bound (0). e.g. the mean

time

for group A in zone A = 15.1 seconds and group A in zone B = 3.8

seconds.

However there are a very small number of outliers near the upper

bound, the

largest being 294 out of the 300 seconds (see the attached file if

you want

to look at the data).

I have taken a stab at running a Zero-inflated Beta GLMM using

glmmTMB

in R like so:

betta_mod <- glmmTMB(prop_time ~ group*zone + (1|id),
                              family = beta_family(),
                              ziformula=~1,
                              data = glmm_zone_data)

summary(beta_mod)

*Family: beta  ( logit )*






*Formula:          prop_time ~ group * zone + (1 | id)Zero

inflation:

           ~1Data: glmm_zone_data     AIC      BIC   logLik deviance
df.resid  -763.6   -736.3    388.8   -777.6      359Random
effects:Conditional model: Groups Name        Variance  Std.Dev. id
(Intercept) 2.386e-09 4.885e-05Number of obs: 366, groups:  id,
14Overdispersion parameter for beta family (): 13.1Conditional

model:

             Estimate Std. Error z value Pr(>|z|)    (Intercept)
  -2.7685     0.1031 -26.844  < 2e-16 ***groupB             -0.4455
0.1498  -2.975 0.002932 **zonezone_B         -0.4179     0.1524

-2.741

0.006124 **groupB:zonezone_B   0.8443     0.2190   3.855 0.000116
***---Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ?
1Zero-inflation model:            Estimate Std. Error z value

Pr(>|z|)

   (Intercept)  -1.1804     0.1233  -9.575   <2e-16 ***---Signif.

codes:  0

?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1*

Does this look like the correct way of specifying the model? I am a
little confused about specifying and interpreting the zero-inflation
component - I have only just begun reading about this.

I noticed that the dispersion parameter is quite high at 13.1. I'm

not

sure if this matters for beta models?. I tried running DHARMa
simulateResiduals on the model output and got significant

deviations in the

dispersion (<2.2e-16) and KS tests. e.g.

DHARMa::testDispersion(beta_mod)

*DHARMa nonparametric dispersion test via sd of residuals fitted vs.
simulated*

*data:  simulationOutput*
*ratioObsSim = 1.3612, p-value < 2.2e-16*
*alternative hypothesis: two.sided*



Many thanks,
Mike

On Mon, 17 May 2021 at 13:22, Thierry Onkelinx <
thierry.onkelinx at inbo.be> wrote:

Dear Michael,

Your data has bounds (lower bound at 0 and upper bound at 300) and

you

have a lot of data close to a boundary. In such a case, a

continuous

distribution which ignores those bound is not a good idea. If the

time

spent outside of both zones is limited, then a long time in zone A

excludes

a long time in zone B by definition. Then I'd look towards a

multinomial

distribution. If the time spent outside both zones is dominant,

then you

can use a zero-inflated beta as you suggested. A zero-inflated

gamma might

be OK if the data is not too close to the upper boundary. If you

are

considering zero-inflated beta vs zero-inflated gamma, then you

should

choose zero-inflated beta IMHO.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR

NATURE

AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality

Assurance

thierry.onkelinx at inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////

To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may

be able

to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer

does

not ensure that a reasonable answer can be extracted from a given

body of

data. ~ John Tukey

///////////////////////////////////////////////////////////////////////////////////////////

<https://www.inbo.be>


Op ma 17 mei 2021 om 13:52 schreef Michael Lawson via
R-sig-mixed-models <r-sig-mixed-models at r-project.org>:

Hello,

I am new to GLMMs and have a dataset where I have two distinct

groups

(A
and B) of 7 individuals each. The data consists of repeated
measurements of
each individual where the amount of time spent at either zone_A or
zone_B
is recorded (out of a total time of 300s/observation period). For
most of
the time period the individuals are in neither zone.

I want to test if group A and group B spend more time in zone A
compared to
zone B (and vice versa).

Speaking to someone else, they said I should use a Binomial GLMM

using

cbind. i.e.
cbind(time_at_zone_A, time_at_zone_B) ~ group + (1| id).

However, the response variable is continuous (albeit with an upper
bound of
300 seconds per observation period), so I'm not sure if this is
appropriate?

Should I convert the response into a proportion and use something
like a
Beta GLMM or else use a continuous (Gamma) GLMM? e.g. something

like:

prop_time ~ zone*group + (1|id)

The data is quite heavily right-skewed and contains a lot of 0's,

so

reading around it also looks like I may need to convert these

into a

zero-inflated/hurdle model?

Thank you for any suggestions,
Mike

         [[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models mailing list
R-sig-mixed-models at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


------------------------------

End of R-sig-mixed-models Digest, Vol 173, Issue 19
***************************************************

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

John Maindonald

Fri, May 21, 2021 3:30 AM #

The code that follows below demonstrates what I find a very odd
issue with r-sig-mixed-models at r-project.org<mailto:r-sig-mixed-models at r-project.org>, and with the comparison between
glmmTMB::glmmTMB and lme4::glmer, for models that have the
same model formulae.

1) In the glmmTMB model as fitted to `ffly`, one predicted value is
infinite.  This, even though all coefficients and standard errors
have finite values.  This surely indicates that there is something
odd in the way that predictions on the scale of the linear
predictor are calculated.

2) Initially, I'd thought that the `Inf` value explained the
difference between the two AIC values.  But the difference
pretty much stays the same when the "offending" point is
removed.

I've not checked what happens when there are no observation level
random effects.  Is there some difference in the way that the
model formulae are interpreted in the two cases?


ffly <- read.csv('https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv')
ffly$obs <- factor(ffly$obs)

form1 <- cbind(Dead,Live)~0+trtGp/TrtTime+(1|obs)+(1|trtGpRep)

library(lme4); library(glmmTMB)
ObsTMB.cll <- glmmTMB(form1,
                      family=binomial(link="cloglog"), data=ffly)
Obsglmer.cll <- glmer(form1, nAGQ=0,
                      family=binomial(link="cloglog"), data=ffly)

round(AIC(Obsglmer.cll, ObsTMB.cll), 2)
##                      df    AIC
## Obsglmer.cll 14 639.44
## ObsTMB.cll  14 636.02

round(c(max(predict(ObsTMB.cll)),max(predict(Obsglmer.cll))),3)
## [1]   Inf 3.833

range(fixef(ObsTMB.cll))
## [1] -5.28780064553  1.15952249272
range(vcov(ObsTMB.cll))
## [1] -0.0546986905338  0.2887049942215

## Try also; there are small deviations from the line
plot(predict(Obsglmer.cll)~predict(ObsTMB.cll))
abline(0,1)

## Remove offending point
ffly1 <- ffly[-which.max(predict(ObsTMB.cll)),]
ObsTMB1.cll <- glmmTMB(form1,
                      family=binomial(link="cloglog"), data=ffly1)
Obsglmer1.cll <- glmer(form1, nAGQ=0,
                      family=binomial(link="cloglog"), data=ffly1)

cbind(AIC(Obsglmer.cll, ObsTMB.cll), AIC1=AIC(Obsglmer1.cll, ObsTMB1.cll)[,2])
##                       df                    AIC                  AIC1
## Obsglmer.cll  14 639.441888969 639.441889196
## ObsTMB.cll   14 636.016597730 636.016597723
  ## Observe that changes are in the final four decimal places.


round(rbind(glmer=fixef(Obsglmer1.cll),
TMB=unclass(fixef(ObsTMB1.cll)$cond)),3)

      trtGpspAEgg trtGpspAL2 trtGpspAL3 trtGpspBEgg trtGpspBL2 trtGpspBL3
glmer       0.772     -2.255     -3.299      -1.803     -2.632     -5.114
TMB         0.790     -2.367     -3.441      -1.889     -2.757     -5.288
      trtGpspAEgg:TrtTime trtGpspAL2:TrtTime trtGpspAL3:TrtTime
glmer               0.231              0.372              0.563
TMB                 0.288              0.398              0.602
      trtGpspBEgg:TrtTime trtGpspBL2:TrtTime trtGpspBL3:TrtTime
glmer               0.278              0.517              1.112
TMB                 0.299              0.554              1.160


John Maindonald             email: john.maindonald at anu.edu.au<mailto:john.maindonald at anu.edu.au>

Ben Bolker

Fri, May 21, 2021 6:27 AM #

I don't know yet, I will dig in and try to see what's going on.  An 
infinite predicted value certainly seems like an issue.

   The most obvious difference is that nAGQ=0 is actually doing 
something slightly different (it's fitting the fixed-effect parameters 
as part of the PIRLS loop rather than maximizing over them explicitly); 
I would rather compare the nAGQ=1 case, just to minimize the number of 
differences.

On 5/21/21 6:30 AM, John Maindonald wrote:

The code that follows below demonstrates what I find a very odd
issue with r-sig-mixed-models at r-project.org<mailto:r-sig-mixed-models at r-project.org>, and with the comparison between
glmmTMB::glmmTMB and lme4::glmer, for models that have the
same model formulae.

1) In the glmmTMB model as fitted to `ffly`, one predicted value is
infinite.  This, even though all coefficients and standard errors
have finite values.  This surely indicates that there is something
odd in the way that predictions on the scale of the linear
predictor are calculated.

2) Initially, I'd thought that the `Inf` value explained the
difference between the two AIC values.  But the difference
pretty much stays the same when the "offending" point is
removed.

I've not checked what happens when there are no observation level
random effects.  Is there some difference in the way that the
model formulae are interpreted in the two cases?


ffly <- read.csv('https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv')
ffly$obs <- factor(ffly$obs)

form1 <- cbind(Dead,Live)~0+trtGp/TrtTime+(1|obs)+(1|trtGpRep)

library(lme4); library(glmmTMB)
ObsTMB.cll <- glmmTMB(form1,
                       family=binomial(link="cloglog"), data=ffly)
Obsglmer.cll <- glmer(form1, nAGQ=0,
                       family=binomial(link="cloglog"), data=ffly)

round(AIC(Obsglmer.cll, ObsTMB.cll), 2)
##                      df    AIC
## Obsglmer.cll 14 639.44
## ObsTMB.cll  14 636.02

round(c(max(predict(ObsTMB.cll)),max(predict(Obsglmer.cll))),3)
## [1]   Inf 3.833

range(fixef(ObsTMB.cll))
## [1] -5.28780064553  1.15952249272
range(vcov(ObsTMB.cll))
## [1] -0.0546986905338  0.2887049942215

## Try also; there are small deviations from the line
plot(predict(Obsglmer.cll)~predict(ObsTMB.cll))
abline(0,1)

## Remove offending point
ffly1 <- ffly[-which.max(predict(ObsTMB.cll)),]
ObsTMB1.cll <- glmmTMB(form1,
                       family=binomial(link="cloglog"), data=ffly1)
Obsglmer1.cll <- glmer(form1, nAGQ=0,
                       family=binomial(link="cloglog"), data=ffly1)

cbind(AIC(Obsglmer.cll, ObsTMB.cll), AIC1=AIC(Obsglmer1.cll, ObsTMB1.cll)[,2])
##                       df                    AIC                  AIC1
## Obsglmer.cll  14 639.441888969 639.441889196
## ObsTMB.cll   14 636.016597730 636.016597723
   ## Observe that changes are in the final four decimal places.


round(rbind(glmer=fixef(Obsglmer1.cll),
TMB=unclass(fixef(ObsTMB1.cll)$cond)),3)

       trtGpspAEgg trtGpspAL2 trtGpspAL3 trtGpspBEgg trtGpspBL2 trtGpspBL3
glmer       0.772     -2.255     -3.299      -1.803     -2.632     -5.114
TMB         0.790     -2.367     -3.441      -1.889     -2.757     -5.288
       trtGpspAEgg:TrtTime trtGpspAL2:TrtTime trtGpspAL3:TrtTime
glmer               0.231              0.372              0.563
TMB                 0.288              0.398              0.602
       trtGpspBEgg:TrtTime trtGpspBL2:TrtTime trtGpspBL3:TrtTime
glmer               0.278              0.517              1.112
TMB                 0.299              0.554              1.160


John Maindonald             email: john.maindonald at anu.edu.au<mailto:john.maindonald at anu.edu.au>


	[[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

John Maindonald

Fri, May 21, 2021 3:24 PM #

Setting nAGQ=1 does bring the AIC values closer together, albeit with
convergence failure when other parameters are left at their defaults.
Running allFit() does not provide (to me) any obvious clues on what
might work better.  Or, should one just accept the less strict tolerance?
Nor does centering and scale scTime help in this case.

llik <- summary(allFit(Obsglmer1.cll))$llik
max(lik)-lik
##                       bobyqa                   Nelder_Mead
##           0.00000000000e+00             1.31201379190e-04
##                   nlminbwrap                         nmkbw
##            1.76015646502e-09             3.48215053236e-06
##              optimx.L-BFGS-B nloptwrap.NLOPT_LN_NELDERMEAD
##            1.09246872171e-06             1.13990722639e-07
##    nloptwrap.NLOPT_LN_BOBYQA
##            5.81495783081e-07


John Maindonald             email: john.maindonald at anu.edu.au<mailto:john.maindonald at anu.edu.au>

On 22/05/2021, at 01:27, Ben Bolker <bbolker at gmail.com<mailto:bbolker at gmail.com>> wrote:

I don't know yet, I will dig in and try to see what's going on.  An infinite predicted value certainly seems like an issue.

 The most obvious difference is that nAGQ=0 is actually doing something slightly different (it's fitting the fixed-effect parameters as part of the PIRLS loop rather than maximizing over them explicitly); I would rather compare the nAGQ=1 case, just to minimize the number of differences.

On 5/21/21 6:30 AM, John Maindonald wrote:

The code that follows below demonstrates what I find a very odd
issue with r-sig-mixed-models at r-project.org<mailto:r-sig-mixed-models at r-project.org><mailto:r-sig-mixed-models at r-project.org>, and with the comparison between
glmmTMB::glmmTMB and lme4::glmer, for models that have the
same model formulae.
1) In the glmmTMB model as fitted to `ffly`, one predicted value is
infinite.  This, even though all coefficients and standard errors
have finite values.  This surely indicates that there is something
odd in the way that predictions on the scale of the linear
predictor are calculated.
2) Initially, I'd thought that the `Inf` value explained the
difference between the two AIC values.  But the difference
pretty much stays the same when the "offending" point is
removed.
I've not checked what happens when there are no observation level
random effects.  Is there some difference in the way that the
model formulae are interpreted in the two cases?
ffly <- read.csv('https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv')
ffly$obs <- factor(ffly$obs)
form1 <- cbind(Dead,Live)~0+trtGp/TrtTime+(1|obs)+(1|trtGpRep)
library(lme4); library(glmmTMB)
ObsTMB.cll <- glmmTMB(form1,
                      family=binomial(link="cloglog"), data=ffly)
Obsglmer.cll <- glmer(form1, nAGQ=0,
                      family=binomial(link="cloglog"), data=ffly)
round(AIC(Obsglmer.cll, ObsTMB.cll), 2)
##                      df    AIC
## Obsglmer.cll 14 639.44
## ObsTMB.cll  14 636.02
round(c(max(predict(ObsTMB.cll)),max(predict(Obsglmer.cll))),3)
## [1]   Inf 3.833
range(fixef(ObsTMB.cll))
## [1] -5.28780064553  1.15952249272
range(vcov(ObsTMB.cll))
## [1] -0.0546986905338  0.2887049942215
## Try also; there are small deviations from the line
plot(predict(Obsglmer.cll)~predict(ObsTMB.cll))
abline(0,1)
## Remove offending point
ffly1 <- ffly[-which.max(predict(ObsTMB.cll)),]
ObsTMB1.cll <- glmmTMB(form1,
                      family=binomial(link="cloglog"), data=ffly1)
Obsglmer1.cll <- glmer(form1, nAGQ=0,
                      family=binomial(link="cloglog"), data=ffly1)
cbind(AIC(Obsglmer.cll, ObsTMB.cll), AIC1=AIC(Obsglmer1.cll, ObsTMB1.cll)[,2])
##                       df                    AIC                  AIC1
## Obsglmer.cll  14 639.441888969 639.441889196
## ObsTMB.cll   14 636.016597730 636.016597723
  ## Observe that changes are in the final four decimal places.
round(rbind(glmer=fixef(Obsglmer1.cll),
TMB=unclass(fixef(ObsTMB1.cll)$cond)),3)
      trtGpspAEgg trtGpspAL2 trtGpspAL3 trtGpspBEgg trtGpspBL2 trtGpspBL3
glmer       0.772     -2.255     -3.299      -1.803     -2.632     -5.114
TMB         0.790     -2.367     -3.441      -1.889     -2.757     -5.288
      trtGpspAEgg:TrtTime trtGpspAL2:TrtTime trtGpspAL3:TrtTime
glmer               0.231              0.372              0.563
TMB                 0.288              0.398              0.602
      trtGpspBEgg:TrtTime trtGpspBL2:TrtTime trtGpspBL3:TrtTime
glmer               0.278              0.517              1.112
TMB                 0.299              0.554              1.160
John Maindonald             email: john.maindonald at anu.edu.au<mailto:john.maindonald at anu.edu.au><mailto:john.maindonald at anu.edu.au>
_______________________________________________
R-sig-mixed-models at r-project.org<mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models


_______________________________________________
R-sig-mixed-models at r-project.org<mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Ben Bolker

Fri, May 21, 2021 5:16 PM #

What allFit indicates to me is that all of the different optimizers 
reach very similar log-likelihoods (max difference of 1-4), and if I 
recall correctly all of the fixed effects are also very similar, so I 
would feel free to disregard the convergence warning as a false positive.

   The AIC values are still a lot more different than I would like, and 
digging into the gory details (available on request) it seems that they 
are actually giving different likelihood calculations -- e.g. for the 
parameters reached by glmer, glmmTMB returns a deviance of 607.037 while 
glmer returns 609.3362.  I suspect something deep in the guts of e.g. 
the way the Laplace approximation is computed (not technically a bug in 
either code, but some kind of numerical instability).

   This is good news if what you're interested in is making sure you get 
the correct parameter values.  It's bad news if you're interested in 
mixing and matching glmmTMB and lme4 fits in a model comparison.

   I'm not horribly surprised that things are a bit numerically unstable 
- 12 parameters for 95 observations, and the extremely rapid tail 
behaviour of cloglog often makes things worse.

   I do intend to track down where the infinite prediction is coming 
from too, but haven't gotten there yet.

I'm going to include some of my code in case it's useful, but it's 
poorly commented, messy *and may be in the wrong order* - I was bouncing 
around developing it interactively.

====
Obsglmer1.cll <- glmer(form1, nAGQ=1,
                       family=binomial(link="cloglog"), data=ffly)
aa <- allFit(Obsglmer1.cll)
summary(aa)$fixef
ll <- -summary(aa)$llik
ll-min(ll)
dotwhisker::dwplot(modList[-2])

library(dotwhisker)
modList <- list(glmmTMB=ObsTMB.cll, nAGQ0=Obsglmer.cll,
                 nAGQ1=Obsglmer1.cll)
f1 <- ObsTMB.cll$obj$fn
## TMB params, TMB format
p1 <- with(environment(f1),last.par.best[-random])
## TMB params, glmer format
p1B <- c(exp(p1[names(p1)=="theta"]), p1[names(p1)=="beta"])

f3 <- getME(Obsglmer1.cll,"devfun")
## glmer params, glmer format
p3 <- unlist(getME(Obsglmer1.cll,c("theta","beta")))
## glmer params, TMB format
p3B <- c(p3[-(1:2)],log(p3[1:2]))
f3(p1B) ## 609.3559

f3(p3)  ## 609.3362
2*f1(p3B) ## 608.037

2*f1(p1) ## 608.0166

Obsglmer2.cll <- update(Obsglmer1.cll,
                         start=list(theta=p1B[1:2],
                                    fixef=p1B[-(1:2)]))

dwplot(modList)
library(bbmle)
AICtab(modList, logLik=TRUE)
modList2 <- c(modList, list(nAGQ1_newstart=Obsglmer2.cll))
AICtab(modList2, logLik=TRUE)

On 5/21/21 6:24 PM, John Maindonald wrote:

Setting nAGQ=1 does bring the AIC values closer together, albeit with
convergence failure when other parameters are left at their defaults.
Running allFit() does not provide (to me) any obvious clues on what
might work better. ?Or, should one just accept the less strict tolerance?
Nor does centering and scale scTime help in this case.

llik <- summary(allFit(Obsglmer1.cll))$llik
max(lik)-lik
## ? ? ? ? ? ? ? ? ? ? ? bobyqa ? ? ? ? ? ? ? ? ? Nelder_Mead
## ? ? ? ? ? 0.00000000000e+00 ? ? ? ? ? ? 1.31201379190e-04
## ? ? ? ? ? ? ? ? ? nlminbwrap ? ? ? ? ? ? ? ? ? ? ? ? nmkbw
## ? ? ? ? ? ?1.76015646502e-09 ? ? ? ? ? ? 3.48215053236e-06
## ? ? ? ? ? ? ?optimx.L-BFGS-B nloptwrap.NLOPT_LN_NELDERMEAD
## ? ? ? ? ? ?1.09246872171e-06 ? ? ? ? ? ? 1.13990722639e-07
## ? ?nloptwrap.NLOPT_LN_BOBYQA
## ? ? ? ? ? ?5.81495783081e-07

John Maindonaldemail: john.maindonald at anu.edu.au 
<mailto:john.maindonald at anu.edu.au>

On 22/05/2021, at 01:27, Ben Bolker <bbolker at gmail.com 
<mailto:bbolker at gmail.com>> wrote:

?I don't know yet, I will dig in and try to see what's going on. ?An 
infinite predicted value certainly seems like an issue.

?The most obvious difference is that nAGQ=0 is actually doing 
something slightly different (it's fitting the fixed-effect parameters 
as part of the PIRLS loop rather than maximizing over them 
explicitly); I would rather compare the nAGQ=1 case, just to minimize 
the number of differences.

On 5/21/21 6:30 AM, John Maindonald wrote:

The code that follows below demonstrates what I find a very odd
issue with r-sig-mixed-models at r-project.org 
<mailto:r-sig-mixed-models at r-project.org><mailto:r-sig-mixed-models at r-project.org 
<mailto:r-sig-mixed-models at r-project.org>>, and with the comparison 
between
glmmTMB::glmmTMB and lme4::glmer, for models that have the
same model formulae.
1) In the glmmTMB model as fitted to `ffly`, one predicted value is
infinite. ?This, even though all coefficients and standard errors
have finite values. ?This surely indicates that there is something
odd in the way that predictions on the scale of the linear
predictor are calculated.
2) Initially, I'd thought that the `Inf` value explained the
difference between the two AIC values. ?But the difference
pretty much stays the same when the "offending" point is
removed.
I've not checked what happens when there are no observation level
random effects. ?Is there some difference in the way that the
model formulae are interpreted in the two cases?
ffly <- 
read.csv('https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv <https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv>')
ffly$obs <- factor(ffly$obs)
form1 <- cbind(Dead,Live)~0+trtGp/TrtTime+(1|obs)+(1|trtGpRep)
library(lme4); library(glmmTMB)
ObsTMB.cll <- glmmTMB(form1,
??????????????????????family=binomial(link="cloglog"), data=ffly)
Obsglmer.cll <- glmer(form1, nAGQ=0,
??????????????????????family=binomial(link="cloglog"), data=ffly)
round(AIC(Obsglmer.cll, ObsTMB.cll), 2)
## ?????????????????????df ???AIC
## Obsglmer.cll 14 639.44
## ObsTMB.cll ?14 636.02
round(c(max(predict(ObsTMB.cll)),max(predict(Obsglmer.cll))),3)
## [1] ??Inf 3.833
range(fixef(ObsTMB.cll))
## [1] -5.28780064553 ?1.15952249272
range(vcov(ObsTMB.cll))
## [1] -0.0546986905338 ?0.2887049942215
## Try also; there are small deviations from the line
plot(predict(Obsglmer.cll)~predict(ObsTMB.cll))
abline(0,1)
## Remove offending point
ffly1 <- ffly[-which.max(predict(ObsTMB.cll)),]
ObsTMB1.cll <- glmmTMB(form1,
??????????????????????family=binomial(link="cloglog"), data=ffly1)
Obsglmer1.cll <- glmer(form1, nAGQ=0,
??????????????????????family=binomial(link="cloglog"), data=ffly1)
cbind(AIC(Obsglmer.cll, ObsTMB.cll), AIC1=AIC(Obsglmer1.cll, 
ObsTMB1.cll)[,2])
## ??????????????????????df ???????????????????AIC ?????????????????AIC1
## Obsglmer.cll ?14 639.441888969 639.441889196
## ObsTMB.cll ??14 636.016597730 636.016597723
??## Observe that changes are in the final four decimal places.
round(rbind(glmer=fixef(Obsglmer1.cll),
TMB=unclass(fixef(ObsTMB1.cll)$cond)),3)
??????trtGpspAEgg trtGpspAL2 trtGpspAL3 trtGpspBEgg trtGpspBL2 trtGpspBL3
glmer ??????0.772 ????-2.255 ????-3.299 ?????-1.803 ????-2.632 ????-5.114
TMB ????????0.790 ????-2.367 ????-3.441 ?????-1.889 ????-2.757 ????-5.288
??????trtGpspAEgg:TrtTime trtGpspAL2:TrtTime trtGpspAL3:TrtTime
glmer ??????????????0.231 ?????????????0.372 ?????????????0.563
TMB ????????????????0.288 ?????????????0.398 ?????????????0.602
??????trtGpspBEgg:TrtTime trtGpspBL2:TrtTime trtGpspBL3:TrtTime
glmer ??????????????0.278 ?????????????0.517 ?????????????1.112
TMB ????????????????0.299 ?????????????0.554 ?????????????1.160
John Maindonald email: john.maindonald at anu.edu.au 
<mailto:john.maindonald at anu.edu.au><mailto:john.maindonald at anu.edu.au 
<mailto:john.maindonald at anu.edu.au>>
[[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org 
<mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models 
<https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>

_______________________________________________
R-sig-mixed-models at r-project.org 
<mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models 
<https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>

Ben Bolker

Fri, May 21, 2021 6:29 PM #

I figured out the infinite-prediction thing (didn't solve it, but 
figured it out):

https://github.com/glmmTMB/glmmTMB/issues/696

On 5/21/21 6:24 PM, John Maindonald wrote:

Setting nAGQ=1 does bring the AIC values closer together, albeit with
convergence failure when other parameters are left at their defaults.
Running allFit() does not provide (to me) any obvious clues on what
might work better. ?Or, should one just accept the less strict tolerance?
Nor does centering and scale scTime help in this case.

llik <- summary(allFit(Obsglmer1.cll))$llik
max(lik)-lik
## ? ? ? ? ? ? ? ? ? ? ? bobyqa ? ? ? ? ? ? ? ? ? Nelder_Mead
## ? ? ? ? ? 0.00000000000e+00 ? ? ? ? ? ? 1.31201379190e-04
## ? ? ? ? ? ? ? ? ? nlminbwrap ? ? ? ? ? ? ? ? ? ? ? ? nmkbw
## ? ? ? ? ? ?1.76015646502e-09 ? ? ? ? ? ? 3.48215053236e-06
## ? ? ? ? ? ? ?optimx.L-BFGS-B nloptwrap.NLOPT_LN_NELDERMEAD
## ? ? ? ? ? ?1.09246872171e-06 ? ? ? ? ? ? 1.13990722639e-07
## ? ?nloptwrap.NLOPT_LN_BOBYQA
## ? ? ? ? ? ?5.81495783081e-07

John Maindonaldemail: john.maindonald at anu.edu.au 
<mailto:john.maindonald at anu.edu.au>

On 22/05/2021, at 01:27, Ben Bolker <bbolker at gmail.com 
<mailto:bbolker at gmail.com>> wrote:

?I don't know yet, I will dig in and try to see what's going on. ?An 
infinite predicted value certainly seems like an issue.

?The most obvious difference is that nAGQ=0 is actually doing 
something slightly different (it's fitting the fixed-effect parameters 
as part of the PIRLS loop rather than maximizing over them 
explicitly); I would rather compare the nAGQ=1 case, just to minimize 
the number of differences.

On 5/21/21 6:30 AM, John Maindonald wrote:

The code that follows below demonstrates what I find a very odd
issue with r-sig-mixed-models at r-project.org 
<mailto:r-sig-mixed-models at r-project.org><mailto:r-sig-mixed-models at r-project.org 
<mailto:r-sig-mixed-models at r-project.org>>, and with the comparison 
between
glmmTMB::glmmTMB and lme4::glmer, for models that have the
same model formulae.
1) In the glmmTMB model as fitted to `ffly`, one predicted value is
infinite. ?This, even though all coefficients and standard errors
have finite values. ?This surely indicates that there is something
odd in the way that predictions on the scale of the linear
predictor are calculated.
2) Initially, I'd thought that the `Inf` value explained the
difference between the two AIC values. ?But the difference
pretty much stays the same when the "offending" point is
removed.
I've not checked what happens when there are no observation level
random effects. ?Is there some difference in the way that the
model formulae are interpreted in the two cases?
ffly <- 
read.csv('https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv <https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv>')
ffly$obs <- factor(ffly$obs)
form1 <- cbind(Dead,Live)~0+trtGp/TrtTime+(1|obs)+(1|trtGpRep)
library(lme4); library(glmmTMB)
ObsTMB.cll <- glmmTMB(form1,
??????????????????????family=binomial(link="cloglog"), data=ffly)
Obsglmer.cll <- glmer(form1, nAGQ=0,
??????????????????????family=binomial(link="cloglog"), data=ffly)
round(AIC(Obsglmer.cll, ObsTMB.cll), 2)
## ?????????????????????df ???AIC
## Obsglmer.cll 14 639.44
## ObsTMB.cll ?14 636.02
round(c(max(predict(ObsTMB.cll)),max(predict(Obsglmer.cll))),3)
## [1] ??Inf 3.833
range(fixef(ObsTMB.cll))
## [1] -5.28780064553 ?1.15952249272
range(vcov(ObsTMB.cll))
## [1] -0.0546986905338 ?0.2887049942215
## Try also; there are small deviations from the line
plot(predict(Obsglmer.cll)~predict(ObsTMB.cll))
abline(0,1)
## Remove offending point
ffly1 <- ffly[-which.max(predict(ObsTMB.cll)),]
ObsTMB1.cll <- glmmTMB(form1,
??????????????????????family=binomial(link="cloglog"), data=ffly1)
Obsglmer1.cll <- glmer(form1, nAGQ=0,
??????????????????????family=binomial(link="cloglog"), data=ffly1)
cbind(AIC(Obsglmer.cll, ObsTMB.cll), AIC1=AIC(Obsglmer1.cll, 
ObsTMB1.cll)[,2])
## ??????????????????????df ???????????????????AIC ?????????????????AIC1
## Obsglmer.cll ?14 639.441888969 639.441889196
## ObsTMB.cll ??14 636.016597730 636.016597723
??## Observe that changes are in the final four decimal places.
round(rbind(glmer=fixef(Obsglmer1.cll),
TMB=unclass(fixef(ObsTMB1.cll)$cond)),3)
??????trtGpspAEgg trtGpspAL2 trtGpspAL3 trtGpspBEgg trtGpspBL2 trtGpspBL3
glmer ??????0.772 ????-2.255 ????-3.299 ?????-1.803 ????-2.632 ????-5.114
TMB ????????0.790 ????-2.367 ????-3.441 ?????-1.889 ????-2.757 ????-5.288
??????trtGpspAEgg:TrtTime trtGpspAL2:TrtTime trtGpspAL3:TrtTime
glmer ??????????????0.231 ?????????????0.372 ?????????????0.563
TMB ????????????????0.288 ?????????????0.398 ?????????????0.602
??????trtGpspBEgg:TrtTime trtGpspBL2:TrtTime trtGpspBL3:TrtTime
glmer ??????????????0.278 ?????????????0.517 ?????????????1.112
TMB ????????????????0.299 ?????????????0.554 ?????????????1.160
John Maindonald email: john.maindonald at anu.edu.au 
<mailto:john.maindonald at anu.edu.au><mailto:john.maindonald at anu.edu.au 
<mailto:john.maindonald at anu.edu.au>>
[[alternative HTML version deleted]]

_______________________________________________
R-sig-mixed-models at r-project.org 
<mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models 
<https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>

_______________________________________________
R-sig-mixed-models at r-project.org 
<mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models 
<https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>

John Maindonald

Fri, May 21, 2021 8:52 PM #

Thanks for the informative responses.

Well, it does look as though dealing with the `Inf` is as simple as
calculating eta_predict directly from the model for the linear predictor
and returning that. Why go the roundabout way and introduce
unnecessary inaccuracy that will in some cases manifest as a lurgy?

I?ve thought if it would be useful to collect a list of alternatives to default
control settings that users of these models have found to assist
convergence and/or avoid complaints of singularity in particular cases.
It does seem that some complaints are a consequence of the route
taken to get to convergence. Also, there is a dearth of published work
that comments on how the various choices of link function and error
model pan out in practical data analysis contexts, and (especially in
the 99% lethal time estimates for insects on which I have worked in
a plant quarantine context in the past)response , large elements of unrealism
about the confidence that can be placed in any upper tail limits and
confidence intervals.

Incidentally, I have been experimenting with observation level random
effects because they add to the binomial variance on the scale of the
linear predictor, rather than (as for betabinomial and quasibinomial)
multiplying on the scale of the response. For the dataset with which I
started, the variance then has to be ?corrected? by specifying a degree 2
or 3 polynomial or spline for the glmmTMB dispformula.

John Maindonald email: john.maindonald at anu.edu.au<mailto:john.maindonald at anu.edu.au>

On 22/05/2021, at 13:29, Ben Bolker <bbolker at gmail.com<mailto:bbolker at gmail.com>> wrote:

I figured out the infinite-prediction thing (didn't solve it, but figured it out):

https://github.com/glmmTMB/glmmTMB/issues/696

On 5/21/21 6:24 PM, John Maindonald wrote:

Setting nAGQ=1 does bring the AIC values closer together, albeit with
convergence failure when other parameters are left at their defaults.
Running allFit() does not provide (to me) any obvious clues on what
might work better.  Or, should one just accept the less strict tolerance?
Nor does centering and scale scTime help in this case.
llik <- summary(allFit(Obsglmer1.cll))$llik
max(lik)-lik
##                       bobyqa                   Nelder_Mead
##           0.00000000000e+00             1.31201379190e-04
##                   nlminbwrap                         nmkbw
##            1.76015646502e-09             3.48215053236e-06
##              optimx.L-BFGS-B nloptwrap.NLOPT_LN_NELDERMEAD
##            1.09246872171e-06             1.13990722639e-07
##    nloptwrap.NLOPT_LN_BOBYQA
##            5.81495783081e-07
John Maindonaldemail: john.maindonald at anu.edu.au<mailto:john.maindonald at anu.edu.au> <mailto:john.maindonald at anu.edu.au>

On 22/05/2021, at 01:27, Ben Bolker <bbolker at gmail.com<mailto:bbolker at gmail.com> <mailto:bbolker at gmail.com>> wrote:

I don't know yet, I will dig in and try to see what's going on.  An infinite predicted value certainly seems like an issue.

 The most obvious difference is that nAGQ=0 is actually doing something slightly different (it's fitting the fixed-effect parameters as part of the PIRLS loop rather than maximizing over them explicitly); I would rather compare the nAGQ=1 case, just to minimize the number of differences.

On 5/21/21 6:30 AM, John Maindonald wrote:

The code that follows below demonstrates what I find a very odd
issue with r-sig-mixed-models at r-project.org<mailto:r-sig-mixed-models at r-project.org> <mailto:r-sig-mixed-models at r-project.org><mailto:r-sig-mixed-models at r-project.org <mailto:r-sig-mixed-models at r-project.org>>, and with the comparison between
glmmTMB::glmmTMB and lme4::glmer, for models that have the
same model formulae.
1) In the glmmTMB model as fitted to `ffly`, one predicted value is
infinite.  This, even though all coefficients and standard errors
have finite values.  This surely indicates that there is something
odd in the way that predictions on the scale of the linear
predictor are calculated.
2) Initially, I'd thought that the `Inf` value explained the
difference between the two AIC values.  But the difference
pretty much stays the same when the "offending" point is
removed.
I've not checked what happens when there are no observation level
random effects.  Is there some difference in the way that the
model formulae are interpreted in the two cases?
ffly <- read.csv('https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv<https://github.com/jhmaindonald/dataR/files/6521483/ffly.csv>')
ffly$obs <- factor(ffly$obs)
form1 <- cbind(Dead,Live)~0+trtGp/TrtTime+(1|obs)+(1|trtGpRep)
library(lme4); library(glmmTMB)
ObsTMB.cll <- glmmTMB(form1,
                      family=binomial(link="cloglog"), data=ffly)
Obsglmer.cll <- glmer(form1, nAGQ=0,
                      family=binomial(link="cloglog"), data=ffly)
round(AIC(Obsglmer.cll, ObsTMB.cll), 2)
##                      df    AIC
## Obsglmer.cll 14 639.44
## ObsTMB.cll  14 636.02
round(c(max(predict(ObsTMB.cll)),max(predict(Obsglmer.cll))),3)
## [1]   Inf 3.833
range(fixef(ObsTMB.cll))
## [1] -5.28780064553  1.15952249272
range(vcov(ObsTMB.cll))
## [1] -0.0546986905338  0.2887049942215
## Try also; there are small deviations from the line
plot(predict(Obsglmer.cll)~predict(ObsTMB.cll))
abline(0,1)
## Remove offending point
ffly1 <- ffly[-which.max(predict(ObsTMB.cll)),]
ObsTMB1.cll <- glmmTMB(form1,
                      family=binomial(link="cloglog"), data=ffly1)
Obsglmer1.cll <- glmer(form1, nAGQ=0,
                      family=binomial(link="cloglog"), data=ffly1)
cbind(AIC(Obsglmer.cll, ObsTMB.cll), AIC1=AIC(Obsglmer1.cll, ObsTMB1.cll)[,2])
##                       df                    AIC                  AIC1
## Obsglmer.cll  14 639.441888969 639.441889196
## ObsTMB.cll   14 636.016597730 636.016597723
  ## Observe that changes are in the final four decimal places.
round(rbind(glmer=fixef(Obsglmer1.cll),
TMB=unclass(fixef(ObsTMB1.cll)$cond)),3)
      trtGpspAEgg trtGpspAL2 trtGpspAL3 trtGpspBEgg trtGpspBL2 trtGpspBL3
glmer       0.772     -2.255     -3.299      -1.803     -2.632     -5.114
TMB         0.790     -2.367     -3.441      -1.889     -2.757     -5.288
      trtGpspAEgg:TrtTime trtGpspAL2:TrtTime trtGpspAL3:TrtTime
glmer               0.231              0.372              0.563
TMB                 0.288              0.398              0.602
      trtGpspBEgg:TrtTime trtGpspBL2:TrtTime trtGpspBL3:TrtTime
glmer               0.278              0.517              1.112
TMB                 0.299              0.554              1.160
John Maindonald email: john.maindonald at anu.edu.au<mailto:john.maindonald at anu.edu.au> <mailto:john.maindonald at anu.edu.au><mailto:john.maindonald at anu.edu.au <mailto:john.maindonald at anu.edu.au>>
_______________________________________________
R-sig-mixed-models at r-project.org<mailto:R-sig-mixed-models at r-project.org> <mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>


_______________________________________________
R-sig-mixed-models at r-project.org<mailto:R-sig-mixed-models at r-project.org> <mailto:R-sig-mixed-models at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models <https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models>