zero-inflation and multimodal count distribution

[Please keep r-sig-mixed-models at r-project.org in the Cc: ...]

On Wed, May 3, 2017 at 1:43 PM, simone santoro
Thank you for observing that, it makes much sense!
I have being exploring better the data and have realized that the response
variable varies greatly among years in a way that might explain why the
multimodal marginal distribution. I have also tried different combinations
of random intercepts/slopes and have found that the best fit is achieved by
defining nest identity and year as (independent) random intercepts
{(1|NF)+(1|YEAR)}. Now, I would refine my modelling approach and have tried
to model these data in a different way.

As told in the previous mail, actually, I am interested in testing whether
the ratio of sons relative to daughters contribution in terms of number of
descendants (nDesc) depends on the non-additive effect of parental quality
(parQuality). Then, I have tried a zero-inflated binomial and betabinomial
family. In these cases I have considered as response variable the number of
descendants generated by each individual (son or daughter), i.e. nDesc,
relative to the sum of descendants (sumDesc) generated by the whole nest
(sons + daughters). My response variable is therefore: nDesc/sumDesc

I would think that a betabinomial distribution might be better because I
know that the response variable has lot of zeros and lot of 1. By the way:
do exist zero-one-inflated beta (mixed) models as that discussed here?

Thus I try the betabinomial first:
tmbBetaBinomial<- glmmTMB(nDesc/sumDesc ~ parQuality*sex+(1|NF)+(1|YEAR),?,
zi~1,family= list(family="betabinomial",link="logit"), weights= sumDesc)
and I get this warning message:

Warning message:
In nlminb(start = par, objective = fn, gradient = gr) :
  NA/NaN function evaluation
This is probably a harmless error message.  It just means that the
optimizer wandered into illegal territory somewhere along the way.

Then I try the binomial:
tmbBinomial<- glmmTMB(nDesc/sumDesc ~ parQuality*sex+(1|NF)+(1|YEAR),?,
zi~1,family= binomial, weights= sumDesc)
I get no warning messages.

This is the summary of the betabinomial glmm:

summary(tmbBetaBinomial)
Family: betabinomial  ( logit )
Formula:          nDesc/sumDesc ~ parQuality * sex + (1 | NF) + (1 | YEAR)
Zero inflation:                  ~1
Data: ind
Weights: sumDesc

     AIC      BIC   logLik deviance df.resid
  5939.0   5983.4  -2961.5   5923.0     1886

Random effects:

Conditional model:
 Groups  Name        Variance  Std.Dev.
 NF (Intercept) 4.130e-09 6.426e-05
 YEAR    (Intercept) 1.328e+00 1.152e+00
Number of obs: 1894, groups:  NF, 719; YEAR, 24

Overdispersion parameter for betabinomial family (): 0.38

Conditional model:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)    0.099863   0.355641   0.281    0.779
parQuality        0.001557   0.004606   0.338    0.735
sexmal         0.002171   0.350213   0.006    0.995
parQuality:sexmal 0.000784   0.006104   0.128    0.898

Zero-inflation model:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    -27.8    64568.8       0        1

There is a strange z value for the zero-inflation model, what could it mean?
It means that the model tried to reduce the zero-inflation component
to zero.  It fits the zero-inflation probability on a logit scale, so
it can't make it *exactly* zero, but it can make it very very close.
Note that the NF variance is also effectively zero (std dev 10^5 times
smaller than the YEAR variance and about 10 times smaller than the
smallest-magnitude fixed effect). The AIC/log-likelihood for this
model are much better than for the binomial GLMM, so (alas) I would
advise you to take this model instead.

  In any case I think you should try plotting (or otherwise
inspecting) the predictions from both models, to understand what
they're saying.
This is the summary of the binomial glmm:

Family: binomial  ( logit )
Formula:          nDesc/sumDesc ~ parQuality * sex + (1 | NF) + (1 | YEAR)
Zero inflation:                  ~1
Data: ind
Weights: sumDesc

     AIC      BIC   logLik deviance df.resid
  6841.9   6880.8  -3414.0   6827.9     1887

Random effects:

Conditional model:
 Groups  Name        Variance  Std.Dev.
 NF (Intercept) 3.646e+01 6.0381644
 YEAR    (Intercept) 1.090e-08 0.0001044
Number of obs: 1894, groups:  NF, 719; YEAR, 24

Conditional model:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     7.386051   0.877055   8.421  < 2e-16 ***
parQuality        -0.007452   0.013641  -0.546  0.58483
sexmal         -0.649383   0.235775  -2.754  0.00588 **
parQuality:sexmal  0.009153   0.004007   2.284  0.02235 *
---
Signif. codes:  0 ?***? 0.001 ?**? 0.01 ?*? 0.05 ?.? 0.1 ? ? 1

Zero-inflation model:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.03268    0.04625  -0.706     0.48

This would make biological sense to me because I would expect that sons
("sexmal") are advantaged relative to their sisters when the parental
quality is high. However, may I trust this result? is there something that
you see I am missing?
Best,

Simone

2017-05-02 18:31 GMT+02:00 Ben Bolker <bbolker at gmail.com>:

On 17-05-02 12:20 PM, simone santoro wrote:
Hi all,

I am trying to test a hypothesis regarding the different contribution of
sons and daughters to parents? fitness. I have a number of (bird) nests
of
which I have measured a feature of parents related to their quality
(continuous variable) that I hypothesize affects the future lifetime
fecundity of their sons and daughters.

Specifically, my hypothesis is that at high values of parents? quality
sons
will be more fecund than sisters through their entire life and vice
versa,
at low values of parents? quality, daughters will be more than brothers.

Note that sons and daughters of a nest, of which I have recorded their
lifetime fecundity, are born all the same year. Thus, year of birth (of
sons and daughters) is a random intercept I want to control for as it is
the nest identity. The data set may be arranged in two ways, one that
considers a row for each nest and another that considers a row for each
offspring (son or daughter).

In case 1 (row = nest), I have these variables: FN, family name; YEAR,
birth year of sons and daughter; nDescBySons, lifetime total number of
progeny generated by sons (pooled);  nDescByDaughs, lifetime total
number
of progeny generated by daughters (pooled); nSons, number of sons;
nDaughs,
number of daughters; parQuality, parents? quality.

In case 2 (row = son or daughter), I have these variables: FN, family
name;
YEAR, birth year of sons and daughter; nDesc, lifetime total number of
progeny generated by the individual; sex, son or daughter; nestSize,
total
number of sons and daughters at nest; parQuality, parents? quality.

In a way, I think that the second arrangement of data is easier to be
analyzed for testing my hypothesis (comment/suggestion on this?). In
this
way I have direct information on the individual-level lifetime fecundity
of
sons and daughters and have not necessarily to take care of how many
sons
and daughters were at the nest.
However, I have lot of zeros (many sons and daughters disappear ? die or
emigrate - and have no recorded descendants at all) and data have a kind
of
bimodal distribution after the zero mode (see below image):

https://drive.google.com/open?id=0BwsTfIcebsrOZnljSW9uQXF2UU0

Thus, I would use a zero-inflated GLMM as, for instance, by using
glmmTMB
package in R. Something like this:

glmmTMB(nDesc ~ parQuality*sex+(1|NF)+(1|YEAR),?, zi~1)

But, what about that ?ugly? multimodal distribution? I thought I may try
different distributions (e.g. poisson, compois, any other?) and compare
the
model fit by looking at the AIC.

Any advice on this would be extremely appreciated.

Simone

   My main thought is that your plots show the *marginal* distribution
of the data.  Differences among families/years or odd shapes of the
parental quality distribution could drive this pattern without any need
to assume the *conditional* distribution is multimodal.  Fit a sensible
model (like the one you suggest) and then check diagnostics in various
ways (if you have enough data, you could consider interactions between
sex and parental quality and the random effects -- e.g. does parental
quality matter more in some birth years than others?)

_______________________________________________
R-sig-mixed-models at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models