Skip to content

Improving residual distribution in glmmTMB

7 messages · Ben Bolker, Timothy MacKenzie

#
Greetings,

I'm modeling the number of downloads (downs) of certain forms
(Document) by a number of school districts (District) of various
student sizes (ELs).

I've tried genpois(), nbinom1(), and nbinom2() families but the fitted
vs. residual doesn't look promising.

In glmmTMB, are there any additional ways to improve the residual
distribution for count-type data?

Thank you,
Tim M
### Reproducible data and code:
d = read.csv("https://raw.githubusercontent.com/fpqq/w/main/b.csv")

model = glmmTMB(downs ~ I(ELs/20) + Document +
                           (1|District) +
                           (1|Document),
                         family=genpois(),
                     #  ziformula = ~ Document,
                  # dispformula = ~ Document,
                         data = d)

plot(resid(model,type="pearson") ~ fitted(model))
1 day later
#
I don't have any perfect solution, and I don't necessarily recommend 
the brute-force approach tried below, but a truncated negative binomial 
(2/quadratic parameterization) seems best. The genpois and nbinom1 are 
actually wonkiest, in terms of appearance of residuals.
   The COM-Poisson below is *very* slow, you might want to skip it (it 
took 20 minutes to fit on 16 cores, and the results are not that great 
anyway).
   I thought about implementing a Poisson-inverse-Gamma to allow for 
heavier tails, but haven't gotten around to it yet (it's not trivial ...)
   There seems to be something wonky about the reported sigma/Pearson 
residuals from truncated nbinom2, but the results are OK ...

### Reproducible data and code:
d <- read.csv("https://raw.githubusercontent.com/fpqq/w/main/b.csv")
library(glmmTMB)
library(purrr)
library(broom.mixed)
library(dplyr)
library(DHARMa)

model <- glmmTMB(downs ~ I(ELs/20) +
                     ## Document +
                     (1|District) +
                     (1|Document),
                 family=genpois(),
                      #  ziformula = ~ Document,
                   # dispformula = ~ Document,
                 data = d)

## very slow ...
system.time(model2 <- update(model, family = "compois",
                              control = glmmTMBControl(parallel = 16)))

##      user    system   elapsed
## 17194.076    11.103  1134.367

## truncated genpois doesn't work for some reason ...
fam <- c("genpois", "poisson", "nbinom1", "nbinom2")
fam <- c(fam, paste0("truncated_", fam[-1]))
mod_list <- lapply(fam,
                    function(f) {
                        cat(f, "\n")
                        update(model, family = f)
})
names(mod_list) <- fam
mod_list[["compois"]] <- model2


aictab <- (map_dfr(mod_list, glance, .id = "family")
     |> select(-c(nobs,logLik, BIC))
     |> mutate(across(c(AIC, deviance), ~ . - min(., na.rm = TRUE)))
     |> arrange(AIC)
)

##   family              sigma   AIC df.residual deviance
##   <chr>               <dbl> <dbl>       <int>    <dbl>
## 1 truncated_nbinom2 3.66e-8    0          569      NA
## 2 nbinom2           1.39e+0  665.         569       0
## 3 compois           5.29e+9  682.         569      NA
## 4 genpois           1.33e+1  768.         569      NA
## 5 nbinom1           8.11e+0  981.         569     191.
## 6 truncated_poisson 1   e+0 1996.         570      NA
## 7 poisson           1   e+0 2553.         570    2077.
## 8 truncated_nbinom1 7.84e+0   NA          569      NA

tnb <- mod_list[["truncated_nbinom2"]]
## Pearson resid calc messed up ???
plot(resid(tnb,type="pearson") ~ fitted(tnb))

plot(d$downs, predict(tnb, type = "response"), log = "xy")
abline(a=0, b=1, col = 2)

## reorder
mod_list <- mod_list[aictab$family]
op <- par(mfrow = c(3, 3), mfrow = c(3,2,1,1), las = 1)
lapply(mod_list,
        function(m) plot(resid(m,type="pearson") ~ fitted(m),
                         xlab = "", ylab = "", main = family(m)$family)
        )


## DHARMa
ss <- simulateResiduals(tnb)
plot(ss)
plotResiduals(ss, form = d$ELs/20)
On 2023-05-13 11:55 p.m., Timothy MacKenzie wrote:

  
    
#
Thank you, Ben, that's very helpful. I noticed that you removed the
fixed effect for "Document", that's, I'm guessing, to improve the
model fit (ex. avoiding singularity)? I was hoping to explore how many
times, on average, each unique Document was downloaded. Looks like
that may not be possible.

Thanks,
Tim M
On Mon, May 15, 2023 at 9:37?AM Ben Bolker <bbolker at gmail.com> wrote:
#
In general it doesn't make sense to include any term as both a fixed 
effect and a random-effect grouping variable (with a few exceptions).

   There's no reason you can't examine the model predictions of how 
often each document is downloaded (i.e., use predict() with each 
document and whatever baseline covariate settings you want)
On 2023-05-15 6:52 p.m., Timothy MacKenzie wrote:

  
    
#
I'm likely misunderstanding your recommendation for using predict()
for each Document. If our model has no fixed effect for Document, I'm
wondering how then we can predict() how often each document is
downloaded?

Do you mean averaging across predictions for each category of Document as in:

library(dplyr)
data.frame(d, fitted = predict(model, type = "response")) %>%
  group_by(Document) %>% mutate(ave_downs = mean(downs))

Tim M

model <- glmmTMB(downs ~ I(ELs/20) +
                    ## Document +
                   (1|District)+
                   (1|Document),
                 family=truncated_nbinom2(),
                 data = d)
On Mon, May 15, 2023 at 6:18?PM Ben Bolker <bbolker at gmail.com> wrote:
#
By default (the re.form argument to predict() is set to NULL), the 
predictions include the random effects.  So a 'newdata' data frame like:

ELs   Document  District
20    1         NA
20    2         NA
20    3         NA

would give you the predicted values for document 1, 2, 3, given ELs = 
20, and making a population-level prediction for District (i.e. *not* 
using the District RE/setting this info to 0)
On 2023-05-15 8:46 p.m., Timothy MacKenzie wrote:

  
    
#
Many thanks, I wasn't aware of this. Looking forward to one day trying
the same data with the family=Poisson-inverse-Gamma().

Much appreciated,
Tim M
On Mon, May 15, 2023 at 7:52?PM Ben Bolker <bbolker at gmail.com> wrote: