Improving residual distribution in glmmTMB
By default (the re.form argument to predict() is set to NULL), the predictions include the random effects. So a 'newdata' data frame like: ELs Document District 20 1 NA 20 2 NA 20 3 NA would give you the predicted values for document 1, 2, 3, given ELs = 20, and making a population-level prediction for District (i.e. *not* using the District RE/setting this info to 0)
On 2023-05-15 8:46 p.m., Timothy MacKenzie wrote:
I'm likely misunderstanding your recommendation for using predict()
for each Document. If our model has no fixed effect for Document, I'm
wondering how then we can predict() how often each document is
downloaded?
Do you mean averaging across predictions for each category of Document as in:
library(dplyr)
data.frame(d, fitted = predict(model, type = "response")) %>%
group_by(Document) %>% mutate(ave_downs = mean(downs))
Tim M
model <- glmmTMB(downs ~ I(ELs/20) +
## Document +
(1|District)+
(1|Document),
family=truncated_nbinom2(),
data = d)
On Mon, May 15, 2023 at 6:18?PM Ben Bolker <bbolker at gmail.com> wrote:
In general it doesn't make sense to include any term as both a fixed
effect and a random-effect grouping variable (with a few exceptions).
There's no reason you can't examine the model predictions of how
often each document is downloaded (i.e., use predict() with each
document and whatever baseline covariate settings you want)
On 2023-05-15 6:52 p.m., Timothy MacKenzie wrote:
Thank you, Ben, that's very helpful. I noticed that you removed the fixed effect for "Document", that's, I'm guessing, to improve the model fit (ex. avoiding singularity)? I was hoping to explore how many times, on average, each unique Document was downloaded. Looks like that may not be possible. Thanks, Tim M On Mon, May 15, 2023 at 9:37?AM Ben Bolker <bbolker at gmail.com> wrote:
I don't have any perfect solution, and I don't necessarily recommend
the brute-force approach tried below, but a truncated negative binomial
(2/quadratic parameterization) seems best. The genpois and nbinom1 are
actually wonkiest, in terms of appearance of residuals.
The COM-Poisson below is *very* slow, you might want to skip it (it
took 20 minutes to fit on 16 cores, and the results are not that great
anyway).
I thought about implementing a Poisson-inverse-Gamma to allow for
heavier tails, but haven't gotten around to it yet (it's not trivial ...)
There seems to be something wonky about the reported sigma/Pearson
residuals from truncated nbinom2, but the results are OK ...
### Reproducible data and code:
d <- read.csv("https://raw.githubusercontent.com/fpqq/w/main/b.csv")
library(glmmTMB)
library(purrr)
library(broom.mixed)
library(dplyr)
library(DHARMa)
model <- glmmTMB(downs ~ I(ELs/20) +
## Document +
(1|District) +
(1|Document),
family=genpois(),
# ziformula = ~ Document,
# dispformula = ~ Document,
data = d)
## very slow ...
system.time(model2 <- update(model, family = "compois",
control = glmmTMBControl(parallel = 16)))
## user system elapsed
## 17194.076 11.103 1134.367
## truncated genpois doesn't work for some reason ...
fam <- c("genpois", "poisson", "nbinom1", "nbinom2")
fam <- c(fam, paste0("truncated_", fam[-1]))
mod_list <- lapply(fam,
function(f) {
cat(f, "\n")
update(model, family = f)
})
names(mod_list) <- fam
mod_list[["compois"]] <- model2
aictab <- (map_dfr(mod_list, glance, .id = "family")
|> select(-c(nobs,logLik, BIC))
|> mutate(across(c(AIC, deviance), ~ . - min(., na.rm = TRUE)))
|> arrange(AIC)
)
## family sigma AIC df.residual deviance
## <chr> <dbl> <dbl> <int> <dbl>
## 1 truncated_nbinom2 3.66e-8 0 569 NA
## 2 nbinom2 1.39e+0 665. 569 0
## 3 compois 5.29e+9 682. 569 NA
## 4 genpois 1.33e+1 768. 569 NA
## 5 nbinom1 8.11e+0 981. 569 191.
## 6 truncated_poisson 1 e+0 1996. 570 NA
## 7 poisson 1 e+0 2553. 570 2077.
## 8 truncated_nbinom1 7.84e+0 NA 569 NA
tnb <- mod_list[["truncated_nbinom2"]]
## Pearson resid calc messed up ???
plot(resid(tnb,type="pearson") ~ fitted(tnb))
plot(d$downs, predict(tnb, type = "response"), log = "xy")
abline(a=0, b=1, col = 2)
## reorder
mod_list <- mod_list[aictab$family]
op <- par(mfrow = c(3, 3), mfrow = c(3,2,1,1), las = 1)
lapply(mod_list,
function(m) plot(resid(m,type="pearson") ~ fitted(m),
xlab = "", ylab = "", main = family(m)$family)
)
## DHARMa
ss <- simulateResiduals(tnb)
plot(ss)
plotResiduals(ss, form = d$ELs/20)
On 2023-05-13 11:55 p.m., Timothy MacKenzie wrote:
Greetings,
I'm modeling the number of downloads (downs) of certain forms
(Document) by a number of school districts (District) of various
student sizes (ELs).
I've tried genpois(), nbinom1(), and nbinom2() families but the fitted
vs. residual doesn't look promising.
In glmmTMB, are there any additional ways to improve the residual
distribution for count-type data?
Thank you,
Tim M
### Reproducible data and code:
d = read.csv("https://raw.githubusercontent.com/fpqq/w/main/b.csv")
model = glmmTMB(downs ~ I(ELs/20) + Document +
(1|District) +
(1|Document),
family=genpois(),
# ziformula = ~ Document,
# dispformula = ~ Document,
data = d)
plot(resid(model,type="pearson") ~ fitted(model))
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- Dr. Benjamin Bolker Professor, Mathematics & Statistics and Biology, McMaster University Director, School of Computational Science and Engineering (Acting) Graduate chair, Mathematics & Statistics
> E-mail is sent at my convenience; I don't expect replies outside of
working hours.
_______________________________________________ R-sig-mixed-models at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
-- Dr. Benjamin Bolker Professor, Mathematics & Statistics and Biology, McMaster University Director, School of Computational Science and Engineering (Acting) Graduate chair, Mathematics & Statistics
> E-mail is sent at my convenience; I don't expect replies outside of
working hours.
Dr. Benjamin Bolker Professor, Mathematics & Statistics and Biology, McMaster University Director, School of Computational Science and Engineering (Acting) Graduate chair, Mathematics & Statistics > E-mail is sent at my convenience; I don't expect replies outside of working hours.