Statistical consultation GLMM - R-SIG-mixed-models

Thu, Jul 22, 2021 4:45 PM
Hi,

 My name is Estefanía; I am doing a master's degree in Marine Ecology. As
part of the project, we are working with shorebird count data collected
along the coast of California and northwestern Mexico. The surveys follow
a standardized monitoring protocol. Sampling units have been established
at each site; they are polygons whose sizes vary from site to site. The
birds present in each unit were counted once every winter from 2011 to
2019. Because these birds tend to congregate, many units have zero counts
while some have abundances of 1000 birds or more, so the data are far
from normally distributed. We therefore use Generalized Linear Mixed
Models (GLMMs) to account for the variability in bird abundance from site
to site and from sampling unit to sampling unit.
The objective of my work is to estimate the population trend of three
shorebird species (analyzed separately), to test whether abundance is
related to environmental variables (mean, minimum, and maximum
temperature, and precipitation), and to test whether there are differences
between regions. The sites were grouped into three regions: California,
the Baja California peninsula, and a third region in northwestern Mexico
that we call Continental.
Initially, I tested which distribution family fit the data best by
comparing Poisson, zero-inflated Poisson, negative binomial, and
zero-inflated negative binomial models, which are the most common choices
for count data. The zero-inflated negative binomial had the lowest AIC.
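
A minimal sketch of how such a family comparison could be run with glmmTMB, assuming the same fixed and random effects as in the fit given at the end of this message. The function name compare_families and its argument dat are hypothetical; the formula actually used for the comparison may have differed.

```r
# Sketch only: fit the four candidate families with one formula and rank
# them by AIC. `dat` is assumed to contain the columns named in this message.
compare_families <- function(dat) {
  form <- total ~ logha + YearCollected + Geopolitical + tmp + tmn + tmx + pre +
    (1 | Site/Plot)

  # ziformula = ~0 means no zero-inflation; ~1 adds a single
  # zero-inflation probability shared by all observations.
  fits <- list(
    poisson = glmmTMB::glmmTMB(form, ziformula = ~0,
                               family = stats::poisson(), data = dat),
    zi_poisson = glmmTMB::glmmTMB(form, ziformula = ~1,
                                  family = stats::poisson(), data = dat),
    nbinom2 = glmmTMB::glmmTMB(form, ziformula = ~0,
                               family = glmmTMB::nbinom2(), data = dat),
    zi_nbinom2 = glmmTMB::glmmTMB(form, ziformula = ~1,
                                  family = glmmTMB::nbinom2(), data = dat)
  )

  # One row per model, lowest AIC first
  aic <- sapply(fits, stats::AIC)
  data.frame(model = names(aic), AIC = unname(aic))[order(aic), ]
}
```

With the data frame from this message it would be called as compare_families(mc2). The AIC values are comparable here only because all four models use the same response and the same rows of data.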
Knowing that the predictor variables could be correlated, I calculated
their pairwise correlations. Because the correlation between year and the
environmental variables was low (< 0.30), we decided for the time being to
fit a single model that includes year together with the environmental
variables. We also decided to include the size of each sampling unit
(logarithm of the hectares), since it differs between units and we want to
take that into account. Region is included as a factor with 3 levels.
However, the temperature variables are highly correlated with each other,
and they are the variables we are interested in. This is where I have
several questions, because my background is not in statistics:
1.- Should I avoid including the environmental variables in a single model
because they are correlated, even though they are of interest?
2.- Is what I am doing correct?
3.- How do I know whether the model fits the data well? How do I test it?
4.- How do I select the best model?
5.- What assumptions should I test?
6.- Am I missing something obvious?

I did all of the above with the glmmTMB package in RStudio.
Thank you very much and sorry in advance if these are very basic questions.

The model I have fitted so far is:
m2znb.all <- glmmTMB(total ~ logha + YearCollected + Geopolitical + tmp + tmn
                     + tmx + pre + (1 | Site/Plot),
                     ziformula = ~1, data = mc2, family = "nbinom2")
where:
total is the abundance of a species of shorebird
logha is the size of the unit (logarithm of the hectares)
YearCollected is the survey year
Geopolitical is the region
tmp is the mean temperature
tmn is the minimum temperature
tmx is the maximum temperature
pre is the precipitation
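
A minimal sketch of the correlation check described earlier, using the variable names listed above and base R only. The helper predictor_cor, the toy data frame, and the |r| > 0.7 screening threshold are assumptions for illustration; the toy values are simulated and are not the survey data.

```r
# Pairwise Pearson correlations among the numeric predictors.
# `dat` is assumed to be a data frame with the columns named in this message.
predictor_cor <- function(dat,
                          vars = c("YearCollected", "logha",
                                   "tmp", "tmn", "tmx", "pre")) {
  round(cor(dat[, vars], use = "pairwise.complete.obs"), 2)
}

# Simulated toy data (NOT the real counts), only to show that the call runs.
set.seed(42)
toy <- data.frame(
  YearCollected = rep(2011:2019, each = 20),
  logha = rnorm(180),
  tmp = rnorm(180, mean = 18, sd = 3),
  pre = rgamma(180, shape = 2, rate = 0.1)
)
# Minimum and maximum temperature are built from the mean, so the three
# temperature variables are strongly correlated by construction.
toy$tmn <- toy$tmp - 5 + rnorm(180, sd = 0.5)
toy$tmx <- toy$tmp + 5 + rnorm(180, sd = 0.5)

cor_toy <- predictor_cor(toy)

# List the predictor pairs above the (arbitrary) screening threshold.
high <- which(abs(cor_toy) > 0.7 & upper.tri(cor_toy), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_toy)[high[, "row"]],
  var2 = colnames(cor_toy)[high[, "col"]],
  r = cor_toy[high]
)
print(high_pairs)
```

On the real data the call would be predictor_cor(mc2). Pairs flagged this way are candidates for being entered one at a time rather than together.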

I would be able to share the data if needed.

Regards, Estefanía.