Skip to content
Prev 18128 / 20628 Next

bam model selection with 3 million data

Dear list,

I?m investigating the effect of three variables (X, Y, Z) on the
probability that an animal uses a particular habitat A. I have a time
series of relocations for each animal (>300 individuals), with one
relocation every 30 minutes. There are only two options for the response
variable: 1=present in habitat A, 0=not present in habitat A. The effects
of the three variables are expected to be non-linear so I?m using gam
models. My dataset is very large, with >3 million data points so I?m using
the bam function from the mgcv library in R. In my models I include a
random effect ?individual ID?, and a temporal autocorrelation term that
corrects much but not all of the autocorrelation in the models.

*Question 1.*

When I run a model with the three main effects (X, Y, Z) and the three
double interactions (X:Y, X:Z, Y:Z), I get that all terms are highly
significant, except for one interaction. If I remove it, then everything is
highly significant. However, I also wanted to run simpler models with only
one interaction, no interactions, only two main effects and only one main
effect. Then, if I compare all these models with AIC or BIC, I get that the
best model (by far) is the one with only main effects.
AIC(codcoaAR2,codcoaAR2.1,codcoaAR2.2,codcoaAR2.3,codcoaAR2.4,codcoaAR2.5,codcoaAR2.6,codcoaAR2.7,codcoaAR2.8,codcoaAR2.9,codcoaAR)

                  df      AIC

codcoaAR2   306.1310 -1442543

codcoaAR2.1 293.1608 -1440642

codcoaAR2.2 292.9615 -1438219

codcoaAR2.3 294.3657 -1435346

codcoaAR2.4 284.0026 -1434286

codcoaAR2.5 280.3472 -1396765

codcoaAR2.6 279.6380 -1435862

codcoaAR2.7 269.4968 -1377806

codcoaAR2.8 269.0480 -1393897

codcoaAR2.9 281.8584 -1214270

codcoaAR    271.7066 -2353481  # model with only main effects



I wonder how this is possible if two of the interactions are highly
significant.

So my underlying question is: *for a model like this in which sample size
is huge, should I make model selection looking at the significance of the
different terms in the model, or should I rather look at AIC/BIC?*

*Question 2.*

Let?s assume the model with only main effects is indeed the optimal one.
Then I?d like to get the effect size of each explanatory variable. It?s not
clear to me how to do it even after reading some post on this and other
forums, but I tried to figure it out by sequentially running the model
without one explanatory variable at a time, and then comparing the deviance
explained in the optimal model with X, Y, Z with the deviance explained
with the reduced model with only Y and Z, for instance. Assuming that the
difference would the variance explained by X. *Is this correct? *Looking at
the results, the deviance explained by each variable X, Y, Z is quite low,
but if the three main effects explain so little variance, who is explaining
the rest?

Model

Deviance explained

X, Z, Y

69.3%

Y, Z

68.5%

X, Z

69.3%

X, Y

60.5%



*Question 3.*

In my models I usually get this error message:

Warning message:

In bgam.fitd(G, mf, gp, scale, nobs.extra = 0, rho = rho, coef = coef,  :

  fitted probabilities numerically 0 or 1 occurred



which seems to indicate that there is perfect separation in my logistic
regression. I?m not sure this is the case in my data, how could I check it
and correct for it if needed? Should it be always corrected?



Thanks for your help,

David