Prediction/classification & variable selection

1 message · Voeten, C.C.

Dear Daniel,

Please keep the list in cc.

I know exactly what you mean about writing a loop and manually extracting AIC/BIC... that is exactly how I ended up writing the buildmer package, which automates precisely that. Tim already suggested MuMIn to you as well; you would need to see which of the many packages (lmerTest::step also comes to mind) serves your needs best. Buildmer's advantage is that it first tries to build up the maximal model from zero before doing backward elimination, so if your maximal model isn't capable of converging, buildmer will automatically give you the largest subset of it that does converge (and will then remove terms that are not significant in backward elimination based on LRT/AIC/BIC/etc.). I can't comment on the other packages; you would really need to experiment to see which of them works best for your own purposes.
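For concreteness, a minimal buildmer sketch (the formula, data frame, and variable names are made up for illustration; check ?buildmer and ?buildmerControl, as the argument interface has changed across versions):

```r
library(buildmer)

# Hypothetical example: start from the maximal model; buildmer first
# orders terms and finds the largest subset that converges, then does
# backward elimination using the chosen criterion.
m <- buildmer(
  response ~ condition * group + (condition | subject) + (1 | item),
  data = mydata,
  buildmerControl = buildmerControl(
    direction = c("order", "backward"),  # build up, then prune
    crit = "LRT"                         # or "AIC" / "BIC"
  )
)
summary(m)  # the final converged, pruned model
```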

Re multiple testing: in my view, for hypothesis testing based on p-values, you are only using one model, which you just happened to have done some pruning on first. In that sense, you wouldn't need to apply any corrections. However, it is also well known that model selection will amplify spurious effects, and I do see where authors like Hastie & Tibshirani are coming from when they say in their book that the standard errors of a pruned model are invalid because they don't take the selection procedure into account. Ultimately, this is a highly contested issue, and you'd best either follow whatever is customary in your field, or use some kind of simulation-based approach to obtain p-values that do take the selection procedure into account. (I wouldn't really know how, and I am not aware of any literature giving a clear recipe for that, but maybe others have ideas here.)
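One rough sketch of what such a simulation could look like, in the parametric-bootstrap spirit (all function and object names below are hypothetical placeholders, not a tested recipe): simulate data under the null, rerun the *entire* selection procedure on each simulated dataset, and compare the observed statistic to that null distribution.

```r
# Hypothetical sketch: null_model is a fitted model without the term of
# interest; run_full_selection() and extract_statistic() stand in for
# your own selection loop and for pulling out, e.g., the t-value of the
# term you care about.
n_sim <- 1000
null_stats <- replicate(n_sim, {
  y_sim <- simulate(null_model)[[1]]       # data under H0
  d <- transform(mydata, response = y_sim)
  m_sim <- run_full_selection(d)           # rerun the whole pipeline
  extract_statistic(m_sim)
})
mean(abs(null_stats) >= abs(observed_stat))  # selection-aware p-value
```

The key point is that the selection step is inside the loop, so the resulting p-value reflects the full procedure, not just the final model.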

For lasso/ridge, you definitely would not need to perform any manual corrections: there, selection takes place as part of the fitting process itself, so the resulting inferences will be valid in any case (well, barring of course the general issue of p-values in mixed models...).

Best,
Cesko