I was hoping that someone well versed in the theory at the interface
of statistics and machine learning would take over, but since there
were no responders I'll give it a go, relying heavily on a quick
re-reading of Ch 7 of:
@book{hastie2001esl,
  title     = {{The Elements of Statistical Learning: Data Mining, Inference, and Prediction}},
  author    = {Hastie, T. and Tibshirani, R. and Friedman, J.},
  year      = {2001},
  publisher = {Springer}
}
I'll make a few comments in-line below, and then discuss some of the
main issues as I understand them. I'll try to wrap it all up so we
stay relevant to the original question.
On Fri, May 30, 2008 at 9:15 PM, David Hewitt <dhewitt37 at gmail.com> wrote:
> We've mostly gotten out of the area where I know enough statistically to speak with confidence, but I'll risk some lumps anyway... I always thought that the idea of retaining a portion of the data for validation was a good idea. I asked David Anderson about this personally and he said he couldn't see any reason to do that. Using likelihood, he thought the best approach was to use all the data to determine the best model.
I agree that all of the data should be used to fit the best model, but ideally not all of it used to select the best model.
> I'm pretty muddy on the difference between selecting a good model with AIC (which is sometimes referred to as being predictive in nature) and what is meant by post-hoc validation of predictive ability (aside from testing on another data set). I've often seen the "leave-one-out" approach used to "validate" a model. If anyone has a good reference that differentiates the two with an example, I'd really appreciate it.
The leave-one-out approach is a poor choice for model assessment
because the n training sets are nearly identical, so the n fits are
highly correlated and their averaged error estimate has high variance.
A good reference for these issues is the Hastie et al. book cited
above. For a more practical S/R approach, with less focus on machine
learning/data mining and more on the classes of models commonly used
in ecology, there is some useful validation material in
@book{harrell2001rms,
  title     = {{Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis}},
  author    = {Harrell, F.E.},
  year      = {2001},
  publisher = {Springer}
}
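To make the leave-one-out mechanics concrete, here is a minimal sketch -- in plain Python rather than S/R only so it is fully self-contained, with all function names invented for the example -- showing leave-one-out as the special case of k-fold cross-validation with k = n:

```python
# Illustrative sketch (names invented here, not from any package in
# this thread): k-fold CV for a straight-line model; k == n gives
# leave-one-out, where each training set differs by only one point.
import random

def make_data(n, seed):
    rng = random.Random(seed)
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    # ordinary least squares for a line: returns (intercept, slope)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return my - (sxy / sxx) * mx, sxy / sxx

def cv_mse(xs, ys, k):
    # k-fold CV estimate of mean squared prediction error
    n = len(xs)
    folds = [list(range(n))[i::k] for i in range(k)]
    sse = 0.0
    for fold in folds:
        hold = set(fold)
        a, b = fit_line([x for i, x in enumerate(xs) if i not in hold],
                        [y for i, y in enumerate(ys) if i not in hold])
        sse += sum((ys[i] - (a + b * xs[i])) ** 2 for i in fold)
    return sse / n

xs, ys = make_data(30, seed=1)
loo = cv_mse(xs, ys, k=30)   # leave-one-out
k5 = cv_mse(xs, ys, k=5)     # 5-fold
```

With k = n the n training sets overlap in all but one observation, which is the source of the correlated fits mentioned above.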
On Sun, Jun 1, 2008 at 7:19 AM, Ruben Roa Ureta <rroa at udec.cl> wrote:
> I think it is a matter of principles. In my view statistical inference theory only covers estimation of parameters and prediction of new data GIVEN a model, whereas model selection requires a larger theory. The AIC fits very well in this view since Akaike's theorem joins statistical inference theory with information theory. These two theories together provide the tools to make model selection (or model identification, sensu Akaike).
I'm not sure I understand how my comments about validating a model (or an ensemble of models) intended to have predictive ability fit into this.
> I agree with Anderson that I would use always all my data to best fit my model with the likelihood. Cross-validation is ad hoc whereas the AIC is grounded on solid theory.
Yes, I agree that all of the data should be used in *fitting* the best model (regardless of whether you are using a likelihood-based approach). I do not agree that cross-validation lacks a grounding in solid theory -- there is an abundance of theory, much of it developed by statisticians (including Brad Efron, Seymour Geisser, and many others cited in the references given above).

More generally, I think it's worth distinguishing model selection from model assessment. AIC, AICc, BIC, Cp, etc. are model selection tools. We can qualify this even more, I believe, by saying that they are tools designed to compare relative estimated predictive ability for (as Hastie et al. say on pg 203) "a special class of estimates that are linear in the parameters". All of these tools can be shown to estimate the optimism caused by overfitting the data and then add that value to the observed error in the training data. Note that the optimism is the expected difference between the in-sample prediction error (i.e., error conditional on the observed values of the predictors) and the observed error in the training data. The cross-validation methods (including various bootstrap estimates of prediction error), on the other hand, directly estimate the true prediction error (not conditional on the observed values of the predictors). For model selection it is reasonable to estimate the in-sample error, because it is the relative differences in errors that matter, not their actual values; but for a general assessment of predictive accuracy, a direct estimate of the "extra-sample error" via the cross-validation and bootstrap methods is generally better. Another issue to keep in mind is that the information criteria are based on a likelihood and so come with a suite of assumptions, whereas cross-validation is nonparametric.

Now to bring this all back to the original question.
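A toy illustration of the optimism (again a plain-Python sketch with invented names, not code from any package in this thread): fit a line to a small training set, then compare its error on the training data with its error on fresh data from the same process. Averaged over replications, the gap is the optimism; for OLS with p parameters and noise variance sigma^2 it is roughly 2*p*sigma^2/n -- exactly the sort of penalty the information criteria add to the training error.

```python
# Toy illustration: training error understates prediction error;
# the expected gap (optimism) here should be near
# 2 * p * sigma^2 / n = 2 * 2 * 0.25 / 20 = 0.05.
import random

def dataset(n, rng):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def mse(a, b, xs, ys):
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
n, gaps = 20, []
for _ in range(500):
    tx, ty = dataset(n, rng)        # small training set
    a, b = fit_line(tx, ty)
    fx, fy = dataset(2000, rng)     # fresh data from the same process
    gaps.append(mse(a, b, fx, fy) - mse(a, b, tx, ty))

optimism = sum(gaps) / len(gaps)    # positive: training error is optimistic
```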
The poster stated that he had selected a model via AIC tables and expressed a desire to determine how "good" the model was. In my experience AIC-type tables are often used by folks who don't have a good understanding of what's going on under the hood (clearly, not many people have the time and energy required to really understand the models, the probability distributions and likelihoods, the assumptions, the connection to information theory or to Bayesian priors, posteriors, and ratios thereof, etc.). A common mistake is to assume that if a model has a "good" *IC score relative to the other models in the list, it is a "good" model.

Ben Bolker gave some good advice for checking how the model is doing: the GoF on the global model, the distributions of errors within groups, linearity, leverages, outliers, etc. There are plenty of assumptions that come along with the modeling process, and it is up to the modeler to demonstrate that the model meets them adequately (for some definition of adequately). My point was just to add that if the model is intended to have predictive ability, there are tools out there to assess that ability. Unfortunately there is no one-size-fits-all algorithm for doing so. I mentioned that ideally there is enough data so that, if the validation tools are going to affect the final choice of a model, the data can be split into three groups: training, validation, and test. It seems these days there are more and more very large datasets for which this luxury is feasible.
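A minimal sketch of the three-way split (plain Python for self-containedness; the fractions and names are illustrative, and in practice stratifying on important covariates is often worth the extra effort):

```python
# Illustrative three-way split: the validation set guides model
# choice, and the test set is touched only once, at the very end,
# to report predictive ability.  Fractions are arbitrary examples.
import random

def three_way_split(records, seed, fracs=(0.5, 0.25, 0.25)):
    # shuffle once, then cut into training / validation / test pieces
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fracs[0] * n)
    n_val = int(fracs[1] * n)
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validate, test

train, validate, test = three_way_split(list(range(1000)), seed=42)
```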
How large is large enough is completely context-dependent, but there is almost always a law of diminishing returns with sample size: going from a sample size of 1 to 2 cuts the variance of x-bar in half, whereas going from 100 to 101 barely changes it. So at some point, holding out enough independent data to get a low-variance estimate of predictive ability (again, how you define predictive ability is context-dependent) is the 'right' thing to do. Even if you don't have that luxury, for the reasons described above, an internal cross-validation technique such as the tools offered in Harrell's Design package or the errorest function in ipred (a search of R resources for 'cross-validation' will reveal others) can often produce very helpful estimates of the predictive ability of your model.

All that said, I'll end by throwing in my opinion that if the goal is prediction, rather than inference and interpretation of model parameters, I would probably not use an AIC-type table. Model averaging with an AIC table helps, but there are usually better ways. The 'right' tool depends on the type of predictions wanted, but here are a few packages I like: gbm, mboost, nnet, randomForest, and e1071. There is also a task view for machine learning:

http://cran.r-project.org/web/views/MachineLearning.html

Finally, here's a fun real-world application of predictive tools (they're getting pretty close to the US$1Mil prize):

http://www.netflixprize.com/leaderboard

I was happy to see that the folks at the top are at least as much statisticians as they are computer scientists ;-)

best,
Kingsford Jones