informal conventions/checklist for new predictive modeling packages
I agree with almost all of these, except the last point. Since I have
participated in some wheel-reinvention lately, I agree with the bulk of
your comment. I don't think the fix is as easy as you suggest, though:
RSiteSearch won't help me find a function I need when I don't know the
magic words. Some R functions have such unexpected names that only a
fastidious source-code reader would ever find them ("pretty", for
example). But I agree with your concern.

As far as the last point is concerned, though, I think you are
mistaken. Explanation below.
On Wed, Jan 4, 2012 at 8:19 AM, Max Kuhn <mxkuhn at gmail.com> wrote:
> (14) [OCD] For binary classification models, model the probability of
> the first level of a factor as the event of interest (again, for
> consistency). Note that glm() does not do this, but most others use
> the first level.
When the DV is thought of as 0 and 1, where 1 is an "event", "success",
or "win" and 0 is a "non-event", "failure", or "loss", then if there is
to be a single predicted probability, I want it to be the probability
of the higher outcome. glm() is doing the thing I want, and I don't
know of others that go the other way, except PROC LOGISTIC in SAS, and
that one has a long history of causing confusion and despair. (A small
illustration follows below.)

I'd like to suggest adding one thing to your list, though. I have
wished (on this list and elsewhere) that there were a more regular
approach for constructing the "newdata" objects that are passed to
predict(). Many packages have re-invented this (datadist in rms, the
effects package), and almost nobody here agreed with my wish for a more
standard approach. But if there were a standard approach, it would be
much easier to hold up R as an alternative to Stata when users pop up
with "marginal effects tables" from Stata that are very difficult to
reproduce in R; a sketch of the re-invented pattern is below.
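A minimal illustration of the convention at issue, using made-up data
(the levels "loss"/"win" are just for this example): with a two-level
factor response, glm() treats the first level as the non-event and
models the probability of the second.

## Made-up data: a two-level factor outcome, levels c("loss", "win").
set.seed(1)
d <- data.frame(
  x = rnorm(100),
  y = factor(sample(c("loss", "win"), 100, replace = TRUE),
             levels = c("loss", "win"))
)

## glm(family = binomial) models P(y == second level), here "win".
fit <- glm(y ~ x, data = d, family = binomial)
head(predict(fit, type = "response"))  # fitted P(y == "win")
levels(d$y)[2]                         # "win", the modeled event

This matches the documented behavior of the binomial family: the first
factor level denotes failure and all others success.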
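And here is a hedged sketch of the "newdata" dance that datadist (in
rms) and the effects package each automate in their own way; the
variable names and the choice of "typical" values are hypothetical, not
any package's actual API. The idea: vary the predictor of interest over
its range, hold the others fixed, and hand the grid to predict().

set.seed(2)
d2 <- data.frame(
  x1 = rnorm(100),
  x2 = factor(sample(c("a", "b"), 100, replace = TRUE)),
  y  = factor(sample(c("no", "yes"), 100, replace = TRUE))
)
fit2 <- glm(y ~ x1 + x2, data = d2, family = binomial)

## Hand-rolled "newdata": x1 sweeps its observed range, x2 is held
## at an arbitrary reference level. Every package picks these
## "typical" values differently, which is exactly the problem.
nd <- expand.grid(
  x1 = seq(min(d2$x1), max(d2$x1), length.out = 25),
  x2 = factor("a", levels = levels(d2$x2))
)
nd$p_yes <- predict(fit2, newdata = nd, type = "response")
head(nd)

If something like this were standardized, a Stata-style marginal
effects table would be a few lines of shared code instead of one more
per-package re-invention.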
Regards,
pj

Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas