Logistic regression X^2 test with large sample size (fwd)
On Jul 31, 2012, at 10:25 AM, M Pomati wrote:
> Marc, thank you very much for your help. I've posted it on
> <http://math.stackexchange.com/questions/177252/x2-tests-to-compare-the-fit-of-large-samples-logistic-models>
> and added details.
I think you might have gotten a more statistically knowledgeable audience at: http://stats.stackexchange.com/ (And I suggested to the moderators at math-SE that it be migrated.)
David.

> Many thanks
>
> Marco
>
> --On 31 July 2012 11:50 -0500 Marc Schwartz <marc_schwartz at me.com> wrote:
>
>> On Jul 31, 2012, at 10:35 AM, M Pomati <Marco.Pomati at bristol.ac.uk> wrote:
>>
>>> Does anyone know of any X^2 tests to compare the fit of logistic models
>>> which factor out the sample size? I'm dealing with a very large sample and
>>> I fear the significant X^2 test I get when adding a variable to the model
>>> is simply a result of the sample size (>200,000 cases).
>>>
>>> I'd rather use the whole dataset instead of taking (small) random samples
>>> as it is highly skewed. I've seen things like Phi and Cramer's V for
>>> crosstabs but I'm not sure whether they have been used before on logistic
>>> regression, if there are better ones and if there are any packages.
>>>
>>> Many thanks
>>>
>>> Marco
>>
>> Sounds like you are bordering on some type of stepwise approach to
>> including or not including covariates in the model. You can search the list
>> archives for a myriad of discussions as to why that is a poor approach.
>>
>> You have the luxury of a large sample. You also have the challenge of
>> interpreting covariates that appear to be statistically significant, but
>> may have a rather small *effect size* in context. That is where subject
>> matter experts need to provide input as to interpretation of the contextual
>> significance of the variable, as opposed to the statistical significance of
>> that same variable.
>>
>> A general approach is to simply pre-specify your model based upon rather
>> simple considerations. Also, you need to determine if your goal for the
>> model is prediction or explanation.
>>
>> What is the incidence of your 'event' in the sample? If it is say 10%,
>> then you should have around 20,000 events. The rule of thumb for logistic
>> regression is to have around 20 events per covariate degree of freedom (df)
>> to minimize the risk of over-fitting the model to your dataset. A
>> continuous covariate is 1 df, a k-level factor is k-1 df. So with 20,000
>> events, your model could feasibly have 1,000 covariate df's. I am guessing
>> that you don't have that much independent data to begin with.
>>
>> So, pre-specify your model on the full dataset and stick with it. Interact
>> with subject matter experts on the interpretation of the model.
>>
>> BTW, this question is really about statistical modeling generally, not
>> really R specific. Such queries are best posed to general statistical
>> lists/forums such as Stack Exchange. I would also point you to Frank
>> Harrell's book, Regression Modeling Strategies.
>>
>> Regards,
>>
>> Marc Schwartz
>
> ----------------------
> M Pomati
> University of Bristol

David Winsemius, MD
Alameda, CA, USA
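
For anyone reading this thread later, the two bits of arithmetic in it can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions. The function names are made up for this note, and the 2x2 counts are hypothetical (a 10.5% versus 10% event rate), not taken from Marco's data. The first function reproduces Marc's events-per-df budget. The second uses a plain crosstab rather than a logistic model to show why the X^2 statistic grows with the sample size while Phi stays put when the proportions are unchanged.

```python
import math


def max_covariate_df(n_cases, event_rate, events_per_df=20):
    """Covariate df supportable under the events-per-df rule of thumb."""
    events = round(n_cases * event_rate)
    return events // events_per_df


def chi2_and_phi(a, b, c, d):
    """Pearson X^2 (no continuity correction) and Phi for the 2x2 table
    [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    phi = math.sqrt(chi2 / n)
    return chi2, phi


# Marc's example: 200,000 cases, 10% incidence, 20 events per df.
print("covariate df budget:", max_covariate_df(200_000, 0.10))

# Hypothetical counts: rows are exposed / unexposed, columns are
# event / no event. The same proportions are scaled up 1x, 10x, 100x.
base = (105, 895, 100, 900)
for scale in (1, 10, 100):
    table = tuple(count * scale for count in base)
    chi2, phi = chi2_and_phi(*table)
    print(f"N={sum(table):>7}  X^2={chi2:8.3f}  Phi={phi:.4f}")
```

With 1 df the 5% critical value for X^2 is about 3.84, so the same half-point difference in event rate goes from unremarkable to "significant" purely by scaling the counts. That is the situation Marc describes, where the contextual importance of the effect has to be judged separately from its p-value.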