I have a huge data set with thousands of variables and one binary outcome variable. I know that most of the variables are correlated and are not good predictors... but it is very hard to start modeling with such a huge dataset. What would be your suggestion? How do I make a first cut... how do I eliminate most of the variables without ignoring potential interactions? For example, maybe variable A is not a good predictor and variable B is not a good predictor either, but maybe A and B together are a good predictor... Any suggestion is welcome.
Logistic regression problem
6 messages · milicic.marko, Milicic B. Marko, Frank E Harrell Jr +2 more
2 days later
The only solution I can see is fitting all possible 2-factor models with interactions enabled and then assessing whether the interaction term is significant... any more ideas?
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
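For concreteness, here is a minimal sketch of that brute-force pairwise screen on made-up simulated data (all variable names here are invented). Note that with thousands of variables the number of pairs, choose(p, 2), runs into the millions, and the resulting p-values suffer from massive multiplicity — this is a sketch of the idea, not a recommendation:

```r
## Brute-force screen of all pairwise interactions (toy simulated data).
set.seed(1)
n <- 200; p <- 5                        # tiny; the real problem has thousands of columns
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("V", 1:p)
y <- rbinom(n, 1, plogis(X$V1 * X$V2))  # outcome driven only by an interaction
dat <- cbind(y = y, X)

pairs <- combn(names(X), 2)
pvals <- apply(pairs, 2, function(v) {
  fit <- glm(reformulate(paste(v[1], "*", v[2]), response = "y"),
             data = dat, family = binomial)
  coef(summary(fit))[4, 4]              # p-value of the interaction term
})
names(pvals) <- apply(pairs, 2, paste, collapse = ":")
head(sort(pvals))
```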
Please don't suggest such a thing unless you do simulations to back up its predictive performance, type I error properties, and the impact of collinearities. You'll find this approach works as well as the U.S. economy.

Frank Harrell
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University
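In that spirit, a small sketch (on simulated null data, all names invented) of the kind of simulation Frank is asking for: when the outcome is pure noise, the pairwise interaction screen still flags "significant" interactions in most datasets:

```r
## Family-wise error of the pairwise interaction screen under the null.
set.seed(2)
n <- 200; p <- 8; nsim <- 20            # small numbers to keep this quick
hits <- replicate(nsim, {
  X <- as.data.frame(matrix(rnorm(n * p), n, p))
  names(X) <- paste0("V", 1:p)
  dat <- cbind(y = rbinom(n, 1, 0.5), X)  # outcome independent of every predictor
  pv <- combn(names(X), 2, function(v) {
    fit <- glm(reformulate(paste(v[1], "*", v[2]), response = "y"),
               data = dat, family = binomial)
    coef(summary(fit))[4, 4]            # p-value of the interaction term
  })
  any(pv < 0.05)                        # any "significant" interaction in this dataset?
})
mean(hits)                              # typically far above the nominal 0.05
```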
So... I wouldn't suggest the all-possible-logistic-models approach either, and I'm not sure exactly what your goals are in modeling. However, I've been fiddling around with the variable importance (varimp) functions that come with the randomForest and party packages. The idea is to get a sense of which independent variables are likely to be useful, and then to focus on the variables identified as highly important with more attention than you could spend on the whole set. A general advantage of the recursive partitioning approach is that it deals fairly nicely with interactions and collinearity. In theory, recursive partitioning should also be able to handle missing values (often a problem with large datasets), but I have been unable to make this work with the variable importance functions. Let me know if you need more details. You can check out http://www.biomedcentral.com/1471-2105/9/307 for a couple of examples of variable importance.

Jason Jones, PhD
Medical Informatics
j.jones at imail.org
801.707.6898
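A hedged sketch of this first-cut idea. To stay with packages that ship with R, it uses rpart's built-in `$variable.importance` as a stand-in; `randomForest::importance()` and `party::varimp()` are used analogously. The data are simulated, with only x1 and x2 truly predictive:

```r
## Ranking predictors by importance as a first cut (rpart's built-in measure,
## standing in for randomForest::importance() / party::varimp()).
library(rpart)
set.seed(4)
n <- 300
d <- as.data.frame(matrix(rnorm(n * 6), n, 6))
names(d) <- paste0("x", 1:6)
d$y <- factor(rbinom(n, 1, plogis(2 * d$x1 - 2 * d$x2)))  # only x1 and x2 matter
fit <- rpart(y ~ ., data = d, method = "class")
sort(fit$variable.importance, decreasing = TRUE)  # x1 and x2 should dominate
```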
On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
I think you could start with rpart(binary_variable ~ ., data = yourdata).
This shows you a set of variables with which to start a model, and starting
cutoff values for the continuous variables.
Bernardo Rangel Tura, M.D., MPH, Ph.D.
National Institute of Cardiology, Brazil
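A runnable version of this suggestion on made-up simulated data (in practice you would substitute your own data frame and outcome variable):

```r
## A first-cut rpart fit, as suggested above (simulated data; x1 is the
## only real predictor here).
library(rpart)
set.seed(5)
n <- 400
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- factor(rbinom(n, 1, plogis(3 * d$x1)))
fit <- rpart(y ~ ., data = d, method = "class")
print(fit)  # each printed split names a variable and a candidate cutoff
```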
I cannot imagine a worse way to formulate a regression model. Reasons include:

1. Results of recursive partitioning are not trustworthy unless the sample size exceeds 50,000 or the signal-to-noise ratio is extremely high.
2. The type I error of tests from the final regression model will be extraordinarily inflated.
3. False interactions will appear in the model.
4. The cutoffs so chosen will not replicate, and in effect assume that covariate effects are discontinuous and piecewise flat. The use of cutoffs results in a huge loss of information and power and makes the analysis arbitrary and impossible to interpret (e.g., a high covariate value:low covariate value odds ratio or mean difference is a complex function of all the covariate values in the sample).
5. The model will not validate in new data.

Frank
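The information loss from cutoffs (point 4 above) can be illustrated directly on simulated data: fit the same predictor once as a continuous covariate and once dichotomized at a cutoff, and compare the z-statistics — the dichotomized fit discards information:

```r
## Dichotomizing a continuous predictor throws away information.
set.seed(3)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))                  # smooth effect, linear in the logit
full <- glm(y ~ x,        family = binomial)  # covariate kept continuous
cutf <- glm(y ~ I(x > 0), family = binomial)  # same covariate, cut at 0
c(z_continuous   = coef(summary(full))[2, 3],
  z_dichotomized = coef(summary(cutf))[2, 3])
```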