Analyzing Poor Performance Using naiveBayes()
4 messages · C.H., Kirk Fleming
I think you have been hit by the problem of high variance (overfitting). Maybe you should consider doing feature selection, perhaps using the chi-squared ranking from FSelector. Then train the Naive Bayes classifier on the top n features (n = 1 to 200) as ranked by chi-squared, and plot the AUC or F1 score on both the training set and a held-out (cross-validation) set against n. From that graph you can select the optimal value of n.
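In R, that suggestion might look roughly like the sketch below. It is only a sketch: it assumes a training data frame 'train' and a held-out data frame 'test', each with a binary factor column 'Class' (names chosen here for illustration), and it uses FSelector for the chi-squared ranking, e1071 for naiveBayes(), and pROC for the AUC (the thread does not say which AUC implementation was used).

library(FSelector)   # chi.squared(), cutoff.k(), as.simple.formula()
library(e1071)       # naiveBayes()
library(pROC)        # roc(), auc()

## Rank every predictor by its chi-squared association with the class
weights <- chi.squared(Class ~ ., data = train)

## Fit Naive Bayes on the top-n features for a grid of n, recording AUC
## on the training data and on the held-out data
ns   <- c(1, 2, 3, 5, 10, 20, 50, 100, 200)
aucs <- data.frame(n = ns, train = NA_real_, test = NA_real_)
for (i in seq_along(ns)) {
  feats <- cutoff.k(weights, ns[i])
  fit   <- naiveBayes(as.simple.formula(feats, "Class"), data = train)
  ## column 2 of the posterior matrix = P(second class level), assumed
  ## here to be the positive class
  p_tr <- predict(fit, train, type = "raw")[, 2]
  p_te <- predict(fit, test,  type = "raw")[, 2]
  aucs$train[i] <- as.numeric(auc(roc(train$Class, p_tr)))
  aucs$test[i]  <- as.numeric(auc(roc(test$Class,  p_te)))
}

## Plot both curves against n and pick n where the held-out AUC peaks
matplot(aucs$n, aucs[, c("train", "test")], type = "b", lty = 1, pch = 1,
        xlab = "top-n features by chi-squared", ylab = "AUC", log = "x")
legend("bottomright", c("training", "held-out"), lty = 1, col = 1:2)

A widening gap between the two curves as n grows is the usual signature of the variance problem described above.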
On Fri, Aug 10, 2012 at 6:40 AM, Kirk Fleming <kirkrfleming at hotmail.com> wrote:
My data is 50,000 instances of about 200 predictor values, and for all 50,000 examples I have the actual class labels (binary). The data is quite unbalanced, with about 10% or less of the examples having a positive outcome and the remainder, of course, negative. Nothing suggests the data has any order, and it doesn't appear to have any, so I've pulled the first 30,000 examples to use as training data, reserving the remainder for test data.

There are actually 3 distinct sets of class labels associated with the predictor data, and I've built 3 distinct models. When each model is used in predict() with the training data and true class labels, I get AUC values of 0.95, 0.98 and 0.98 for the 3 classifier problems. When I run these models against the 'unknown' inputs that I held out (the 20,000 instances), I get AUC values of about 0.55 or so for each of the three problems, give or take. I reran the entire experiment, but this time using 40,000 instances for model building and the remaining 10,000 for testing. The AUC values showed a modest improvement, but were still under 0.60.

I've looked at (a) the number of unique values that each predictor takes on, and (b) the number of values, for a given predictor, that appear in the test data but not in the training data. I can eliminate variables that have very few non-null values, and those that have very few unique values (the two are largely the same), but I wouldn't expect this to have much influence on the model. I've already eliminated variables that are null in every instance, and duplicate variables having identical values for all instances. I have not done anything further to check for dependent (correlated) predictors, and don't know how to.

Besides getting a clue, what might be my next best step?
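For reference, a minimal sketch of the split/fit/score workflow described above, assuming the data sit in one data frame 'dat' with a binary factor column 'Class'; e1071's naiveBayes() and pROC's roc()/auc() stand in here for whatever functions were actually used, and a random split replaces "the first 30,000 rows".

library(e1071)
library(pROC)

set.seed(1)
## A random (or stratified) split is usually safer than taking the first
## 30,000 rows, even when the data show no obvious ordering.
idx   <- sample(nrow(dat), 30000)
train <- dat[idx, ]
test  <- dat[-idx, ]

fit <- naiveBayes(Class ~ ., data = train)

## Posterior probability of the positive class (assumed second factor level)
p_train <- predict(fit, train, type = "raw")[, 2]
p_test  <- predict(fit, test,  type = "raw")[, 2]

auc(roc(train$Class, p_train))   # in-sample AUC
auc(roc(test$Class,  p_test))    # held-out AUC, the number that matters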
Per your suggestion I ran chi.squared() against my training data and, to my delight, found just 50 parameters that were non-zero influencers. I built the model through several iterations and found n = 12 to be the optimum for the training data. However, the results are still not so good for the test data. Here are the results for both, with AUC values for n = 3 to 50: training data in the 0.97 range, test data in the 0.55 area. http://r.789695.n4.nabble.com/file/n4639964/Feature_Selection_02.jpg

If the training and test data sets were not so indistinguishable, I'd assume something weird about the test data, but I can't tell the two apart using any descriptive 'meta' statistics I've tried so far. Having double-checked for dumb errors and still obtained the same results, I toasted everything and started from scratch, and still got the same performance on the test data. Maybe I'll take a break and reflect for 30 minutes.
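One possible follow-up (a sketch only, reusing the hypothetical 'train' and 'Class' objects from the earlier examples): choose n by k-fold cross-validation within the training data rather than by training-set AUC, re-ranking the features inside each fold so the selection itself is not fitted to the validation rows.

library(FSelector)
library(e1071)
library(pROC)

set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(train)))

## Rank features separately within each fold's training portion, so the
## ranking never sees that fold's validation rows
fold_weights <- lapply(1:k, function(j)
  chi.squared(Class ~ ., data = train[folds != j, ]))

cv_auc <- function(n) {
  mean(sapply(1:k, function(j) {
    feats <- cutoff.k(fold_weights[[j]], n)
    fit   <- naiveBayes(as.simple.formula(feats, "Class"),
                        data = train[folds != j, ])
    p <- predict(fit, train[folds == j, ], type = "raw")[, 2]
    as.numeric(auc(roc(train$Class[folds == j], p)))
  }))
}

## Cross-validated AUC over a few candidate values of n
sapply(c(3, 6, 12, 25, 50), cv_auc)

If the cross-validated AUC also sits near 0.55, the choice of n is not the problem; if it tracks the 0.97 figure instead, something must differ between the training and test portions despite the descriptive statistics looking alike.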
As some additional information, I re-ran the model across the range n = 50 to 150 (n being the 'top n' parameters returned by chi.squared), and this time used a completely different subset of the data for both training and test. Nearly identical results, with the typical training AUC about 0.98 and the typical test AUC about 0.56. The other change I made: 30k records (instances) for training this time and 20k for test. I'll check whether the set of class labels I'm currently using (I'm only running one of the 3 sets) is the least balanced and, if so, I'll grab the most balanced. However, I don't think any of the three sets is much better than 90/10.
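A small sketch (same hypothetical 'dat', 'train', 'test' and 'Class' objects as above) for confirming that a split preserves the roughly 90/10 ratio, and for drawing a stratified random split so that both sets do:

## Positive rate in each existing split
prop.table(table(train$Class))
prop.table(table(test$Class))

## Stratified 60/40 split: sample positives and negatives separately
set.seed(2)
pos <- which(dat$Class == levels(dat$Class)[2])   # assumed positive level
neg <- which(dat$Class == levels(dat$Class)[1])
idx <- c(sample(pos, round(0.6 * length(pos))),
         sample(neg, round(0.6 * length(neg))))
train <- dat[idx, ]
test  <- dat[-idx, ]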