Random forests prediction

4 messages · Matt, Liaw, Andy

#
Hi all,

I have a strange problem when applying RF in R. 
I have a set of variables with which I obtain an AUC of 0.67.

A second set of variables, used on its own, gives an AUC of 0.57.

When I merge the first and second set of variables, the AUC becomes 0.64. 

I would expect the prediction to improve when I add variables that do
have some predictive power.
This is even stranger because the AUC on the training set increased when I
added more variables, while the AUC on the validation set decreased.

Has anyone experienced the same thing, or does anyone know what the
reason could be?

Thanks,

Matthijs

--
View this message in context: http://r.789695.n4.nabble.com/Random-forests-prediction-tp4627409.html
Sent from the R help mailing list archive at Nabble.com.
2 days later
#
I don't think this is so hard to explain.  If you evaluate AUC using OOB predictions or a test set (or something like CV or the bootstrap), that is what I would expect for most data.  When you add more variables (that are, say, less informative) to a model, the model has to look harder to find the informative ones, and thus you pay a penalty.  One exception is when some of the "new" variables happen to interact very strongly with some of the "old" variables; in that case you may see improved performance.

I've said it several times before, but it seems to be worth repeating:  Don't use the training set for evaluating models.  That almost never makes sense.
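
A minimal sketch of the point above, on simulated data rather than the original poster's data. Everything in it is an assumption made for illustration: the data-generating model, the number of noise variables, and the helper names `auc` and `compare_auc`. It assumes the randomForest package is installed. It fits a forest on three weakly informative predictors, then on the same predictors plus 50 pure-noise columns, and reports the AUC computed on the training set next to the AUC computed from OOB votes.

```r
library(randomForest)  # assumed to be installed

# Rank-based (Mann-Whitney) AUC; `label` is 0/1, `score` is numeric.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n <- 500
# Hypothetical data: 3 weakly informative predictors and 50 noise predictors.
x_good  <- matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, paste0("g", 1:3)))
x_noise <- matrix(rnorm(n * 50), n, 50, dimnames = list(NULL, paste0("z", 1:50)))
y <- factor(rbinom(n, 1, plogis(0.5 * rowSums(x_good))))

# Fit a forest and return the AUC on the training data and on the OOB votes.
compare_auc <- function(x, y) {
  fit <- randomForest(x, y, ntree = 500)
  lab <- as.integer(y == "1")
  c(train = auc(predict(fit, x, type = "prob")[, "1"], lab),  # resubstitution
    oob   = auc(fit$votes[, "1"], lab))                       # out-of-bag
}

res <- rbind(informative = compare_auc(x_good, y),
             plus_noise  = compare_auc(cbind(x_good, x_noise), y))
print(round(res, 3))
```

The `train` column comes from predicting the same rows the trees were grown on, so it says little about how the model generalizes. The `oob` column is the one to compare between the two variable sets.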

Andy


-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of matt
Sent: Friday, May 11, 2012 3:43 PM
To: r-help at r-project.org
Subject: [R] Random forests prediction

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Notice:  This e-mail message, together with any attachme...{{dropped:11}}
#
But shouldn't it be resolved when I set mtry to the maximum number of
variables? 
Then the model explores all the variables for the next step, so it will
still be able to find the better ones? And then in the later steps it could
use the (less important) variables.

Matthijs

--
View this message in context: http://r.789695.n4.nabble.com/Random-forests-prediction-tp4627409p4629944.html
Sent from the R help mailing list archive at Nabble.com.
#
That's not how RF works at all.  The setting of mtry is irrelevant to this.
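
For reference, `mtry` in randomForest is the number of variables randomly sampled as candidates at each split, and setting it to the total number of predictors gives bagged trees. One rough way to see how it behaves is to sweep it on simulated data and compare OOB AUC. The sketch below is only that: the data, the grid of values, and the helper `auc` are assumptions for illustration, not anything from the original poster's setup, and it assumes the randomForest package is installed.

```r
library(randomForest)  # assumed to be installed

# Rank-based (Mann-Whitney) AUC; `label` is 0/1, `score` is numeric.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(2)
n <- 500
# Hypothetical data: columns g1..g3 carry signal, z1..z50 are pure noise.
x <- cbind(matrix(rnorm(n * 3), n, 3), matrix(rnorm(n * 50), n, 50))
colnames(x) <- c(paste0("g", 1:3), paste0("z", 1:50))
y   <- factor(rbinom(n, 1, plogis(0.5 * rowSums(x[, 1:3]))))
lab <- as.integer(y == "1")

# Candidate values; the last one (all predictors) corresponds to bagging.
mtry_grid <- c(1, floor(sqrt(ncol(x))), 20, ncol(x))

oob_auc <- sapply(mtry_grid, function(m) {
  fit <- randomForest(x, y, ntree = 500, mtry = m)
  auc(fit$votes[, "1"], lab)  # OOB votes, not training-set predictions
})
print(data.frame(mtry = mtry_grid, oob_auc = round(oob_auc, 3)))
```

Whatever the value of `mtry`, the noise columns stay in the data and remain available as split candidates, so a sweep like this shows the effect of the setting without removing them from the model.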

Andy 

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of matt
Sent: Monday, May 14, 2012 10:22 AM
To: r-help at r-project.org
Subject: Re: [R] Random forests prediction
