
randomForest out of bag prediction

4 messages · Witold E Wolski, Michael Mayer, Bert Gunter +1 more

#
Hello,

I am just not sure what the predict.randomForest function is doing...
I'm confused.

I would expect these two function calls to produce the same predictions:
```{r}
library(randomForest)
library(dplyr)

diachp.rf <- randomForest(quality ~ ., data = data, ntree = 50, importance = TRUE)

ypred_oob <- predict(diachp.rf)       # no newdata
dataX <- data %>% select(-quality)    # remove response
ypred <- predict(diachp.rf, dataX)

ypred_oob == ypred
```
These should both be out-of-bag predictions, but ypred and ypred_oob are actually very different:

ypred_oob    0    1
        0 1324  346
        1  493 2837

ypred    0    1
    0 1817    0
    1    0 3183
What I find even more disturbing is the 100% accuracy for ypred.
Would you agree that this is rather unexpected?

regards
Witek
#
predict(diachp.rf, dataX) returns the in-sample predictions, not the OOB predictions. The response variable `quality` is only used during model fitting, not during prediction.

Since in-sample predictions of random forests are typically grossly overfitted by construction, extremely high accuracies are not unexpected.
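To see the distinction concretely, one can compare predict() without newdata against the OOB predictions stored in the fitted object itself. A minimal sketch, assuming the randomForest package is installed and using the built-in iris data as a stand-in for `data`/`quality`:

```r
library(randomForest)

# Fit on iris as a stand-in for the poster's data
rf <- randomForest(Species ~ ., data = iris, ntree = 50)

# Both of these are the same out-of-bag class predictions:
oob1 <- predict(rf)    # no newdata: predict() returns OOB predictions
oob2 <- rf$predicted   # the fitted object stores them directly
all.equal(oob1, oob2)  # TRUE

# This, by contrast, runs the training rows through every tree (in-sample),
# including the trees that were grown on those very rows:
insample <- predict(rf, newdata = iris[, -5])
```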

#
Off topic.
But see here:
https://stats.stackexchange.com/questions/61405/random-forest-and-prediction

-- Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Sat, Jan 12, 2019 at 9:56 AM Witold E Wolski <wewolski at gmail.com> wrote:
#
See inline.
> AFAIK these are, indeed, the out-of-bag predictions.
These are not out of bag predictions. dataX is interpreted as new data
(argument newdata), and it is assumed to contain entirely new
observations. Each observation in dataX is fed through all of the
trees and the predictions are then pooled. There is no out-of-bag here
- all of the new data observations are assumed to be independent of
the training set.
The 100% accuracy is expected (and not disturbing) if your training
set had enough variables (or signal) to create trees that fit the
training data perfectly.

HTH,

Peter