randomForest() for regression produces offset predictions
I would expect this regression-towards-the-mean behavior on a new or hold-out dataset, not on the training data. In RF terminology, this means that the prediction returned by predict() here is the in-bag estimate, whereas the out-of-bag estimate is what you want for prediction.

In Joshua's example, rf.rf$predicted is an out-of-bag estimate, but since newdata is given to predict(), the result appears to be the in-bag estimate, which still needs an adjustment like Joshua's (and perhaps a more complex one in some cases). This is a bit confusing, since predict() usually matches what's in model$fitted.values; I imagine that's why the author used "predicted" as the component name instead of the standard "fitted.values". The documentation for predict.randomForest explains: "newdata: a data frame or matrix containing new data. (Note: If not given, the out-of-bag prediction in object is returned.)"
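To make the distinction concrete, here is a minimal sketch along the lines of Joshua's example (the set.seed() call and the small r2() helper are my own additions, not part of his code):

library(randomForest)
data(swiss)
set.seed(1)   # only so the numbers below are reproducible
rf.rf <- randomForest(Infant.Mortality ~ ., data = swiss)
actual <- swiss$Infant.Mortality
# Supplying newdata means every tree votes on every case, including the
# trees that were trained on it -- the optimistic "in-bag" style estimate.
pred.in <- predict(rf.rf, newdata = swiss)
# Omitting newdata returns the out-of-bag predictions (the same vector as
# rf.rf$predicted): each case is predicted only by trees that did not see it.
pred.oob <- predict(rf.rf)
all.equal(pred.oob, rf.rf$predicted)    # should be TRUE
# Coefficient of determination computed from each set of predictions
r2 <- function(p, a) 1 - sum((a - p)^2) / sum((a - mean(a))^2)
r2(pred.in, actual)     # optimistic; this is where the offset shows up
r2(pred.oob, actual)    # honest estimate of out-of-sample fit

On the training data the first R^2 is typically much higher than the second; the out-of-bag figure is the one to report.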
Patrick Burns wrote:
What I see is the predictions being less extreme than the actual values -- predictions for large actual values are smaller than the actual, and predictions for small actual values are larger than the actual. That makes sense to me. The object is to maximize out-of-sample predictive power, not in-sample predictive power. Or am I missing something in what you are saying?

Patrick Burns
patrick at burns-stat.com
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and "A Guide for the Unwilling S User")
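One quick way to see the shrinkage Patrick describes, again with the swiss data (the object names and the set.seed() call here are mine, purely for illustration):

library(randomForest)
data(swiss)
set.seed(1)
rf <- randomForest(Infant.Mortality ~ ., data = swiss)
pred <- predict(rf, swiss)
actual <- swiss$Infant.Mortality
range(pred)      # typically narrower than the range of the observed values
range(actual)
# Regressing actual on predicted typically gives a slope above 1 (and a
# non-zero intercept), i.e. the predictions are pulled in towards the mean.
coef(lm(actual ~ pred))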
Joshua Knowles wrote:

Hi all,

I have observed that when using the randomForest package to do regression, the predicted values of the dependent variable given by a trained forest are not centred and have the wrong slope when plotted against the true values. This means that the R^2 values obtained by squaring the Pearson correlation are better than those obtained by computing the coefficient of determination directly. The R^2 value obtained by squaring the Pearson correlation can, however, be exactly reproduced by the coefficient of determination if the predicted values are first linearly transformed (using lm() to find the required intercept and slope).

Does anyone know why randomForest behaves in this way, producing offset predictions? Does anyone know a fix for the problem? (By the way, the effect is there even if the original dependent variable values are initially transformed to have zero mean and unit variance.)

As an example, here is some simple R code that uses the available swiss dataset to show the effect I am observing. Thanks for any help.

--

#### EXAMPLE OF RANDOM FOREST REGRESSION
library(randomForest)
data(swiss)
swiss
#Build the random forest to predict Infant Mortality
rf.rf<-randomForest(Infant.Mortality ~ ., data=swiss)
#And predict the training set again
pred<-c(predict(rf.rf,swiss))
actual<-swiss$Infant.Mortality
#Plotting predicted against actual values shows the effect (uncomment to see this)
#plot(pred,actual)
#abline(0,1)
# calculate R^2 as Pearson correlation squared
R2one<-cor(pred,actual)^2
# calculate R^2 value as fraction of variance explained
residOpt<-(actual-pred)
residnone<-(actual-mean(actual))
R2two<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
# now fit a line through the predicted and true values and
# use this to normalize the data before calculating R^2
fit<-lm(actual ~ pred)
coef(fit)
pred2<-pred*coef(fit)[2]+coef(fit)[1]
residOpt<-(actual-pred2)
R2three<-1-var(residOpt,na.rm=TRUE)/var(residnone, na.rm=TRUE)
cat("Pearson squared = ",R2one,"\n")
cat("Coeff of determination = ", R2two, "\n")
cat("Coeff of determination after linear fitting = ", R2three, "\n")
## END