Prediction variance (map) for predictions derived using RandomForest package
Dear Forrest,

Thanks a lot for your tip. I think quantregForest is what we were looking for. It takes much more time to compute, but the method looks sound (http://jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf).

In the end I do simplify things: I derive the lower and upper limits corresponding to +/- 1 s.d. (quantiles 0.15866 and 1-0.15866) and use these to approximate the prediction variance. This treats the prediction distribution as roughly normal, but it is probably as good as it gets.

Here is the revised code: https://code.google.com/p/gsif/source/browse/trunk/meuse/RK_vs_RandomForestK.R

Thank you all for your suggestions / opinions (very useful as usual).

cheers,
T. (Tom) Hengl
Url: http://www.wageningenur.nl/en/Persons/dr.-T-Tom-Hengl.htm
Network: http://profiles.google.com/tom.hengl
Publications: http://scholar.google.com/citations?user=2oYU7S8AAAAJ
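The +/- 1 s.d. simplification described above can be sketched in base R. This is only an illustration under stated assumptions: the `draws` matrix is simulated and merely stands in for whatever per-pixel predictive distribution the quantile method provides, and the names `draws`, `q`, `pred_sd` and `pred_var` are hypothetical, not taken from the linked script.

```r
# Probabilities that correspond to -1 and +1 s.d. under a normal distribution
p_lo <- pnorm(-1)   # approx. 0.15866
p_hi <- pnorm(1)    # 1 - p_lo

# ASSUMPTION: simulated predictive draws, one row per pixel.
# These are hypothetical numbers, not results from the meuse data.
set.seed(1)
n_pix <- 5
draws <- matrix(rnorm(n_pix * 2000, mean = 2.4, sd = 0.3), nrow = n_pix)

# Lower and upper quantiles per pixel (rows = pixels, columns = p_lo, p_hi)
q <- t(apply(draws, 1, quantile, probs = c(p_lo, p_hi)))

# Half the distance between the two limits approximates 1 s.d.;
# squaring it gives an approximate prediction variance per pixel
pred_sd  <- (q[, 2] - q[, 1]) / 2
pred_var <- pred_sd^2

print(round(cbind(lower = q[, 1], upper = q[, 2], pred_sd, pred_var), 3))
```

The half-width only equals one standard deviation when the predictive distribution is close to normal; for skewed distributions the two limits sit at different distances from the centre, so the result is best read as a rough summary.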
On 23/06/2013 15:08, Forrest Stevens wrote:
Hi Tom,

I've done something similar in the past to visualize the distribution of the predictions obtained for each observation across the many trees within a random forest, looking at various aspects of those ranges and correlating them with cross-validated prediction errors. It's relatively easy to generate and keep the predictions of every tree for each observation (pixel in your case) using the predict.all=TRUE argument:

    predictions <- predict(random_forest, newdata=x_data_new, predict.all=TRUE)

Then, to extract all of the individual trees' predictions for the first observation (one row of the matrix):

    predictions$individual[1, ]

You can do this to get the mean, SD and CV for each observation (note that the mean should match the value in predictions$aggregate):

    y_data$rf_mean <- apply(predictions$individual, MARGIN=1, mean)
    y_data$rf_sd <- apply(predictions$individual, MARGIN=1, sd)
    y_data$rf_cv <- y_data$rf_sd / y_data$rf_mean

In practice I've found during testing that the distribution of values (assuming the continuous regression case, since you're looking at SD in the first place) is highly skewed. The range, SD, CV and other measures of the distribution of the individual trees do not correlate well at all with prediction errors in my work. It kind of makes intuitive sense, since the power of the random forest algorithm lies in the ensemble nature of the technique and in the randomness injected via variable sampling at each node; the measures of variation in the predictions that I've looked at quickly become irrelevant as you scale up the number of trees in the forest. So your mileage may vary, but I'd be interested to know what you find.

You may also want to look at the excellent quantregForest package, as it produces a randomForest object but also gives you information on the quantiles and quantile range for each observation's prediction, including some nice plots that I've found useful.
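The per-tree summaries described above can be tried out without fitting a forest. The sketch below is a minimal illustration under stated assumptions: `individual` is a simulated stand-in for `predictions$individual` (rows are observations, columns are trees), drawn from a skewed distribution purely for illustration, and the names `individual`, `first_obs` and `summary_df` are hypothetical.

```r
# ASSUMPTION: simulated stand-in for predictions$individual as returned by
# predict(..., predict.all = TRUE); rows = observations, columns = trees.
# The numbers are hypothetical, not output from a fitted random forest.
set.seed(42)
n_obs  <- 4
n_tree <- 500
individual <- matrix(rlnorm(n_obs * n_tree, meanlog = 0.8, sdlog = 0.4),
                     nrow = n_obs, ncol = n_tree)

# All per-tree predictions for the first observation: the comma matters,
# individual[1] alone would return a single value
first_obs <- individual[1, ]

# Per-observation summaries across trees
rf_mean <- rowMeans(individual)                 # the ensemble mean per observation
rf_sd   <- apply(individual, MARGIN = 1, sd)    # spread across trees
rf_cv   <- rf_sd / rf_mean                      # coefficient of variation

summary_df <- data.frame(rf_mean, rf_sd, rf_cv)
print(summary_df)
```

With a real forest, `individual` would be replaced by `predictions$individual`, and for regression `rf_mean` can be compared against `predictions$aggregate` as a sanity check.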
Sincerely,
Forrest

On Sun, Jun 23, 2013 at 5:51 AM, Tomislav Hengl <hengl at spatial-analyst.net> wrote:
Dear list, I have a question about the randomForest models. I'm trying to figure out a way to estimate the prediction variance (spatially) for the randomForest function (http://cran.r-project.org/web/packages/randomForest/). If I run a GLM I can also derive the prediction variance using:
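library(sp)
# load the meuse points and the meuse.grid prediction locations
demo(meuse, echo=FALSE)
# overlay points and grid, then attach the point attributes
meuse.ov <- over(meuse, meuse.grid)
meuse.ov <- cbind(meuse.ov, meuse@data)
omm0 <- glm(log1p(om)~dist+ffreq, meuse.ov, family=gaussian())
# se.fit=TRUE returns the standard error of the fit for each grid cell
om.glm <- predict.glm(omm0, meuse.grid, se.fit=TRUE)
str(om.glm)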
List of 3
 $ fit           : Named num [1:3103] 2.34 2.34 2.32 2.29 2.34 ...
  ..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
 $ se.fit        : Named num [1:3103] 0.0491 0.0491 0.0481 0.046 0.0491 ...
  ..- attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...
 $ residual.scale: num 0.357

When I fit a randomForest model, I do not get any estimate of the model uncertainty (for each pixel), just the predictions:
library(randomForest)
# drop the rows that glm() excluded because of missing values
meuse.ov <- meuse.ov[-omm0$na.action,]
x <- randomForest(log1p(om)~dist+ffreq, meuse.ov)
om.rf <- predict(x, meuse.grid)
str(om.rf)
 Named num [1:3103] 2.49 2.49 2.51 2.44 2.49 ...
 - attr(*, "names")= chr [1:3103] "1" "2" "3" "4" ...

Does anyone have an idea how to map the prediction variance (i.e. estimated or propagated error) for the randomForest models spatially? I've tried deriving a propagated error for the randomForest models (every fit gives a different model because of the random component):
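# one column per refit; om_1 is filled in by the first iteration
l.rfk <- data.frame(om_1 = rep(NA, nrow(meuse.grid)))
for(i in 1:50){
  # refit the forest; each fit differs because of the random component
  x <- suppressWarnings(suppressMessages(randomForest(log1p(om)~dist+ffreq, meuse.ov)))
  l.rfk[,paste("om",i,sep="_")] <- predict(x, meuse.grid)
} ## takes ca 1 minute
# om.rfk is not defined in this snippet; it comes from the complete script linked below
meuse.grid$om.rfkvar <- om.rfk@predicted$var1.var + apply(l.rfk, 1, var)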
but the prediction variance I get is rather small (much smaller than e.g. the GLM variance). Here is the complete code with some plots:

R code: https://code.google.com/p/gsif/source/browse/trunk/meuse/RK_vs_RandomForestK.R
Predictions UK vs randomForest-kriging: https://gsif.googlecode.com/svn/trunk/meuse/Fig_meuse_RK_vs_RFK.png

Thanks,
T. (Tom) Hengl
_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo