Hi everyone,

I haven't found anything similar in the forum, so here's my problem (I'm no expert in R or statistics):

I have a data set of 59,000 cases with 9 variables each (fractional coverage of 9 different plant types, such as deciduous broad-leaved temperate trees or evergreen tropical trees), which was generated by a vegetation model. To evaluate the quality of the vegetation model's output, I want to compare it to a land-cover data set which has 23 different land-cover types (such as needle-leaved evergreen forest, dense broad-leaved forest, barren, etc.).

A statistician advised me to use the randomForest package in R. Using a subset of the data to grow the random forest, I get a very good prediction for the rest. However, I need to evaluate how meaningful this classification is in an ecological sense (boreal trees should not play a role in the definition of tropical land-cover types, for example); otherwise I cannot judge the quality of the vegetation model's output.

Unfortunately, randomForest gives me about 15,000 splits, of which about 5,000 are end branches (rough guess), so it's very hard and time-consuming to check every branch of one of the final trees for its ecological meaning.

Is there any utility to summarize the characteristics of each of the 23 prediction classes? Something like "land-cover class 1 has less than 5% of plant types 1-5, 20-50% of plant type 7 and at least 30% of plant type 8". Or is there a more suitable method to classify my data?

Thanks a lot in advance!
Christoph

____________________________________________________________________________
Click on the following link for the Netherlands Environmental Assessment Agency (MNP) mission and contact information: http://www.mnp.nl/signature.html
ecological meaning of randomForest vegetation classification?
2 messages · Christoph Muller, Liaw, Andy
Hi Christoph,

I'm not exactly sure what you're looking for, but I'll take a stab anyway.

The trees in a random forest are not designed to be interpreted the way one would interpret an "ordinary" tree. There are several things you may try to see if they help you.

One is the distribution of votes. It looks like you are classifying each data point into one of many possible classes. RF will give you the fraction of trees in the forest that classified the observation as a particular class (and the class with the highest fraction of votes is the "predicted class").

Another is the partial dependence plot: you can use plot(importance(rf.object)) to see which variables are the most important, and then use partialPlot() to examine their marginal effects.

These offer some clue of what the RF black box is doing, and hopefully will make some sense to you.

Best,
Andy
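For readers who want to try the suggestions from the reply, here is a minimal sketch in R. It is not taken from the thread: the built-in iris data (4 predictors, 3 classes) stands in for the vegetation data (9 plant-type fractions, 23 land-cover classes), and the object names rf, mean_votes, imp, pd and by_class are made up for illustration. The last step, a per-class median of the predictors, is a plain descriptive summary added here as one possible way to approach the "class 1 has less than 5% of ..." question; it is not a randomForest feature.

```r
# Sketch only: iris is a stand-in for the vegetation/land-cover data.
library(randomForest)

set.seed(1)
# importance = TRUE is needed to get the per-class importance columns.
rf <- randomForest(Species ~ ., data = iris, ntree = 200, importance = TRUE)

# 1. Distribution of votes: one row per case, one column per class.
#    For the training data these are out-of-bag vote fractions.
votes <- unclass(rf$votes)
print(head(round(votes, 2)))

# Average vote fraction per true class (rows) and voted class (columns);
# large off-diagonal values show which classes the forest confuses.
mean_votes <- apply(votes, 2, function(v) tapply(v, iris$Species, mean))
print(round(mean_votes, 2))

# 2. Variable importance, overall and per class.
imp <- importance(rf)
print(round(imp[, levels(iris$Species)], 1))  # one column per class
varImpPlot(rf)

# 3. Partial dependence of one class on one predictor.
#    which.class selects the class whose marginal effect is shown.
pd <- partialPlot(rf, pred.data = iris, x.var = "Petal.Width",
                  which.class = "virginica")

# 4. Descriptive summary (an assumption of this sketch, not an RF tool):
#    median of each predictor within each out-of-bag predicted class.
by_class <- aggregate(iris[, 1:4], by = list(predicted = rf$predicted),
                      FUN = median)
print(by_class)
```

With the real data, the per-class importance columns are probably the quickest ecological sanity check: if a boreal plant type ranks high for a tropical land-cover class, that is worth a closer look with partialPlot() for that class.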
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.