Random forests

2 messages · Naiara Pinto, Gavin Simpson

#
Dear all,

I would like to use a tree regression method to analyze my dataset. I
am interested in the fact that random forests creates in-bag and
out-of-bag datasets, but I also need an estimate of support for each
split. That seems hard to do in random forests since each tree is
grown using a subset of the predictor variables.

I was thinking of setting mtry = number of predictor variables,
growing several trees, and computing the support for each node as the
number of times that a certain predictor variable was chosen for that
node. Can this be implemented using random forests?

Thanks!

Naiara.
#
On Tue, 2007-12-18 at 16:27 -0600, Naiara Pinto wrote:
Hi Naiara,

I'm no expert here, but what you propose with mtry = number of
predictors will give you a procedure known as bagging.

You talk about support for the split and then for the node. Is this just
a typo or are you interested in the two different things?

I'm not aware of how you do the latter in bagging or random forests, as
the whole point is to grow large trees, not pruned ones. As to the
former, trees are unstable: change the data used to train them just a
little and you can get a very different fitted tree.

Bagging and random forests exploit this to produce a better prediction
machine / classifier by using n poor trees rather than one best tree.
They do this by adding randomness to the procedure: bootstrap sampling
the training data and, in the case of random forests, randomly sampling a
small number, mtry, of the available predictors as candidates at each
split. As such there is no correspondence between the splits of one tree
and the splits of another, so there is no sensible way to count how many
times a certain split in one or more trees is formed by the same
predictor. So it doesn't make sense (to me; it may to others) to focus
on individual splits in the n trees.
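
To make the point concrete, here is a rough sketch in plain Python (not
the randomForest package, and not anyone's actual implementation). It
looks only at the root split, which is the one node that does line up
across trees. It draws bootstrap samples, picks the best root split
among mtry randomly chosen predictors, and tallies which predictor won.
The simulated data, the function names and the settings (40 rows, 3
predictors, 50 resamples) are all assumptions made for illustration.

```python
import random
from collections import Counter


def sse(values):
    """Sum of squared deviations from the mean (regression node impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_root_split(X, y, candidate_vars):
    """Return (variable index, threshold) of the split with the lowest total SSE."""
    best_var, best_thr, best_score = None, None, float("inf")
    for j in candidate_vars:
        # Every observed value except the largest is a possible threshold.
        for t in sorted(set(row[j] for row in X))[:-1]:
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            score = sse(left) + sse(right)
            if score < best_score:
                best_var, best_thr, best_score = j, t, score
    return best_var, best_thr


def root_split_counts(X, y, n_trees, mtry, rng):
    """Tally which predictor forms the root split over bootstrap resamples."""
    n, p = len(X), len(X[0])
    counts = Counter()
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample (in-bag rows)
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        candidates = rng.sample(range(p), mtry)      # predictors tried at this split
        var, _ = best_root_split(Xb, yb, candidates)
        counts[var] += 1
    return counts


# Simulated data (an assumption): predictor 0 is strong, 1 is weak, 2 is pure noise.
rng = random.Random(1)
X = [[rng.random() for _ in range(3)] for _ in range(40)]
y = [4 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.3) for r in X]

bagging = root_split_counts(X, y, n_trees=50, mtry=3, rng=rng)      # mtry = all predictors
forest_like = root_split_counts(X, y, n_trees=50, mtry=1, rng=rng)  # mtry = 1
print("mtry = 3:", dict(bagging))
print("mtry = 1:", dict(forest_like))
```

With mtry equal to the number of predictors the tally reflects only the
bootstrap resampling. With a small mtry it also reflects which
predictors happened to be offered at that split, so a noise variable
gets picked whenever nothing better was available. Below the root the
nodes of different trees cover different subsets of the data, so the
same tally is much harder to interpret.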

I don't know what you mean exactly by "support", but if you are trying
to get a measure of how important each of your predictors is in
explaining variance in your response, then take a look at the
importance() function in the randomForest package. This produces a
couple of measures that allow you to determine which predictors
contribute most to reducing node impurity or MSE.
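
For a feel of what a permutation-style importance measure does, here is
a minimal sketch in plain Python. It is an illustration of the general
idea only, not the code behind importance(): each "tree" is a one-split
stump fitted to a bootstrap sample, and the importance of a predictor is
the average increase in out-of-bag MSE after shuffling that predictor's
values. The simulated data, the function names and the settings are
assumptions made for the example.

```python
import random


def fit_stump(X, y):
    """Fit a one-split regression tree: (variable, threshold, left mean, right mean)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted(set(r[j] for r in X))[:-1]:
            left = [yi for r, yi in zip(X, y) if r[j] <= t]
            right = [yi for r, yi in zip(X, y) if r[j] > t]
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            score = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if best is None or score < best[0]:
                best = (score, j, t, ml, mr)
    return best[1:]


def mse(stump, X, y):
    """Mean squared error of a stump's predictions on (X, y)."""
    j, t, ml, mr = stump
    return sum(((ml if r[j] <= t else mr) - yi) ** 2 for r, yi in zip(X, y)) / len(y)


def permutation_importance(X, y, n_trees, rng):
    """Average increase in out-of-bag MSE when each predictor is shuffled."""
    n, p = len(X), len(X[0])
    increase = [0.0] * p
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]       # in-bag rows
        in_bag = set(idx)
        oob = [i for i in range(n) if i not in in_bag]   # out-of-bag rows
        if not oob:
            continue
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        X_oob = [X[i] for i in oob]
        y_oob = [y[i] for i in oob]
        base = mse(stump, X_oob, y_oob)
        for j in range(p):
            col = [r[j] for r in X_oob]
            rng.shuffle(col)                             # break the link between x_j and y
            X_perm = [r[:j] + [v] + r[j + 1:] for r, v in zip(X_oob, col)]
            increase[j] += (mse(stump, X_perm, y_oob) - base) / n_trees
    return increase


# Simulated data (an assumption): predictor 0 is strong, 1 is weak, 2 is pure noise.
rng = random.Random(7)
X = [[rng.random() for _ in range(3)] for _ in range(40)]
y = [4 * r[0] + 0.5 * r[1] + rng.gauss(0, 0.3) for r in X]

imp = permutation_importance(X, y, n_trees=50, rng=rng)
for j, value in enumerate(imp):
    print(f"predictor {j}: mean increase in OOB MSE = {value:.3f}")
```

A stump uses only one predictor, so shuffling any other predictor leaves
its predictions unchanged and contributes zero. Fully grown trees spread
their splits over more predictors, which is what makes the measure
informative in a real forest.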

HTH

G