outliers using Random Forest

The thing to do is probably:

1. Use a fairly large number of trees (e.g., 1000).
2. Run a few times and average the results.

The reason for the instability is twofold:

1. The random forest algorithm itself is based on randomization.  That's why
it's probably a good idea to have 500-1000 trees to get more stable
proximity measures (on which the outlying measures are based).

2. If you are running randomForest in unsupervised mode (i.e., not giving it
the class labels), then the program treats the data as "class 1", creates a
synthetic "class 2", and runs the classification algorithm to get the
proximity measures.  You probably need to run the algorithm a few times so
that the result will be based on several simulated data sets, instead of
just one.
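
A minimal sketch of the two suggestions above, for illustration only.  The
iris data, the seed, and the variable names (n_runs, scores, avg_score) are
assumptions made for this example and are not from the original question;
five runs is an arbitrary choice.  It fits the forest several times in
unsupervised mode with 1000 trees each and averages the outlying measures
returned by outlier():

```r
library(randomForest)

set.seed(42)          # only so that this sketch is reproducible
x <- iris[, 1:4]      # example data (assumption); class labels are not used

n_runs <- 5           # hypothetical number of repeated fits
scores <- sapply(seq_len(n_runs), function(i) {
  # No y is supplied, so randomForest runs in unsupervised mode and
  # simulates a new synthetic "class 2" on every call.
  rf <- randomForest(x, ntree = 1000, proximity = TRUE)
  outlier(rf)         # outlying measure computed from the proximities
})

# One column per run; average across runs for each observation.
avg_score <- rowMeans(scores)

# Row indices of the observations with the largest average measure.
head(order(avg_score, decreasing = TRUE))
```

Comparing the columns of scores (e.g., with cor(scores)) gives a rough idea
of how much the measure still varies from run to run; if it varies a lot,
more trees or more runs may help.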

HTH,
Andy
