
problems with large data II

1 message · Liaw, Andy

If you have a large enough machine, you'll be able to run randomForest with
data of that size (we have done so regularly).  One thing that many people
don't seem to realize is that the "formula interface" has significant
overhead.  For large data sets, try running randomForest without using the
formula.  Other tips:

- If you don't need to predict future data, set keep.forest to FALSE;
  storing the forest takes lots of memory.
- If you already have the test set data, give it to randomForest along with
  the training data, instead of using predict() afterward.
- If you have a classification problem, try using the sampsize option to
  reduce the number of cases used to grow each tree.
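For what it's worth, here is a minimal sketch combining those tips (the data
and object names are made up for illustration; substitute your own):

```r
library(randomForest)

## Hypothetical data: x is the predictor data frame, y the response,
## xtest/ytest a test set you already have on hand.
set.seed(1)
x     <- data.frame(matrix(rnorm(1000 * 10), ncol = 10))
y     <- factor(sample(c("a", "b"), 1000, replace = TRUE))
xtest <- data.frame(matrix(rnorm(200 * 10), ncol = 10))
ytest <- factor(sample(c("a", "b"), 200, replace = TRUE))

rf <- randomForest(
  x = x, y = y,                 # x/y interface, avoiding formula overhead
  xtest = xtest, ytest = ytest, # score the test set now, no predict() later
  keep.forest = FALSE,          # don't store the forest; saves much memory
  sampsize = c(100, 100)        # cases drawn per class to grow each tree
)
rf$test$err.rate                # test-set error as trees are added
```

With xtest/ytest supplied, the test-set predictions and error rates are in
the rf$test component of the returned object.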

As to the problem of having categorical predictors with more than 32
categories:  Prof. Breiman's new version can deal with categorical
predictors with (IMHO) an obscene number of categories.  However, I have
chosen to give that a very low priority for adding to the R package.  The
reason is that, IMHO, such variables need some massaging
(collapsing/merging/whatever) before they will be somewhat meaningful in a
model, anyway.  (And personally I have no need for such a feature.)
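As an example of what that massaging might look like, here is one simple
(hypothetical) approach: lump all levels that cover less than some fraction
of the data into a single "other" level, which usually brings the level
count well under 32.

```r
## Collapse levels of a factor that cover less than min.frac of the
## observations into a single "other" level.
collapse.rare <- function(f, min.frac = 0.01) {
  f   <- factor(f)
  tab <- table(f) / length(f)
  rare <- names(tab)[tab < min.frac]
  ## Assigning the same label to several levels merges them.
  levels(f)[levels(f) %in% rare] <- "other"
  f
}

## Made-up factor: 13 common letters and 13 rare ones.
set.seed(1)
f <- factor(sample(letters, 5000, replace = TRUE,
                   prob = c(rep(0.07, 13), rep(0.003, 13))))
nlevels(f)                  # number of levels before collapsing
nlevels(collapse.rare(f))   # far fewer afterward
```

Whether frequency-based lumping is sensible depends on the problem; merging
by domain knowledge (e.g. grouping ZIP codes by region) is often better.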

HTH,
Andy