sampsize in Random Forests

2 messages · Naiara Pinto, Liaw, Andy

#
Hi all,

I have a dataset where each point is assigned to a class A, B, C, or
D. Each point is also assigned to a study site. Each study site is
coded with a number from 1 to 100. This information is stored
in the vector studySites.

I want to run randomForest using stratified sampling, so I chose the option
strata = factor(studySites)

But I am not sure how to control the number of samples taken from each
study site. I tried to use 10 points from each study site:
mySampSize = rep(10, 100)

So my function call looks like:
RF = randomForest(myClass~., data=myData, mtry=5, importance=TRUE,
strata = factor(studySites), sampsize=mySampSize)

But randomForest gives me the following error:
Error in randomForest.default(m, y, ...) :
sampsize can not be larger than class frequency

Does anybody have any idea why this happens?

Thank you very much,

Naiara.
#
Are you sure there are 100 sites in your data?  Here's an example:

R> library(randomForest)
randomForest 4.5-23
Type rfNews() to see new features/changes/bug fixes.
R> f <- factor(sample(1:4, nrow(iris), replace=TRUE))
R> rf1 <- randomForest(iris[1:4], iris[[5]], strata=f, sampsize=rep(5, nlevels(f)))
R> rf1

Call:
 randomForest(x = iris[1:4], y = iris[[5]], strata = f, sampsize =
rep(5,      nlevels(f))) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08
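
A quick way to check the question above is to tabulate the strata before calling randomForest. The sketch below uses base R only and made-up data: the site codes are random stand-ins, and siteCounts is just an illustrative name. As far as I can tell, the error is raised when any entry of sampsize is larger than the number of points in the corresponding stratum, so a site with fewer than 10 points would be enough to trigger it.

```r
# Hypothetical stand-in for the real data: 150 points with site codes drawn
# from 1..100, so some codes will be absent and most sites will be small.
set.seed(42)
studySites <- sample(1:100, 150, replace = TRUE)
strata <- factor(studySites)

# How many sites are actually present, and how small is the smallest one?
siteCounts <- table(strata)
nlevels(strata)
min(siteCounts)

# One sampsize entry per observed site (same order as levels(strata)),
# capped at the site's own frequency so no entry exceeds its stratum.
mySampSize <- pmin(10, as.vector(siteCounts))
```

If nlevels(strata) is below 100 or min(siteCounts) is below 10, then rep(10, 100) cannot work as is. The capped mySampSize and strata can then be passed as the sampsize and strata arguments; whether capping is the right choice for the analysis is a separate question.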