Hi all,

I have a dataset where each point is assigned to a class A, B, C, or D. Each point is also assigned to a study site. Each study site is coded with a number ranging between 1-100. This information is stored in the vector studySites.

I want to run randomForests using stratified sampling, so I chose the option

    strata = factor(studySites)

But I am not sure how to control the number of samples taken from each study site. I tried to use 10 points from each study site:

    mySampSize = rep(10, 100)

So my function call looks like:

    RF = randomForest(myClass~., data=myData, mtry=5, importance=TRUE, strata = factor(studySites), sampsize=mySampSize)

But randomForest gives me the following error:

    Error in randomForest.default(m, y, ...) : sampsize can not be larger than class frequency

Does anybody have any idea why this happens?

Thank you very much,
Naiara.
sampsize in Random Forests
2 messages · Naiara Pinto, Liaw, Andy
Are you sure there are 100 sites in your data? Here's an example:
R> library(randomForest)
randomForest 4.5-23
Type rfNews() to see new features/changes/bug fixes.
R> f <- factor(sample(1:4, nrow(iris), replace=TRUE))
R> rf1 <- randomForest(iris[1:4], iris[[5]], strata=f, sampsize=rep(5,
nlevels(f)))
R> rf1
Call:
randomForest(x = iris[1:4], y = iris[[5]], strata = f, sampsize =
rep(5, nlevels(f)))
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 2
OOB estimate of error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08
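
One way to check this on your own data is to tabulate the strata before calling randomForest. The sketch below uses only base R. The studySites vector in it is a made-up stand-in (random site codes, not the poster's data), and it assumes the error is raised when any entry of sampsize exceeds the number of points in the corresponding stratum, which is what the error message suggests but is not confirmed in this thread.

```r
# Hypothetical stand-in for the poster's studySites vector:
# codes are drawn from 1..100, but not every code has to occur,
# and some sites may contribute very few points.
set.seed(1)
studySites <- sample(1:100, 400, replace = TRUE)

strata     <- factor(studySites)
siteCounts <- table(strata)

nlevels(strata)   # number of sites actually present (may be fewer than 100)
min(siteCounts)   # size of the smallest site

# Sites with fewer than 10 points; rep(10, ...) would ask for too many here
names(siteCounts)[siteCounts < 10]

# One sampsize entry per observed site, capped at that site's size
mySampSize <- pmin(10, as.vector(siteCounts))
```

With a vector built this way, sampsize=mySampSize has the same length as nlevels(strata) and never asks for more points than a site contains. Whether capping or dropping the small sites is the better choice depends on the study design.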
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Naiara Pinto
Sent: Sunday, March 09, 2008 5:19 PM
To: r-help at r-project.org
Subject: [R] sampsize in Random Forests