I have been trying to use randomForest and specifically predict for
randomForest as follows:
for (y in 7:42){
data1 <- indata[c(1:5,y)]
test1 <- test[c(1:5,y)])]
data1 <- na.omit(data1))]
test1 <- na.omit(test1))]
set.seed(1234)
tree=randomForest(x=data1[,2:5], y=data1[,6], ntree=1000, mtry=3,
importance=TRUE, keep.forest=TRUE)
summary(tree)
print(tree)
tree.predict <- predict(tree, test1[,2:6], type="response",
nodes=TRUE)
table(observed = test1, predicted = tree.predict)
varUsed(tree, count=TRUE)
}
The data set, data1, has the following form, with ERClass and ChanClass
being factors:
  FieldNum ERClass ChanClass DrainageArea PctFines Clinger
1  04LM099       5         1   10.2791962 0.000000      10
2  04LM127       5         1   44.9838181 0.000000      10
3  96SC002       3         1  668.9939004 0.000000      29
4  96SC037       3         1  241.9048792 0.000000      23
5  97LS051       3         1  342.3964136 0.000000      17
.
.
.
In this example, FieldNum is a sample identifier that is not used in the
analysis; Clinger is the dependent variable. The other variables are
the independent variables. The data set, test1, is a subset of 12
samples, with the same variables, that were removed from data1 prior to
the analysis.
What I would like is to get a prediction of the characteristics of the
end nodes where the majority of the trees place each of these 12 samples
(i.e., something like ERClass = 3, ChanClass = 2 or 3, DrainageArea > 400,
PctFines < 10, although I have not found an example for a similar
problem, so I am not sure what it will look like exactly).
However, the output I am currently getting is:
Call:
randomForest(x = data1[, 2:5], y = data1[, 6], ntree = 1000, mtry
= 3, importance = TRUE, keep.forest = TRUE)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 3
Mean of squared residuals: 17.6679
% Var explained: 49.65
Error in predict.randomForest(tree, test1[, 1:6], type = "response",
nodes = TRUE) :
Type of predictors in new data do not match that of the training data.
Clearly, something is wrong with my predict statement, but what? Do I
need to re-identify which variables are x and which variable is y? If
so, how? Also, am I going to get the result I am looking for? If not,
how do I need to write this to get that? The help pages I have found
have been very inadequate.
Thanks for your help.
Michael
Michael B. Griffith, Ph.D.
Research Ecologist
USEPA, NCEA (MS A-110)
26 W. Martin Luther King Dr.
Cincinnati, OH 45268
telephone: 513 569-7034
e-mail: griffith.michael at epa.gov
Getting output from predict.randomForest
2 messages · Griffith.Michael at epamail.epa.gov, Gavin Simpson
1 day later
On Fri, 2008-09-26 at 10:58 -0400, Griffith.Michael at epamail.epa.gov wrote:
[...]
Do str(indata) and str(test) give the same information regarding the
types of variables? If any of the variables used are factors, do the
factors have the same levels in indata and test?

I'd probably do this differently, and store the test and training data
in the same df to start with, and then split it out at random into a
training and test set object (or just use the indices on the main object
depending on whether I want the training or test rows). This way, the
variables will be the same type/format/structure as they came from the
same df to begin with.

Also, I really don't follow your loop code. You seem to be indexing
indata without reference to columns/rows in first line within the loop.
There also seem to be several syntax errors - too many "]"?

So start simple, set y <- 7 and perform the first run of the loop "by
hand" and once that works, then do the loop in full.

G
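
A minimal sketch of the single-data-frame approach and the type/level check suggested above, in base R only. The data here are made up; alldata, same_type and test_bad are hypothetical names for illustration and are not objects from the original post. Only the column names follow the post.

```r
# Hypothetical stand-in for the real data; column names follow the post.
set.seed(1234)
n <- 60
alldata <- data.frame(
  FieldNum     = sprintf("S%03d", seq_len(n)),
  ERClass      = factor(sample(c(3, 5), n, replace = TRUE)),
  ChanClass    = factor(sample(1:3, n, replace = TRUE)),
  DrainageArea = runif(n, 10, 700),
  PctFines     = runif(n, 0, 20),
  Clinger      = rpois(n, 15),
  stringsAsFactors = FALSE
)

# Hold out 12 rows at random. Both pieces come from the same parent
# data frame, so they share column types and factor levels.
test_rows <- sample(seq_len(n), 12)
train <- alldata[-test_rows, ]
test  <- alldata[test_rows, ]

# Compare class and factor levels of two columns.
same_type <- function(a, b) {
  identical(class(a), class(b)) && identical(levels(a), levels(b))
}

predictors <- c("ERClass", "ChanClass", "DrainageArea", "PctFines")
type_ok <- sapply(predictors, function(v) same_type(train[[v]], test[[v]]))
print(type_ok)

# A test set read in separately can lose the factor coding, which is
# one way the predictor types can end up not matching.
test_bad <- test
test_bad$ERClass <- as.integer(as.character(test_bad$ERClass))
print(same_type(train$ERClass, test_bad$ERClass))
```

If every entry of type_ok is TRUE, the predictor columns in the two sets agree in class and factor levels; a FALSE entry points at a column worth inspecting with str() before calling predict().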