Getting output from predict.randomForest

2 messages · Griffith.Michael at epamail.epa.gov, Gavin Simpson

#
I have been trying to use randomForest and specifically predict for
randomForest as follows:

for (y in 7:42){
  data1 <- indata[c(1:5,y)]
  test1 <- test[c(1:5,y)])]
  data1 <- na.omit(data1))]
  test1 <- na.omit(test1))]
  set.seed(1234)
  tree=randomForest(x=data1[,2:5], y=data1[,6], ntree=1000, mtry=3,
                    importance=TRUE, keep.forest=TRUE)
  summary(tree)
  print(tree)
  tree.predict <- predict(tree, test1[,2:6], type="response",
                          nodes=TRUE)
  table(observed = test1, predicted = tree.predict)
  varUsed(tree, count=TRUE)
}

The data set, data1, has the following form, with ERClass and ChanClass
being factors:

   FieldNum ERClass ChanClass DrainageArea PctFines Clinger
1   04LM099       5         1   10.2791962 0.000000      10
2   04LM127       5         1   44.9838181 0.000000      10
3   96SC002       3         1  668.9939004 0.000000      29
4   96SC037       3         1  241.9048792 0.000000      23
5   97LS051       3         1  342.3964136 0.000000      17
.
.
.

In this example, FieldNum is a sample identifier that is not used in the
analysis, and Clinger is the dependent variable.  The other variables are
the independent variables.  The data set test1 is a subset of 12 samples,
with the same variables, that were removed from data1 prior to the
analysis.

What I would like is a prediction of the characteristics of the end
nodes where the majority of the trees place each of these 12 samples
(i.e., something like ERClass = 3, ChanClass = 2 or 3, DrainageArea >
400, PctFines < 10; I have not found an example for a similar problem,
so I am not sure exactly what it will look like).

However, the output I am currently getting is:

Call:
 randomForest(x = data1[, 2:5], y = data1[, 6], ntree = 1000,      mtry
= 3, importance = TRUE, keep.forest = TRUE)
               Type of random forest: regression
                     Number of trees: 1000
No. of variables tried at each split: 3

          Mean of squared residuals: 17.6679
                    % Var explained: 49.65
Error in predict.randomForest(tree, test1[, 1:6], type = "response",
nodes = TRUE) :
  Type of predictors in new data do not match that of the training data.

Clearly, something is wrong with my predict statement, but what?  Do I
need to re-identify which variables are x and which variable is y?  If
so, how?  Also, am I going to get the result I am looking for?  If not,
how do I need to write this to get that?  The help pages I have found
have been very inadequate.

Thanks for your help.

Michael

Michael B. Griffith, Ph.D.
Research Ecologist

USEPA, NCEA (MS A-110)
26 W. Martin Luther King Dr.
Cincinnati, OH  45268

telephone:  513 569-7034
e-mail:  griffith.michael at epa.gov
1 day later
#
On Fri, 2008-09-26 at 10:58 -0400, Griffith.Michael at epamail.epa.gov
wrote:

Do str(indata) and str(test) report the same variable types? If any of
the variables used are factors, do those factors have the same levels in
indata and test?
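
As a rough illustration of that check, the sketch below compares column classes and factor levels between two data frames using base R only. The small indata and test objects here are made-up stand-ins, not the poster's data, and the realignment step at the end is one possible approach rather than a required one.

```r
# Hypothetical stand-ins for indata and test; the values are made up.
indata <- data.frame(
  ERClass = factor(c(3, 5, 5, 3)),
  ChanClass = factor(c(1, 2, 1, 3)),
  DrainageArea = c(10.3, 45.0, 669.0, 241.9)
)
test <- data.frame(
  ERClass = factor(c(3, 3)),
  ChanClass = factor(c(1, 1)),
  DrainageArea = c(342.4, 120.0)
)

# Compare the class of each column the two data frames share.
shared <- intersect(names(indata), names(test))
same_class <- sapply(shared, function(v) {
  identical(class(indata[[v]]), class(test[[v]]))
})

# For factor columns, compare the level sets as well.
factor_cols <- shared[sapply(indata[shared], is.factor)]
same_levels <- sapply(factor_cols, function(v) {
  identical(levels(indata[[v]]), levels(test[[v]]))
})

print(same_class)
print(same_levels)

# One possible fix: rebuild the test factors on the training levels.
# Test values that never occur in the training data would become NA.
for (v in factor_cols) {
  test[[v]] <- factor(test[[v]], levels = levels(indata[[v]]))
}
```

A factor built separately on a small test set often carries fewer levels than the training factor, which is the kind of mismatch this check is meant to expose.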

I'd probably do this differently: store the test and training data in
the same data frame to start with, then split it at random into training
and test set objects (or just use row indices on the main object,
depending on whether I want the training or test rows).

This way, the variables will have the same type, format, and structure,
because they came from the same data frame to begin with.
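
A minimal sketch of that kind of split, in base R. The alldata object and its values are invented for illustration; only the test-set size of 12 is taken from the original post.

```r
set.seed(1234)

# Hypothetical combined data frame; the values are made up.
alldata <- data.frame(
  FieldNum = sprintf("S%03d", 1:40),
  ERClass = factor(rep(c(3, 5), 20)),
  Clinger = rpois(40, 15)
)

# Draw 12 row indices at random for the test set; the rest is training.
test_rows <- sample(nrow(alldata), 12)
train <- alldata[-test_rows, ]
test <- alldata[test_rows, ]

# Row subsetting keeps the factor levels, so both pieces agree.
identical(levels(train$ERClass), levels(test$ERClass))
```

Keeping only test_rows and indexing alldata directly would work just as well and avoids holding two extra copies of the data.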

Also, I really don't follow your loop code. You seem to be indexing
indata without reference to columns/rows in the first line within the
loop. There also seem to be several syntax errors: too many "]"?

So start simple, set y <- 7 and perform the first run of the loop "by
hand" and once that works, then do the loop in full.
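
A single pass done by hand might look roughly like the sketch below. It assumes the randomForest package is installed, uses made-up data in place of indata and test (including a placeholder column 6 named Unused), and passes predict() only the four predictor columns that were used for x in training. Whether that alone resolves the reported error depends on the real data, so treat it as a starting point.

```r
library(randomForest)  # assumes the randomForest package is installed

set.seed(1234)

# Hypothetical stand-in for the full data; the values are made up.
n <- 60
alldata <- data.frame(
  FieldNum = sprintf("S%03d", seq_len(n)),
  ERClass = factor(sample(c(3, 5), n, replace = TRUE)),
  ChanClass = factor(sample(1:3, n, replace = TRUE)),
  DrainageArea = runif(n, 10, 700),
  PctFines = runif(n, 0, 20),
  Unused = rnorm(n),          # placeholder for column 6
  Clinger = rpois(n, 15)      # response, placed in column 7
)

# Split one data frame so both pieces share types and factor levels.
test_rows <- sample(n, 12)
indata <- alldata[-test_rows, ]
test <- alldata[test_rows, ]

# First run of the loop, by hand.
y <- 7
data1 <- na.omit(indata[, c(1:5, y)])
test1 <- na.omit(test[, c(1:5, y)])

fit <- randomForest(x = data1[, 2:5], y = data1[, 6],
                    ntree = 200, mtry = 3,
                    importance = TRUE, keep.forest = TRUE)
print(fit)

# Predict from the same predictor columns that were used for x above.
pred <- predict(fit, newdata = test1[, 2:5], type = "response")
print(data.frame(observed = test1[, 6], predicted = pred))
```

Once this single run behaves as expected, the same body can go back inside the loop over the response columns.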

G