I am attempting to use randomForest to do classification and regression
tree analysis. After importing the data set, I use the following
statement:
tree <- randomForest(POET ~ DrainageArea + PctFines + L3_ER + ChanCon,
data=data, ntree=500, mtry=2,
replace=TRUE, importance=TRUE, do.trace=TRUE,
keep.forest=TRUE)
The first two independent variables are continuous numerical variables,
while the last two are categorical variables with more than two classes.
The package seems to handle this mixture of numerical and categorical
variables, but I am unclear how to interpret the splits for the
categorical variables.
The table describing the splits has a column, split point, which for
numerical variables is the value of the indicated variable where the
left daughter group is less than the value and the right daughter group
is greater than the value.
The documentation states that for categorical variables, split point is
an integer, whose binary expansion identifies which categories go into
the left and right daughter groups. It gives an example of a variable
with three classes and a split value of 5, which expands to 1 0 1. In
this case, the first and third classes go into the left daughter group
and the second class goes into the right daughter group.
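The binary-expansion rule described above can be sketched in R. This is only an illustration, not code from the randomForest package: decode_split is a hypothetical helper, and it assumes that the lowest-order bit refers to the first factor level and that a 1 bit sends that level to the left daughter group (my reading of the example in the getTree help page).

```r
# Hypothetical helper (not part of randomForest): decode a categorical
# split point into the levels sent to each daughter node.
# Assumption: bit k (counting from the lowest-order bit) refers to the
# k-th factor level, and a 1 bit sends that level to the left.
decode_split <- function(split_point, lev) {
  k <- seq_along(lev)
  bits <- (split_point %/% 2^(k - 1)) %% 2
  list(left = lev[bits == 1], right = lev[bits == 0])
}

# Three-class example from the documentation: 5 expands to 1 0 1
res <- decode_split(5, c("class1", "class2", "class3"))
print(res$left)   # "class1" "class3"
print(res$right)  # "class2"
```

With a fitted forest, the level vector would presumably be levels(data$L3_ER) and the split point would come from the table returned by getTree().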
My question now is: How does the package order the classes of a
categorical variable? This is not clear in the documentation, and if
this is something basic to R, I have not found it in the help files. In
my example, the variable, L3_ER, has five classes, DRAR, NCHF, NGPI,
NoLF, and WCBP. These levels are not ordered in any particular way in
the data set. I can think of two ways the package might order the
classes: (1) alphabetically or (2) in the order that they are first
encountered in the data set. Is either of these correct, or might there
be some other way of ordering the levels I have not thought of?
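The two candidate orderings can be compared directly in base R. The sketch below uses a made-up vector x holding the five L3_ER codes in an arbitrary order (not the real data). By default factor() stores the sorted unique values as its levels; order of first appearance has to be requested explicitly.

```r
# Made-up values, deliberately not in alphabetical order
x <- c("NGPI", "DRAR", "WCBP", "NCHF", "NoLF", "DRAR")

# Default behaviour: levels are the sorted unique values
f_default <- factor(x)
print(levels(f_default))  # "DRAR" "NCHF" "NGPI" "NoLF" "WCBP"

# Order of first appearance must be asked for explicitly
f_seen <- factor(x, levels = unique(x))
print(levels(f_seen))     # "NGPI" "DRAR" "WCBP" "NCHF" "NoLF"

# The integer codes stored internally index into levels()
print(as.integer(f_default))
```

Whether randomForest uses exactly this stored order is the open question here; the sketch only shows how R itself orders the levels.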
A colleague suggested that I might use is.ordered(), but I get an error
message, "Error in inherits(x, "factor") : object "L3_ER" not found."
Any other suggestions are appreciated. Thanks.
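The "object not found" error is what R reports when a column is referred to by its bare name while it exists only inside a data frame. A minimal sketch follows, assuming a small stand-in data frame called dat (hypothetical; the real one here is named data):

```r
# Hypothetical stand-in for the real data set
dat <- data.frame(
  L3_ER = c("NGPI", "DRAR", "WCBP", "NCHF", "NoLF"),
  stringsAsFactors = TRUE
)

# Refer to the column through the data frame, not by its bare name
is_fac <- is.factor(dat$L3_ER)
is_ord <- is.ordered(dat$L3_ER)
print(is_fac)             # TRUE
print(is_ord)             # FALSE: a plain (nominal) factor
print(levels(dat$L3_ER))  # the stored level order

# Equivalent form that evaluates the expression inside the data frame
is_ord_with <- with(dat, is.ordered(L3_ER))
```

Note that is.ordered() only reports whether the factor was declared as ordered; levels() is what shows the order in which the levels are stored.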
Michael
Michael B. Griffith, Ph.D.
Research Ecologist
USEPA, NCEA (MS A-110)
26 W. Martin Luther King Dr.
Cincinnati, OH 45268
telephone: 513 569-7034
e-mail: griffith.michael at epa.gov
Ordering of nominal or categorical variables in randomForest?
3 messages · Griffith.Michael at epamail.epa.gov, Karl Cottenie, Manjunatha Reddy
Michael, my guess is that the tree analysis uses the internal order of the factor levels. You can see this order by printing the variable. If it is stored as a factor, R first prints all the individual values, followed by a "Levels:" line that lists the distinct levels in the order in which they are stored (five levels in your case). You can also get this information with the levels() function. See this example from ?factor:
factor(letters[1:20], labels="letter")
 [1] letter1  letter2  letter3  letter4  letter5  letter6  letter7  letter8
 [9] letter9  letter10 letter11 letter12 letter13 letter14 letter15 letter16
[17] letter17 letter18 letter19 letter20
20 Levels: letter1 letter2 letter3 letter4 letter5 letter6 letter7 ... letter20  ##This is the line you are interested in

Karl

On Tue, 2008-07-08 at 13:57 -0400, Griffith.Michael at epamail.epa.gov wrote:
[original message quoted in full; see above]