
questions on rpart (tree changes when rearrange the order of covariates?!)

From: Uwe Ligges
I recently tried writing adaboost.m1 using rpart, and was surprised that
with a very small training set (say n=10 or 20), I get a large improvement
in test set accuracy if I randomly shuffle the columns of the data at
every adaboost iteration.  (With twonorm data, we're talking about 25%
error without shuffling vs. 19% with it, using an n=2000 test set.)  It
turned out to be the way rpart deals with ties: first come, first win.
Without shuffling the columns, rpart almost never picks any variable
beyond the 10th.  (In twonorm, all variables are equally important, so
one would expect roughly equal selection frequency.)
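
To make the tie issue concrete, here is a minimal sketch in plain Python. It is not rpart's code: the function names (gini, best_gain, pick_variable, pick_with_shuffled_columns) and the toy data are hypothetical, and the split criterion is assumed to be a plain Gini decrease. It shows a first-come-first-win rule always returning the earliest tied variable, and two ways around it: breaking ties at random, or shuffling the columns before the search.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_gain(column, y):
    """Largest Gini decrease over all thresholds on one variable."""
    n = len(y)
    parent = gini(y)
    best = 0.0
    for t in sorted(set(column))[:-1]:
        left = [yi for xi, yi in zip(column, y) if xi <= t]
        right = [yi for xi, yi in zip(column, y) if xi > t]
        gain = parent - (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, gain)
    return best

def pick_variable(X, y, rng=None):
    """Index of the split variable.

    rng=None  -> first come, first win (earliest tied column).
    rng given -> ties broken uniformly at random.
    """
    gains = [best_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    top = max(gains)
    tied = [j for j, g in enumerate(gains) if abs(g - top) < 1e-12]
    return tied[0] if rng is None else rng.choice(tied)

def pick_with_shuffled_columns(X, y, rng):
    """Workaround: permute the columns, then use first come, first win."""
    perm = list(range(len(X[0])))
    rng.shuffle(perm)
    X_shuffled = [[row[j] for j in perm] for row in X]
    return perm[pick_variable(X_shuffled, y)]  # map back to original index

# Toy data: with only 4 rows, all three variables separate the classes
# equally well, so every split search ends in a three-way tie.
X = [[1, 10, -4], [2, 20, -3], [3, 30, -2], [4, 40, -1]]
y = [0, 0, 1, 1]

print(pick_variable(X, y))  # always 0 here: the first tied column wins
rng = random.Random(1)
print(Counter(pick_variable(X, y, rng) for _ in range(300)))
print(Counter(pick_with_shuffled_columns(X, y, rng) for _ in range(300)))
```

With so few rows there are only a handful of distinct partitions, so exact ties are common, which would explain why the effect shows up at n=10 or 20.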

I've gotten some pointers from Terry Therneau about where in the code to
check.  I may try to implement breaking ties at random (as I've done in
randomForest).  No promises, though...

Andy