Rpart and bagging - how is it done?
On Fri, 7 Mar 2008, Prof Brian Ripley wrote:
I believe that the procedure you describe at the end (resampling the cases)
is the original interpretation of bagging, and that using weighting is
equivalent when a procedure uses case weights.
If you are getting different results when replicating cases and when using
weights then rpart is not using its weights strictly as case weights and it
would be preferable to replicate cases. But I am getting identical
predictions by the two routes:
ind <- sample(1:81, replace=TRUE)
rpart(Kyphosis ~ Age + Number + Start, data=kyphosis[ind,], xval=0)
rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
weights=tabulate(ind, nbins=81), xval=0)
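For anyone who wants to check the equivalence themselves, a minimal sketch (assuming the rpart package and its kyphosis data; the seed is illustrative, not from the original post) might look like:

```r
library(rpart)          # provides rpart() and the kyphosis data
set.seed(1)             # illustrative seed so the bootstrap sample is reproducible
ind <- sample(1:81, replace = TRUE)

# Route 1: replicate the bootstrap-sampled rows explicitly
fit1 <- rpart(Kyphosis ~ Age + Number + Start,
              data = kyphosis[ind, ], xval = 0)

# Route 2: keep all rows and pass the bootstrap counts as case weights
fit2 <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
              weights = tabulate(ind, nbins = 81), xval = 0)

# If the two routes are equivalent, predictions on the full data should agree
all.equal(predict(fit1, kyphosis, type = "class"),
          predict(fit2, kyphosis, type = "class"))
```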
My memory is that rpart uses unweighted numbers for its control params
(unlike tree) and hence is not strictly using case weights. I believe you
can avoid that by setting the control params to their minimum and relying on
pruning.
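One way to follow that suggestion, sketched here with illustrative parameter values, is to switch the stopping rules effectively off so the tree is grown as large as rpart allows, and then prune using the complexity table:

```r
library(rpart)
# Minimal stopping criteria: grow as large a tree as rpart allows
ctrl <- rpart.control(minsplit = 2, minbucket = 1, cp = 0, xval = 10)
big <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             control = ctrl)

# Prune back using the cp table, here at the cp minimizing cross-validated error
best_cp <- big$cptable[which.min(big$cptable[, "xerror"]), "CP"]
pruned <- prune(big, cp = best_cp)
```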
BTW, it is inaccurate to call these trees 'non-pruned' -- the default
setting of cp is still (potentially) doing quite a lot of pruning.
Torsten Hothorn can explain why he chose to do what he did. There's a small
(but only small) computational advantage in using case weights, but the
tricky issue for me is how precisely tree growth is stopped, and I don't
think that rpart at its default settings is mimicking what Breiman was doing
(he would have been growing much larger trees).
The weights approach is mainly used to avoid repeated formula parsing and other data-preprocessing steps every time a tree is grown (which in my experience can be quite a substantial advantage with respect to both speed and memory consumption). As Brian said, rpart doesn't really interpret weights as case weights, and thus the example code from the book is not totally correct. However, party::ctree, for example, does accept case weights. Best wishes, Torsten
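Since ctree does treat weights as case weights, the weights-based bagging step could be sketched with it instead of rpart (a minimal illustration, assuming the party package is installed; the number of trees B is arbitrary):

```r
library(party)                        # provides ctree(), which uses true case weights
data(kyphosis, package = "rpart")     # borrow the example data from rpart

n <- nrow(kyphosis)
B <- 25                               # illustrative number of bootstrap trees
trees <- vector("list", B)
for (b in 1:B) {
  # Multinomial bootstrap counts used directly as integer case weights
  w <- tabulate(sample(1:n, replace = TRUE), nbins = n)
  trees[[b]] <- ctree(Kyphosis ~ Age + Number + Start,
                      data = kyphosis, weights = w)
}
```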
On Thu, 6 Mar 2008, apjaworski at mmm.com wrote:
Hi there,
I was wondering if somebody knows how to perform a bagging procedure on a classification tree without running the classifier with weights. Let me first explain why I need this and then give some details of what I have found out so far.
I am thinking about implementing the bagging procedure in Matlab. Matlab has a simple classification tree function (in their Statistics toolbox), but it does not accept weights, and modifying the Matlab procedure to accommodate weights would be very complicated. The rpart function in R does accept weights, which seems to allow for a rather simple implementation of bagging. In fact, Everitt and Hothorn describe such a procedure in chapter 8 of "A Handbook of Statistical Analyses Using R". The procedure consists of generating several samples with replacement from the original data set, which has N rows. The implementation described in the book first fits a non-pruned tree to the original data set, then generates several (say, 25) multinomial samples of size N with probabilities 1/N. Each sample is used in turn as the weight vector to update the original tree fit, and finally all the updated trees are combined to produce "consensus" class predictions.
Now, a typical realization of a multinomial sample consists of small integers and several 0's. I thought that weighting worked like this: observations with weight 0 are omitted, and observations with weight > 1 are essentially replicated according to the weight. So I thought that instead of running the rpart procedure with weights, say, (1, 0, 2, 0, 1, ...), I could simply generate a sample data set by retaining row 1, omitting row 2, replicating row 3 twice, omitting row 4, retaining row 5, etc. However, this does not seem to work as I expected.
Instead of getting identical trees (from running weighted rpart on the original data set versus running rpart, with no weighting, on the replicated-row data set described above), I get trees that are completely different: different threshold values and a different order of variables entering the splits. Moreover, the predictions from these trees can differ, so the misclassification rates usually differ as well.
This finally brings me to my question: is there a way to mimic the workings of the weighting in rpart, for example by modifying the data set or perhaps by some other means?
Thanks in advance for your time,
Andy
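For completeness, the row-replication variant described above can be sketched as follows (a minimal illustration in R rather than Matlab, not the book's exact code; the seed and number of trees are arbitrary):

```r
library(rpart)          # provides rpart() and the kyphosis example data
set.seed(123)           # illustrative seed
n <- nrow(kyphosis)
B <- 25                 # number of bootstrap trees

# Fit one tree per bootstrap sample, replicating rows instead of weighting
preds <- sapply(1:B, function(b) {
  ind <- sample(1:n, replace = TRUE)
  fit <- rpart(Kyphosis ~ Age + Number + Start,
               data = kyphosis[ind, ], xval = 0)
  as.character(predict(fit, kyphosis, type = "class"))
})

# Majority vote across the B trees gives the "consensus" class prediction
consensus <- apply(preds, 1, function(v) names(which.max(table(v))))
```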
__________________________________
Andy Jaworski
518-1-01 Process Laboratory
3M Corporate Research Laboratory
-----
E-mail: apjaworski at mmm.com
Tel: (651) 733-6092
Fax: (651) 736-3122
--
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595