Rpart and bagging - how is it done?

I would like to thank Brian Ripley and Torsten Hothorn for their quick and
thoughtful responses.

I reran the example given by Professor Ripley by just starting R and
sourcing the code below, and I got slightly different results.  Then I ran
it again, setting the random seed before the sample command, and I got
identical results a few times.  However, I found the example below, which
seems to be reproducible on my system (Win200 Pro, CoreDuo Xeon about a
year old).   I get the same results in 2.6.2 (patched March 4) and 2.7.0
(version of February 28).  Both were compiled from the tarballs in Cygwin
with up-to-date Rtools and no errors.  I just ran "make fullcheck" on 2.6.2
and it passes with no problems (just the usual stuff: network connectivity
fails due to our firewall, and there are slight numerical differences in a
few cases).  The results from the rpart test are included at the bottom of
this post.

set.seed(123)
library(rpart)
ind <- sample(1:81, replace=TRUE)
rpart(Kyphosis ~ Age + Number + Start, data=kyphosis[ind,], xval=0)
rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
       weights=tabulate(ind, nbins=81), xval=0)

Here is what I get:
n= 81

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 81 14 absent (0.8271605 0.1728395) *

> rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
+        weights=tabulate(ind, nbins=81), xval=0)
n= 81

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 81 14 absent (0.8271605 0.1728395)
   2) Start>=8.5 62  6 absent (0.9062500 0.0937500)
     4) Start>=14.5 29  0 absent (1.0000000 0.0000000) *
     5) Start< 14.5 33  6 absent (0.8000000 0.2000000)
      10) Age< 55 12  0 absent (1.0000000 0.0000000) *
      11) Age>=55 21  6 absent (0.6000000 0.4000000)
        22) Age>=111 14  2 absent (0.8000000 0.2000000) *
        23) Age< 111 7  1 present (0.2000000 0.8000000) *
   3) Start< 8.5 19  8 absent (0.5294118 0.4705882) *

The trees are dramatically different (the first one is just a root).  The
predictions are of course different (the first model predicts all cases as
absent) but the total number of misclassified observations differs by only
1 (17 vs. 16).
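
One way to check the misclassification counts quoted above is to predict on
the original 81 cases with both fits and count disagreements with the
observed response.  This is only a sketch: the names fit_rep, fit_wt and
miscount are made up for illustration, and the counts it prints depend on
the R version, since the sample drawn after set.seed(123) is not guaranteed
to be the same across versions.

```r
library(rpart)

set.seed(123)
ind <- sample(1:81, replace = TRUE)

# Route 1: replicate the resampled cases
fit_rep <- rpart(Kyphosis ~ Age + Number + Start,
                 data = kyphosis[ind, ], xval = 0)

# Route 2: keep the original data and pass the resampling counts as weights
fit_wt <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                weights = tabulate(ind, nbins = 81), xval = 0)

# Hypothetical helper: number of the original 81 cases a fit misclassifies
miscount <- function(fit) {
  pred <- predict(fit, newdata = kyphosis, type = "class")
  sum(pred != kyphosis$Kyphosis)
}

errs <- c(replicated = miscount(fit_rep), weighted = miscount(fit_wt))
print(errs)
```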

Can anyone reproduce this, or is something wrong with my system?

Thanks again,

Andy

PS.  rpart version is 3.1-39

rpart results from "make fullcheck"

-------- Testing package rpart --------
Massaging examples into 'rpart-Ex.R' ...
Running examples in 'rpart-Ex.R' ...
Running specific tests
  Running `surv_test.R'
  Running `testall.R'
  Comparing `testall.Rout' to `testall.Rout.save' ...127c127
<       g2      < 22.77 to the right, improve=6.8130, (6 missing)
---
159c159
<       g2      < 22.77 to the right, improve=4.8340, (6 missing)
---
193c193
<       grade < 3.5   to the left,  agree=0.772, adj=0.188, (0 split)
---
199c199
<       g2      < 13.47 to the left,  improve=3.55300, (0 missing)
---
241c241
<  1) root 146 53.420  5.893e-18
---
275c275
<   mean=5.893e-18, MSE=0.3659
---
346c346
<       g2      < 13.47 to the left,  improve=4.238e-02, (3 missing)
---
375c375
<       g2      < 17.91 to the right, improve=0.1271000, (1 missing)
---
515c515
<       g2      < 13.47 to the left,  improve=1.94600, (3 missing)
---
555c555
<       g2      < 17.91 to the right, improve=3.122000, (1 missing)
---
647c647
<       life       < 70.25 to the right, improve=0.25230, (0 missing)
---
OK
  Running `usersplits.R'
  Comparing `usersplits.Rout' to `usersplits.Rout.save' ...174c174
< Timing ratio =  3.2
---
OK

__________________________________
Andy Jaworski
518-1-01
Process Laboratory
3M Corporate Research Laboratory
-----
E-mail: apjaworski at mmm.com
Tel:  (651) 733-6092
Fax:  (651) 736-3122


                                                                           
Prof Brian Ripley <ripley at stats.ox.ac.uk>
03/07/2008 03:11 AM

To:      apjaworski at mmm.com
cc:      Torsten.Hothorn at R-project.org, R-help at R-project.org
Subject: Re: [R] Rpart and bagging - how is it done?




I believe that the procedure you describe at the end (resampling the
cases) is the original interpretation of bagging, and that using weighting
is equivalent when a procedure uses case weights.
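
As a small illustration of that equivalence, here is a sketch using a
statistic that does treat its weights strictly as case weights (the mean).
The data vector x is made up; tabulate() turns the resampled indices into
per-case counts, and the weighted mean on the original data should agree
with the plain mean on the replicated cases.

```r
set.seed(1)
x <- c(2.5, 4, 1, 7, 3.5)                  # hypothetical data, 5 cases

ind <- sample(seq_along(x), replace = TRUE)  # bootstrap indices
w   <- tabulate(ind, nbins = length(x))      # times each case was drawn

# The counts carry the same information as the indices (up to order)
same_cases <- identical(sort(ind), rep(seq_along(x), times = w))

# Replicating cases vs. weighting the original cases
m_rep <- mean(x[ind])
m_wt  <- weighted.mean(x, w)

print(same_cases)
print(all.equal(m_rep, m_wt))
```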

If you are getting different results when replicating cases and when using
weights then rpart is not using its weights strictly as case weights and
it would be preferable to replicate cases.  But I am getting identical
predictions by the two routes:

ind <- sample(1:81, replace=TRUE)
rpart(Kyphosis ~ Age + Number + Start, data=kyphosis[ind,], xval=0)
rpart(Kyphosis ~ Age + Number + Start, data=kyphosis,
       weights=tabulate(ind, nbins=81), xval=0)

My memory is that rpart uses unweighted numbers for its control params
(unlike tree) and hence is not strictly using case weights.  I believe you
can avoid that by setting the control params to their minimum and relying
on pruning.
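
A sketch of what that might look like, assuming the relevant control
params are minsplit and minbucket.  Dropping the zero-weight rows from the
weighted fit is an extra step of this sketch, not something suggested
above; it is there because such rows may otherwise still be counted as
cases.  Whether the two trees then agree is something to check rather than
assume.

```r
library(rpart)

set.seed(123)
ind  <- sample(1:81, replace = TRUE)
w    <- tabulate(ind, nbins = 81)
keep <- w > 0                      # assumption of this sketch: drop unused cases

# Smallest allowed node sizes; tree size is then governed by cp alone
ctrl <- rpart.control(minsplit = 2, minbucket = 1, xval = 0)

fit_rep <- rpart(Kyphosis ~ Age + Number + Start,
                 data = kyphosis[ind, ], control = ctrl)
fit_wt  <- rpart(Kyphosis ~ Age + Number + Start,
                 data = kyphosis[keep, ], weights = w[keep], control = ctrl)

# Compare tree sizes (number of nodes) as a first, crude check
sizes <- c(replicated = nrow(fit_rep$frame), weighted = nrow(fit_wt$frame))
print(sizes)
```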

BTW, it is inaccurate to call these trees 'non-pruned' -- the default
setting of cp is still (potentially) doing quite a lot of pruning.
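
To see how much the defaults restrict growth, one can inspect
rpart.control() and compare a default fit with one grown with cp = 0 and
minimal node sizes.  This is a sketch on the full kyphosis data; the
helper leaves() is made up for illustration.

```r
library(rpart)

# Default control settings
ctrl <- rpart.control()
print(unlist(ctrl[c("minsplit", "minbucket", "cp")]))

# Default fit vs. a tree grown with no complexity penalty
fit_default <- rpart(Kyphosis ~ Age + Number + Start,
                     data = kyphosis, xval = 0)
fit_big <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                 control = rpart.control(minsplit = 2, minbucket = 1,
                                         cp = 0, xval = 0))

# Hypothetical helper: number of terminal nodes
leaves <- function(fit) sum(fit$frame$var == "<leaf>")

n_leaves <- c(default = leaves(fit_default), cp0 = leaves(fit_big))
print(n_leaves)
```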

Torsten Hothorn can explain why he chose to do what he did.  There's a
small (but only small) computational advantage in using case weights, but
the tricky issue for me is how precisely tree growth is stopped, and I
don't think that rpart at its default settings is mimicking what Breiman
was doing (he would have been growing much larger trees).
On Thu, 6 Mar 2008, apjaworski at mmm.com wrote:

            
--
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595