Floating point "fuzz" and rpart?
2 messages · Marc Feldesman, Brian Ripley

I've been using rpart with R (1.3.0 Windows) for some time. I recently ran one of my research data sets through the rpart routine and produced a classification tree. I tried to replicate the results of the rpart analysis on another machine of mine and discovered some startling differences in the results.

Puzzled, I went back to the raw data residing on both machines. I printed out both versions of the data, ran summary statistics, and plotted histograms, boxplots, and anything else I could think of. On the surface, the data sets are identical. Since the file attributes were completely different, I know that the two versions, though they originated from the same source, had been moved to R via different mechanisms. Ultimately, I read both files in as .csv tables using read.csv(). Perplexed, I gave the files different names and read both into a single version of R 1.3.0. I ran rpart on each file and got the same results as when I ran the two files on separate machines.

So I decided to do variable-by-variable comparisons using the all.equal.numeric() function. On one machine, all.equal.numeric() returns TRUE for the same set of variables in both files, while on the second machine 9 of 10 variables return answers like the following (all are approximately 2.6...e-07):

  "Mean relative difference: 2.628787e-07"

So, clearly, the two "identical" files are different somewhere in the outer reaches of floating point representation. (The two machines are identical Dell PIII XPS T700s - one has 256MB RAM, the other 512MB RAM.)

Questions:

1. Both machines have the same versions of R (with default options) and rpart, and I used one machine to propagate duplicate copies of each file to the other machine. Why would one machine report all.equal.numeric() to be TRUE for all variables, while the other machine reports 9 of 10 different in the outer floating point regions?

(Interestingly enough, the one variable reported to be "exactly" equal is the only variable of the 10 recorded to the nearest "integer" - although stored as a floating point number; the other 9 were measured in mm and recorded to the nearest tenth of a mm.)

2. Even with differences only beyond the 7th decimal place, why would rpart report such demonstrably different results with the "same" data set? Does floating point "fuzz" really make that much difference? (Rhetorical question! The answer is obvious here.)

Thoughts, insights, suggestions for further explorations welcome. Thanks.

=====================
Dr. Marc R. Feldesman
Professor and Chairman, Anthropology Department
Portland State University
1721 SW Broadway, Portland, Oregon 97201
email: feldesmanm at pdx.edu
phone: 503-725-3081  fax: 503-725-3905
http://web.pdx.edu/~h1mf
PGP Key Available On Request
======================

"Anyway, no drug, not even alcohol, causes the fundamental ills of society. If we're looking for the source of our troubles, we shouldn't test people for drugs, we should test them for stupidity, ignorance, greed and love of power." P.J. O'Rourke

Powered by Optiplochoerus and Windows 2000 (scary isn't it?)

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe" (in the "body", not the subject!)
To: r-help-request at stat.math.ethz.ch
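To make the reported behaviour concrete, here is a small sketch (with made-up numbers, not the poster's data) of how all.equal() treats vectors that print identically at default precision but differ around the 7th significant digit, as in the output quoted above:

```r
## Two vectors that look the same when printed, but whose binary
## representations differ by a relative amount of about 2.6e-7
## (the magnitude the poster reports).  Values are invented.
x <- c(123.4, 56.7, 89.1)
y <- x * (1 + 2.6e-7)        # perturb beyond the 7th significant digit

identical(x, y)              # FALSE: the bit patterns differ
all.equal(x, y)              # a string like "Mean relative difference: 2.6e-07"
all.equal(x, y, tolerance = 1e-6)  # TRUE once the tolerance is loosened
```

The default tolerance of all.equal() (about 1.5e-8) is tighter than the 2.6e-7 discrepancy, which is why it reports a mean relative difference rather than TRUE.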
On Wed, 25 Jul 2001, Marc Feldesman wrote:
> [Marc Feldesman's message, quoted in full above. His first question was:]
>
> 1. Both machines have the same versions of R (with default options) and
> rpart, and I used one machine to propagate duplicate copies of each file
> to the other machine. Why would one machine report all.equal.numeric()
> to be TRUE for all variables, while the other machine reports 9 of 10
> different in the outer floating point regions?
>
> (Interestingly enough, the one variable reported to be "exactly" equal
> is the only variable of the 10 recorded to the nearest "integer" -
> although stored as a floating point number; the other 9 were measured in
> mm and recorded to the nearest tenth of a mm.)
Windows has a load of DLLs providing the run-time system, notably msvcrt.dll. I suspect different versions of msvcrt.dll.
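One way to pin down where two apparently identical files actually diverge is to print any differing values to 17 significant digits, which is enough for %g formatting to round-trip an IEEE-754 double exactly. A sketch, using invented vectors in place of columns read with read.csv() (the function name and data here are hypothetical):

```r
## Report, at full double precision, the elements where two numeric
## vectors disagree.  "%.17g" prints enough digits to distinguish any
## two distinct IEEE-754 doubles, so last-bit differences become visible.
show_diffs <- function(x, y) {
  bad <- which(x != y)
  data.frame(index = bad,
             x = sprintf("%.17g", x[bad]),
             y = sprintf("%.17g", y[bad]))
}

## Invented example: only element 2 is perturbed, by the same relative
## amount the poster observed.
x <- c(10.1, 20.2, 30.3)
y <- x
y[2] <- y[2] * (1 + 2.6e-7)
show_diffs(x, y)   # one row: index 2, with the trailing digits differing
```

Applied column by column to two data frames read from the two files, this would show exactly which values picked up different low-order bits in transit.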
> 2. Even with differences only beyond the 7th decimal place, why would
> rpart report such demonstrably different results with the "same" data
> set? Does floating point "fuzz" really make that much difference?
> (Rhetorical question! The answer is obvious here.)
That is a little surprising (because most of rpart is in double precision in R, single precision in S). But it does make differences to `unstable' methods (in Breiman's terminology), and CART is one of the most unstable (hence bagging).

I should say that rpart_3.0-0 (the version in R 1.3.0) has a few problems (as the first of a new major revision), although I am not aware of anything giving incorrect results outside the survival area (where the author convinced himself the new results were right, and has now changed his mind). He is getting all the new features in now, in anticipation of rpart shipping with S-PLUS.

Brian
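Ripley's point about instability can be illustrated without rpart itself. A tree sends each case left or right by comparing a value with a split threshold, so an observation lying effectively on a boundary can change sides under a relative perturbation of 2.6e-7 - and every node grown below that split then changes too. The threshold and value below are invented for illustration:

```r
## A case sitting exactly on a split boundary changes branches under a
## last-digits perturbation of the size the poster observed.  Once one
## case near the root switches sides, the subtrees grown below differ,
## which is the CART instability Breiman describes.
threshold <- 24.5
x        <- 24.5
x_fuzzed <- x * (1 - 2.6e-7)

x < threshold          # FALSE: the case goes to the right branch
x_fuzzed < threshold   # TRUE:  the perturbed case goes left
```

This is why fuzz far below the data's measurement precision (a tenth of a mm here) can still produce a visibly different tree.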
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272860 (secr)
Fax: +44 1865 272595