Floating point "fuzz" and rpart?
2 messages · Marc Feldesman, Brian Ripley

I've been using rpart with R (1.3.0 Windows) for some time. I recently ran one of my research data sets through the rpart routine and produced a classification tree. I tried to replicate the results of the rpart analysis on another machine of mine and discovered some startling differences in the results.

Puzzled, I went back to the raw data residing on both machines. I printed out both versions of the data, ran summary statistics, and plotted histograms, boxplots, and anything else I could think of. On the surface, the data sets are identical. Since the file attributes were completely different, I know that the two versions, though they originated from the same source, had been moved to R via different mechanisms. Ultimately, I read both files in as .csv tables using read.csv(). Perplexed, I gave the files different names and read both into a single version of R 1.3.0. I ran rpart on each file and got the same results as when I ran the two files on separate machines.

So I decided to do variable-by-variable comparisons using the all.equal.numeric() function. On one machine, all.equal.numeric() returns TRUE for the same set of variables in both files, while on the second machine 9 of 10 variables return answers like the following (all are approximately 2.6...e-07):

  "Mean relative difference: 2.628787e-07"

So, clearly, the two "identical" files are different somewhere in the outer reaches of floating point representation. (The two machines are identical Dell PIII XPS T700s - one has 256MB RAM, the other 512MB RAM.)

Questions:

1. Both machines have the same versions of R (with default options) and rpart, and I used one machine to propagate duplicate copies of each file to the other machine. Why would one machine report all.equal.numeric() to be TRUE for all variables, while the other machine reports 9 of 10 different in the outer floating point regions?

(Interestingly enough, the one variable reported to be "exactly" equal is the only variable of the 10 recorded to the nearest "integer" - although stored as a floating point number; the other 9 were measured in mm and recorded to the nearest tenth of a mm.)

2. Even with differences only beyond the 7th decimal place, why would rpart report such demonstrably different results with the "same" data set? Does floating point "fuzz" really make that much difference? (Rhetorical question! The answer is obvious here.)

Thoughts, insights, suggestions for further explorations welcome. Thanks.

=====================
Dr. Marc R. Feldesman
Professor and Chairman, Anthropology Department
Portland State University
1721 SW Broadway, Portland, Oregon 97201
email: feldesmanm at pdx.edu
phone: 503-725-3081  fax: 503-725-3905
http://web.pdx.edu/~h1mf
PGP Key Available On Request
======================

"Anyway, no drug, not even alcohol, causes the fundamental ills of society. If we're looking for the source of our troubles, we shouldn't test people for drugs, we should test them for stupidity, ignorance, greed and love of power." P.J. O'Rourke

Powered by Optiplochoerus and Windows 2000 (scary isn't it?)

r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe" (in the "body", not the subject!)
To: r-help-request at stat.math.ethz.ch
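To make the reported behaviour concrete, here is a small sketch (with made-up numbers, not the poster's data) of how all.equal() treats vectors that print identically at default precision but differ around the 7th significant digit, as in the output quoted above:

```r
## Two vectors that look the same when printed, but whose binary
## representations differ by a relative amount of about 2.6e-7
## (the magnitude the poster reports).  Values are invented.
x <- c(123.4, 56.7, 89.1)
y <- x * (1 + 2.6e-7)        # perturb beyond the 7th significant digit

identical(x, y)              # FALSE: the bit patterns differ
all.equal(x, y)              # a string like "Mean relative difference: 2.6e-07"
all.equal(x, y, tolerance = 1e-6)  # TRUE once the tolerance is loosened
```

The default tolerance of all.equal() (about 1.5e-8) is tighter than the 2.6e-7 discrepancy, which is why it reports a mean relative difference rather than TRUE.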
On Wed, 25 Jul 2001, Marc Feldesman wrote:
> [Marc Feldesman's message, quoted in full above. His first question was:]
>
> 1. Both machines have the same versions of R (with default options) and
> rpart, and I used one machine to propagate duplicate copies of each file
> to the other machine. Why would one machine report all.equal.numeric()
> to be TRUE for all variables, while the other machine reports 9 of 10
> different in the outer floating point regions?
>
> (Interestingly enough, the one variable reported to be "exactly" equal
> is the only variable of the 10 recorded to the nearest "integer" -
> although stored as a floating point number; the other 9 were measured in
> mm and recorded to the nearest tenth of a mm.)
Windows has a load of DLLs providing the run-time system, notably msvcrt.dll. I suspect different versions of msvcrt.dll.
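One way to pin down where two apparently identical files actually diverge is to print any differing values to 17 significant digits, which is enough for %g formatting to round-trip an IEEE-754 double exactly. A sketch, using invented vectors in place of columns read with read.csv() (the function name and data here are hypothetical):

```r
## Report, at full double precision, the elements where two numeric
## vectors disagree.  "%.17g" prints enough digits to distinguish any
## two distinct IEEE-754 doubles, so last-bit differences become visible.
show_diffs <- function(x, y) {
  bad <- which(x != y)
  data.frame(index = bad,
             x = sprintf("%.17g", x[bad]),
             y = sprintf("%.17g", y[bad]))
}

## Invented example: only element 2 is perturbed, by the same relative
## amount the poster observed.
x <- c(10.1, 20.2, 30.3)
y <- x
y[2] <- y[2] * (1 + 2.6e-7)
show_diffs(x, y)   # one row: index 2, with the trailing digits differing
```

Applied column by column to two data frames read from the two files, this would show exactly which values picked up different low-order bits in transit.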
> 2. Even with differences only beyond the 7th decimal place, why would
> rpart report such demonstrably different results with the "same" data
> set? Does floating point "fuzz" really make that much difference?
> (Rhetorical question! The answer is obvious here.)
That is a little surprising (because most of rpart is in double precision in R, single precision in S). But it does make differences to `unstable' methods (in Breiman's terminology), and CART is one of the most unstable (hence bagging).

I should say that rpart_3.0-0 (the version in R 1.3.0) has a few problems (as the first of a new major revision), although I am not aware of anything giving incorrect results outside the survival area (where the author convinced himself the new results were right, and has now changed his mind). He is getting all the new features in now, in anticipation of rpart shipping with S-PLUS.

Brian
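Ripley's point about instability can be illustrated without rpart itself. A tree sends each case left or right by comparing a value with a split threshold, so an observation lying effectively on a boundary can change sides under a relative perturbation of 2.6e-7 - and every node grown below that split then changes too. The threshold and value below are invented for illustration:

```r
## A case sitting exactly on a split boundary changes branches under a
## last-digits perturbation of the size the poster observed.  Once one
## case near the root switches sides, the subtrees grown below differ,
## which is the CART instability Breiman describes.
threshold <- 24.5
x        <- 24.5
x_fuzzed <- x * (1 - 2.6e-7)

x < threshold          # FALSE: the case goes to the right branch
x_fuzzed < threshold   # TRUE:  the perturbed case goes left
```

This is why fuzz far below the data's measurement precision (a tenth of a mm here) can still produce a visibly different tree.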
Brian D. Ripley, ripley at stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
Tel: +44 1865 272861 (self), +44 1865 272860 (secr)
Fax: +44 1865 272595