cut2 not binning interval endpoints correctly

FAQ 7.31


Hi everyone,

I am attempting to bin a vector of numbers between 0 and 1 into intervals
of 0.001 but many values at the endpoints of the intervals are getting
binned into the wrong interval. For example, the first 3 rows are binned
incorrectly here:

library(Hmisc)
df <- data.frame(x = c(0.308, 0.422, 0.174, 0.04709))
df$bucket <- cut2(df$x, seq(0, 1, 0.001), oneval = FALSE)
print(df)
       x        bucket
1 0.30800 [0.307,0.308)
2 0.42200 [0.421,0.422)
3 0.17400 [0.173,0.174)
4 0.04709 [0.047,0.048)

I have tried closing and reopening RStudio after clearing the workspace and
reinstalling the Hmisc package. I am running R version 3.0.2 on Windows 7
64-bit. Thank you.


______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
From: Jim Holtman
FAQ 7.31

Maybe. But 

#and
0.308 == seq(0, 0.310, 0.001)[309]
# [1] TRUE

seems to suggest that, while some oddities may be explained by finite precision, 0.308 has exactly the same floating-point value as the corresponding break in the cut sequence here, so 0.308 should be OK.

# In addition, extending the OP's example
df <- data.frame(x = c(0.308, 0.422, 0.174, 0.04709))
df$bucket <- cut2(df$x, seq(0, 1, 0.001), oneval = FALSE)
df$cutR <- cut(df$x, seq(0, 1, 0.001), right = FALSE)
df

#         x        bucket          cutR
# 1 0.30800 [0.307,0.308) [0.308,0.309)
# 2 0.42200 [0.421,0.422) [0.422,0.423)
# 3 0.17400 [0.173,0.174) [0.173,0.174)
# 4 0.04709 [0.047,0.048) [0.047,0.048)

implies that cut2 is not doing the same thing as cut despite the same intended outcome (at least on R 3.0.1, my present version at work).
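
For fixed-width bins like these, one way to sidestep both cut2's internals and endpoint rounding is to compute an integer bin index directly. The following is only a sketch under stated assumptions: bin_fixed_width() is a hypothetical helper (not part of Hmisc or base R), the bins are left-closed, and the label format is hard-coded for a width of 0.001.

```r
# Hypothetical helper, for illustration only: bin by integer index
# instead of comparing against a vector of floating-point breaks.
bin_fixed_width <- function(x, width = 0.001, digits = 8) {
  # x / width can come out a hair above or below the intended integer
  # for values that sit on an endpoint, so round the quotient first
  # and only then take the floor.
  idx <- floor(round(x / width, digits))
  # Labels are formatted to 3 decimals, which suits width = 0.001 only.
  sprintf("[%.3f,%.3f)", idx * width, (idx + 1) * width)
}

x <- c(0.308, 0.422, 0.174, 0.04709)
buckets <- bin_fixed_width(x)
print(data.frame(x = x, bucket = buckets))
```

The rounding step treats anything within about 1e-8 bin widths of an endpoint as lying on that endpoint, which is a design choice rather than a general fix; data that genuinely fall that close to a boundary would need a different tolerance.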

This may be one for Frank Harrell ...

S Ellison

*******************************************************************
This email and any attachments are confidential. Any use...{{dropped:8}}
-----Original Message-----
From: jim holtman <jholtman at gmail.com>
> You need to look at the full accuracy of the number representation:

Um... I think I did. But I'm not sure you did....
print(..., digits=20) has used different numbers of digits for your two print()s, probably because print() decided it needed more digits for the multi-valued vector. The internal representations were the same. Try

print(seq(0, 0.310, 0.001)[309], digits = 20)
[1] 0.307999999999999996

print(seq(0, 0.310, 0.001)[309], digits = 22)
[1] 0.3079999999999999960032
print(0.308, digits = 22)
[1] 0.3079999999999999960032

0.308 does match the cut boundary 'exactly' in this case (which is why the usually unwise '==' returned TRUE), though neither is exactly 0.308. 

Nonetheless, I understand that FAQ 7.31 is a good candidate for other 'unexpected' cut2 results. However, that isn't the whole story. It doesn't explain the corresponding cut(, right=FALSE) result, which should give the same answer as cut2 if finite representation were the sole cause. So there's summat else going on.
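
A quick base-R check, offered as a sketch rather than a diagnosis, of how each of the first three example values compares with the break it nominally equals. It assumes the same break sequence as above, seq(0, 1, 0.001); the names breaks, idx and cmp are introduced here for illustration only.

```r
breaks <- seq(0, 1, 0.001)
x <- c(0.308, 0.422, 0.174)

# Position of the break that each value is "supposed" to equal
# (e.g. 0.308 corresponds to the 309th element of the sequence).
idx <- round(x / 0.001) + 1

cmp <- data.frame(
  x              = x,
  equal_to_break = x == breaks[idx],  # same double as the break?
  below_break    = x <  breaks[idx]   # strictly below the break?
)
print(cmp)
```

With IEEE 754 doubles, 0.308 and 0.422 compare equal to their breaks, while 0.174 falls just below its break. That agrees with the cut(, right=FALSE) results: only the 0.174 row looks like a FAQ 7.31 effect, so the cut2 results for 0.308 and 0.422 need another explanation.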

Steve E

You can look at the source code of Hmisc::cut2() to see what is going on -- it does
a lot more than calling cut() with different default arguments.  Another approach
to debugging this is to use trace() to see what cut2() passes down to the default
cut method:
trace(cut.default, quote(cat("   x=", deparse(x), "\n   breaks=", deparse(breaks), "\n")))
Tracing function "cut.default" in package "base"
[1] "cut.default"
z <- cut2(c(0.30800), seq(0,1,0.001)[306:315], oneval=FALSE)
Tracing cut.default(x, k2) on entry 
   x= 0.308 
   breaks= c(0.3045, 0.3055, 0.3065, 0.3075, 0.3085, 0.3095, 0.3105, 0.3115,  0.3125, 0.314)
z
[1] [0.308,0.309)
9 Levels: [0.305,0.306) [0.306,0.307) [0.307,0.308) ... [0.313,0.314]

I.e., this has little to do with floating point errors in cut(). 

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
Of S Ellison
Sent: Wednesday, November 27, 2013 9:12 AM
To: r-help at r-project.org
Subject: Re: [R] cut2 not binning interval endpoints correctly
