cut2 not binning interval endpoints correctly
6 messages · Maximilian Butler, S Ellison, jim holtman +1 more
FAQ 7.31 Sent from my iPad
On Nov 25, 2013, at 9:01, Maximilian Butler <maximilian.butler at gmail.com> wrote:
Hi everyone,
I am attempting to bin a vector of numbers between 0 and 1 into intervals
of 0.001 but many values at the endpoints of the intervals are getting
binned into the wrong interval. For example, the first 3 rows are binned
incorrectly here:
library(Hmisc)
df=data.frame(x=c(0.308,0.422,0.174,0.04709))
df$bucket=cut2(df$x,seq(0,1,0.001),oneval=FALSE)
print(df)
x bucket
1 0.30800 [0.307,0.308)
2 0.42200 [0.421,0.422)
3 0.17400 [0.173,0.174)
4 0.04709 [0.047,0.048)
I have tried closing and reopening RStudio after clearing the workspace and
reinstalling the Hmisc package. I am running R version 3.0.2 on Windows 7
64-bit. Thank you.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
-----Original Message-----
I am attempting to bin a vector of numbers between 0 and 1 into intervals of 0.001 but many values at the endpoints of the intervals are getting binned into the wrong interval. For example, the first 3 rows are binned incorrectly here:
From: Jim Holtman
FAQ 7.31
Maybe. But
#and
0.308 == seq(0, 0.310, 0.001)[309]
# [1] TRUE
seems to suggest that while some oddities may be explained by finite precision, 0.308 is exactly represented by the cut sequence here, so .308 should be OK.
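The same equality check can be run on each endpoint value from the original post. This is a minimal base-R sketch, not code from the thread; the names `breaks`, `idx`, `nominal` and `check` are illustrative. It looks up the break that each value nominally coincides with and reports whether the two doubles are equal and, if not, the sign of the difference.

```r
x <- c(0.308, 0.422, 0.174)
breaks <- seq(0, 1, 0.001)

# breaks[idx] is the break that prints the same as x (e.g. 0.308 -> element 309)
idx <- round(x * 1000) + 1
nominal <- breaks[idx]

# identical: do the stored doubles match; diff: signed gap if they do not
check <- data.frame(x = x, identical = x == nominal, diff = x - nominal)
print(check, digits = 17)
```

A negative `diff` means the value sits just below its break, so a left-closed interval such as the one produced by cut(, right=FALSE) places it in the lower bin. That is the FAQ 7.31 effect.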
#in addition, extending the OP's example
df <- data.frame(x=c(0.308,0.422,0.174,0.04709))
df$bucket <- cut2(df$x,seq(0,1,0.001),oneval=FALSE)
df$cutR <- cut(df$x,seq(0,1,0.001),right=FALSE)
df
# x bucket cutR
# 1 0.30800 [0.307,0.308) [0.308,0.309)
# 2 0.42200 [0.421,0.422) [0.422,0.423)
# 3 0.17400 [0.173,0.174) [0.173,0.174)
# 4 0.04709 [0.047,0.048) [0.047,0.048)
implies that cut2 is not doing the same thing as cut despite the same intended outcome (at least on R 3.0.1, my present version at work).
This may be one for Frank Harrell ...
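One way to sidestep both cut() and cut2() for fixed-width bins, not proposed in the thread, is to compute the bin index arithmetically and round away representation noise before flooring. The sketch below assumes a bin width of 0.001 and that any value within 1e-6 bin widths of an endpoint should count as lying on it; `bin_label` is a hypothetical helper, not an Hmisc or base R function.

```r
bin_label <- function(x, width = 0.001, digits = 6) {
  # Position of x in units of the bin width. Rounding to `digits` decimals
  # snaps values such as 307.99999999999994 back onto 308 before flooring.
  idx <- floor(round(x / width, digits))
  # The "%.3f" label format assumes the default width of 0.001.
  sprintf("[%.3f,%.3f)", idx * width, (idx + 1) * width)
}

x <- c(0.308, 0.422, 0.174, 0.04709)
data.frame(x = x, bucket = bin_label(x))
```

The tolerance is a judgment call: a value that genuinely lies less than 1e-6 bin widths below an endpoint is moved up into the next bin.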
S Ellison
*******************************************************************
This email and any attachments are confidential. Any use...{{dropped:8}}
1 day later
-----Original Message-----
jim holtman <jholtman at gmail.com>
You need to look at the full accuracy of the number representation:
Um... I think I did. But I'm not sure you did.... print(..., digits=20) has used different numbers of digits for your two print()s, probably because print() decided it needed more digits for the multi-valued vector. The internal representations were the same. Try
print(seq(0, 0.310, 0.001)[309], digits = 20)
[1] 0.307999999999999996
print(seq(0, 0.310, 0.001)[309], digits = 22)
[1] 0.3079999999999999960032
print(0.308, digits = 22)
[1] 0.3079999999999999960032
0.308 does match the cut boundary 'exactly' in this case (which is why the usually unwise '==' returned TRUE), though neither is exactly 0.308.
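Formatting with a fixed number of decimals avoids the print() behaviour described above, because sprintf() applies the same format to every element. A short sketch (the names `b` and `shown` are illustrative):

```r
b <- seq(0, 0.310, 0.001)[309]

# Same fixed format for both numbers, so the strings are directly comparable
shown <- sprintf("%.22f", c(0.308, b))
print(shown)

# identical() compares the stored doubles themselves, with no formatting involved
identical(0.308, b)
```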
Nonetheless, I understand that FAQ 7.31 is a good candidate for other 'unexpected' cut2 results. However, that isn't the whole story. It doesn't explain the corresponding cut(, right=FALSE) result, which should give the same answer as cut2 if finite representation were the sole cause. So there's summat else going on.
Steve E
You can look at the source code of Hmisc::cut2() to see what is going on -- it does a lot more than calling cut() with different default arguments. Another approach to debugging this is to use trace() to see what cut2() passes down to the default cut method:
trace(cut.default, quote(cat(" x=", deparse(x), "\n breaks=", deparse(breaks), "\n")))
Tracing function "cut.default" in package "base"
[1] "cut.default"
z <- cut2(c(0.30800), seq(0,1,0.001)[306:315], oneval=FALSE)
Tracing cut.default(x, k2) on entry
 x= 0.308
 breaks= c(0.3045, 0.3055, 0.3065, 0.3075, 0.3085, 0.3095, 0.3105, 0.3115, 0.3125, 0.314)
z
[1] [0.308,0.309)
9 Levels: [0.305,0.306) [0.306,0.307) [0.307,0.308) ... [0.313,0.314]
I.e., this has little to do with floating point errors in cut().
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
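The traced breaks are the requested cuts shifted down by 0.0005, with the last break left unchanged. The base-R sketch below reproduces those breaks under the assumption that the shift is half the smallest gap among the distinct values of the data and the cuts combined. That rule is inferred from the trace, not copied from the Hmisc source, and the names `half_gap`, `k2` and `bin` are illustrative.

```r
x <- 0.308
cuts <- seq(0, 1, 0.001)[306:315]

# Assumption: shift = half the smallest gap among the distinct values of
# x and cuts combined; the last break is left unshifted.
half_gap <- min(diff(sort(unique(c(x, cuts))))) / 2
k2 <- cuts - half_gap
k2[length(k2)] <- cuts[length(cuts)]
print(k2)

# Default cut() intervals are right-closed: 0.308 lies in (0.3075, 0.3085],
# the 4th interval, which cut2 labelled [0.308,0.309) in the trace above.
bin <- cut(x, k2)
as.integer(bin)
```

Under this assumption the shift shrinks with the smallest gap, so data containing a value very close to a break would leave the shifted breaks almost on top of the original ones.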
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of S Ellison Sent: Wednesday, November 27, 2013 9:12 AM To: r-help at r-project.org Subject: Re: [R] cut2 not binning interval endpoints correctly