what does cut(data, breaks=n) actually do?
Peter Dalgaard wrote:
melissa cline wrote:
Hello, I'm trying to bin a quantity into 2-3 bins for calculating entropy and mutual information. One of the approaches I'm exploring is the cut() function, which is what the mutualInfo function in binDist uses. When it's called in the format cut(data, breaks=n), it somehow splits the data into n distinct bins. Can anyone tell me how cut() decides where to cut?
This is one case where reading the actual R code is easier than
explaining what it does. From cut.default:
    if (length(breaks) == 1) {
        if (is.na(breaks) | breaks < 2)
            stop("invalid number of intervals")
        nb <- as.integer(breaks + 1)
        dx <- diff(rx <- range(x, na.rm = TRUE))
        if (dx == 0)
            dx <- rx[1]
        breaks <- seq.int(rx[1] - dx/1000, rx[2] + dx/1000, length.out = nb)
    }
So basically it takes the range, extends it a bit, and splits it into
<breaks> equally long segments.
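To make the arithmetic concrete, here is a minimal sketch that recomputes the breaks by hand. The helper name equal_width_breaks and the sample data are made up for illustration; the body only mirrors the snippet quoted above, and other versions of R may compute the breaks slightly differently.

```r
# Hypothetical helper mirroring the length(breaks) == 1 branch quoted above.
equal_width_breaks <- function(x, n) {
  if (is.na(n) || n < 2) stop("invalid number of intervals")
  rx <- range(x, na.rm = TRUE)   # c(min, max) of the data
  dx <- diff(rx)                 # width of the range
  if (dx == 0) dx <- rx[1]       # same fallback as the quoted snippet
  # Pad the range by 0.1% on each side, then place n + 1 equally spaced breaks.
  seq(rx[1] - dx / 1000, rx[2] + dx / 1000, length.out = n + 1)
}

x <- c(1, 2, 3, 10, 20, 100)     # illustrative, deliberately skewed data
b <- equal_width_breaks(x, 3)
counts <- table(cut(x, breaks = b))
print(b)
print(counts)
```

With skewed data like this, the bins have equal width but very unequal counts: most values land in the first bin and the middle bin stays empty.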
(For the sometimes more attractive option of splitting into groups of
roughly equal size, there is cut2 in the Hmisc package, or use quantile())
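As a rough sketch of the quantile() route, assuming data without ties (the sample vector is made up for illustration), the breaks can be passed straight to cut(). Note include.lowest = TRUE: the lowest break equals min(x), and without it the minimum would fall outside the first interval and come back as NA.

```r
x <- c(1, 2, 3, 10, 20, 100)            # illustrative data, no ties

# Tercile boundaries: the 0%, 33.3%, 66.7% and 100% quantiles.
b <- quantile(x, probs = (0:3) / 3)

# include.lowest = TRUE closes the first interval on the left,
# so that min(x) is assigned to a bin.
groups <- cut(x, breaks = b, include.lowest = TRUE)
counts <- table(groups)
print(counts)
```

Here each of the three groups ends up with two observations, even though the values are far from evenly spread.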
It can be a bit dangerous to use quantile() to provide breaks for cut(), because quantiles can be non-unique, which cut() doesn't like:
x1 <- c(1,1,1,1,1,1,1,1,1,2)
cut(x1, breaks=quantile(x1, (0:2)/2))
Error in cut.default(x1, breaks = quantile(x1, (0:2)/2)) : 'breaks' are not unique
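One possible base-R guard, offered only as a sketch and not as a recommendation, is to check the breaks for duplicates before calling cut() and drop the repeats. The variable names are illustrative.

```r
x1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2)
b <- quantile(x1, (0:2) / 2)

# anyDuplicated() returns 0 when all breaks are distinct.
has_ties <- anyDuplicated(b) > 0

# Dropping repeated breaks avoids the error, at the cost of fewer bins.
b_unique <- unique(b)
binned <- if (length(b_unique) >= 2) {
  cut(x1, breaks = b_unique, include.lowest = TRUE)
} else {
  factor(x1)                     # degenerate case: all values identical
}
print(has_ties)
print(table(binned))
```

Note that dropping duplicates collapses the data into a single bin here, so the number of groups is no longer the number requested.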
However, cut2() in Hmisc handles this situation gracefully:
library(Hmisc)
Attaching package: 'Hmisc'
The following object(s) are masked from package:base :
format.pval,
round.POSIXt,
trunc.POSIXt,
units
cut2(x1, g=2)
[1] 1 1 1 1 1 1 1 1 1 2
Levels: 1 2
(Additionally, a potentially dangerous peculiarity of quantile() for this kind of purpose is that its return values can be out of order (i.e., diff(quantile(...)) < 0, at rounding-error level), but this doesn't actually upset cut() in R because cut() sorts the breaks prior to using them.)
-- Tony Plate