
what does cut(data, breaks=n) actually do?

3 messages · melissa cline, Peter Dalgaard, Domenico Vistocco

#
melissa cline wrote:
This is one case where reading the actual R code is easier than
explaining what it does.  From cut.default:

    if (length(breaks) == 1) {
        if (is.na(breaks) | breaks < 2)
            stop("invalid number of intervals")
        nb <- as.integer(breaks + 1)
        dx <- diff(rx <- range(x, na.rm = TRUE))
        if (dx == 0)
            dx <- rx[1]
        breaks <- seq.int(rx[1] - dx/1000, rx[2] + dx/1000, length.out = nb)
    }

so basically it takes the range, extends it a bit, and splits it into
<breaks> equally long segments.
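
The break computation in the quoted snippet can be tried on its own. The
following is only a sketch: breaks_from_count is a hypothetical helper name
(not part of R), the data are made up, and the cut.default in an installed
version of R may differ in detail from the snippet quoted above.

```r
# Hypothetical helper (not part of R): rebuilds the break points the way the
# quoted cut.default snippet does when 'breaks' is a single number.
breaks_from_count <- function(x, n) {
  if (is.na(n) || n < 2) stop("invalid number of intervals")
  nb <- as.integer(n + 1)            # n intervals need n + 1 break points
  rx <- range(x, na.rm = TRUE)       # c(min, max)
  dx <- diff(rx)                     # width of the data range
  if (dx == 0) dx <- rx[1]           # constant data: use the value itself, as in the snippet
  # widen the range by dx/1000 at each end, then space the breaks equally
  seq(rx[1] - dx / 1000, rx[2] + dx / 1000, length.out = nb)
}

x <- c(2, 4, 7, 10, 12)              # made-up data with range 2..12
b <- breaks_from_count(x, 3)
print(b)                             # 4 break points from 1.99 to 12.01
print(diff(b))                       # three equal widths of 10.02 / 3
```

The small extension at both ends is what keeps min(x) inside the first
half-open interval (a, b].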

(For the sometimes more attractive option of splitting into groups of 
roughly equal size, there is cut2 in the Hmisc package, or use quantile())
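
To illustrate the difference between equal-width and roughly equal-size
groups, here is a minimal sketch using quantile() from base R. The data are
made up and deliberately skewed; cut2 from Hmisc is not used here.

```r
# Made-up, deliberately skewed data: two large values stretch the range.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50, 100)
n <- 3

# Equal-width bins: what cut(x, breaks = n) produces.
by_width <- cut(x, breaks = n)

# Roughly equal-count bins: use sample quantiles as the break points.
# include.lowest = TRUE keeps min(x) in the first bin. This assumes the
# quantiles are distinct; heavily tied data would give duplicate breaks.
q <- quantile(x, probs = seq(0, 1, length.out = n + 1))
by_count <- cut(x, breaks = q, include.lowest = TRUE)

print(table(by_width))               # counts per bin: 10, 1, 1
print(table(by_count))               # counts per bin: 4, 4, 4
```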
#
cut(data, breaks=n)
splits the range of the data into n bins of (approximately) the same width.

The bin width is obtained as:
(max(data) - min(data)) / n

 > x=rnorm(x)
 > cut(x,breaks=3)
 [1] (1.79,9.97]  (-6.39,1.79] (9.97,18.2]  (9.97,18.2]  (-6.39,1.79]
 [6] (1.79,9.97]  (-6.39,1.79] (1.79,9.97]  (-6.39,1.79] (-6.39,1.79]
Levels: (-6.39,1.79] (1.79,9.97] (9.97,18.2]

Then you have:
 > 18.2-9.97
[1] 8.23
 > 9.97-1.79
[1] 8.18
 > 1.79+6.39
[1] 8.18
 >

 > (max(x)-min(x))/3
[1] 8.164187

I don't know the reason for the small differences (I am curious about it myself).
I hope this is useful.
domenico
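
The small differences look consistent with the cut.default snippet quoted
earlier in the thread, together with the rounding of the interval labels. The
sketch below redoes the arithmetic from the numbers printed in the message; it
assumes the range is widened by dx/1000 at each end, as in that snippet, and
that the labels use cut's default dig.lab = 3 significant digits.

```r
raw_width <- 8.164187                # (max(x) - min(x)) / 3, as printed in the message
n <- 3
dx <- raw_width * n                  # the full range, max(x) - min(x)

# Per the quoted snippet, the range is widened by dx/1000 at each end
# before being divided into n equal intervals.
bin_width <- (dx + 2 * dx / 1000) / n
print(round(bin_width, 2))           # 8.18, the width seen between the labels

# The labels are rounded to 3 significant digits, so a difference of two
# labels can be off in the last printed digit.
upper <- -6.39 + n * bin_width       # approximate top break, from the rounded lowest label
print(signif(upper, 3))              # 18.2, hence 18.2 - 9.97 = 8.23
```

So 8.18 is the width after the slight widening of the range, and 8.23 comes
from subtracting two rounded labels.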
melissa cline wrote: