breaks

5 messages · Erin Hodgess, Sundar Dorai-Raj, Martin Maechler +1 more

#
Dear R People:

I have a question about a "sorting" problem, please.

I have a vector xx:
> xx
 [1] -2.0  1.4 -1.2 -2.2  0.4  1.5 -2.2  0.2 -0.4 -0.9

and a vector of breaks, xx.y:
> xx.y
[1] -2.2000000 -0.9666667  0.2666667  1.5000000

I want to produce another vector z which contains the number of the class
that each data point is in.

for instance, xx[1] is between xx.y[1] and xx.y[2], so z[1] == 1

this can be accomplished via loops, but I was wondering if there is a more
efficient method, please.

By the way, eventually, there will be many more data points and more
classes.

thank you for any help!

sincerely,
Erin Hodgess
mailto: hodgesse at uhd.edu

Version 1.7.0 R for Windows
#
Erin Hodgess wrote:
I think what you're looking for is ?cut:

R> xx = c(-2.0,  1.4, -1.2, -2.2,  0.4,  1.5, -2.2,  0.2, -0.4, -0.9)
R> cut(xx, breaks = c(-Inf, -2.2, -0.97, 0.27, 1.5, Inf))
  [1] (-2.2,-0.97] (0.27,1.5]   (-2.2,-0.97] (-Inf,-2.2]  (0.27,1.5]
  [6] (0.27,1.5]   (-Inf,-2.2]  (-0.97,0.27] (-0.97,0.27] (-0.97,0.27]
Levels: (-Inf,-2.2] (-2.2,-0.97] (-0.97,0.27] (0.27,1.5] (1.5,Inf]
R>
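
If integer class numbers are wanted rather than a factor, one possible
variant (a sketch only) passes the breaks from the question directly with
labels = FALSE. The include.lowest = TRUE argument is an assumption made
here so that points equal to the smallest break (-2.2) land in class 1
instead of becoming NA:

```r
xx   <- c(-2.0, 1.4, -1.2, -2.2, 0.4, 1.5, -2.2, 0.2, -0.4, -0.9)
xx.y <- c(-2.2000000, -0.9666667, 0.2666667, 1.5000000)

# labels = FALSE returns integer codes instead of a factor;
# include.lowest = TRUE closes the first interval on the left
z <- cut(xx, breaks = xx.y, labels = FALSE, include.lowest = TRUE)
z
# [1] 1 3 1 1 3 3 1 2 2 2
```

With the default right = TRUE, the value 1.5 falls in the last class
rather than opening a new one.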

Regards,
Sundar
#
Erin, even though you've already summarized,
I think the optimal answer to your question is

  findInterval()

{there's also an R C API that you can use from your own C/C++ code}
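
A minimal sketch of how findInterval() might be applied to the vectors
from the question. The rightmost.closed = TRUE argument is a choice made
for this illustration, so that the maximum (1.5) is counted in class 3
rather than opening a fourth class:

```r
xx   <- c(-2.0, 1.4, -1.2, -2.2, 0.4, 1.5, -2.2, 0.2, -0.4, -0.9)
xx.y <- c(-2.2000000, -0.9666667, 0.2666667, 1.5000000)

# For each element of xx, find the index i with xx.y[i] <= xx < xx.y[i + 1];
# xx.y must be sorted in non-decreasing order.
z <- findInterval(xx, xx.y, rightmost.closed = TRUE)
z
# [1] 1 3 1 1 3 3 1 2 2 2
```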

Martin
#
Martin Maechler <maechler at stat.math.ethz.ch> wrote:
Hi, Martin.  I wasn't aware of findInterval().  findInterval(x, vec) looks to
me very similar to:
  R> cut(x, c(-Inf,vec,Inf), labels=FALSE, right=FALSE) - 1
so I'm curious what the differences are (e.g. speed, duplicates in vec?).  In
any case, findInterval() and cut() ought to be in each other's "See Also",
don't you think?

R> xx <- c(-2.0, 1.4, -1.2, -2.2, 0.4, 1.5, -2.2, 0.2, -0.4, -0.9)
R> xx.y <- c(-2.2000000, -0.9666667, 0.2666667, 1.5000000)
R> findInterval(xx, xx.y)
   [1] 1 3 1 1 3 4 1 2 2 2
R> cut(xx, c(-Inf,xx.y,Inf), labels=FALSE, right=FALSE) - 1
   [1] 1 3 1 1 3 4 1 2 2 2
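
On the duplicates question, one difference can be checked directly. This
is a sketch assuming a current R version (behaviour in R 1.7.0 has not
been verified here): findInterval() only requires vec to be
non-decreasing, while cut() refuses breaks that are not unique. The
vectors x and vec below are made up for illustration:

```r
x   <- c(1, 2, 3)
vec <- c(1, 2, 2, 3)   # hypothetical breaks containing a duplicate

# findInterval() accepts ties in vec and counts how many vec values are <= x
fi <- findInterval(x, vec)
fi
# [1] 1 3 4

# cut() signals an error for duplicated breaks; capture the message
# instead of stopping the script
res <- tryCatch(
  cut(x, c(-Inf, vec, Inf), labels = FALSE, right = FALSE) - 1,
  error = function(e) conditionMessage(e)
)
res
```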
#

        
    DavidB> Martin Maechler <maechler at stat.math.ethz.ch> wrote:
    >> findInterval()

    DavidB> Hi, Martin.  I wasn't aware of findInterval().
    DavidB> findInterval(x, vec) looks to me very similar to:
    DavidB>   R> cut(x, c(-Inf,vec,Inf), labels=FALSE, right=FALSE) - 1

    DavidB> so I'm curious what the differences are (e.g. speed,
    DavidB> duplicates in vec?).  In any case, findInterval()
    DavidB> and cut() ought to be in each other's "See Also",
    DavidB> don't you think?

When I wrote the precursor of findInterval() about 10 years ago (to be
dyn.load()ed into S-plus), I wasn't yet aware of the several
alternatives.

However, when I added it to R, I knew about the N*ecdf()
alternative, i.e., ecdf() from package:stepfun, which relies on
approx(....., method = "constant").
I found that findInterval() was slightly faster than approx()
even for unsorted `x' (by about a factor of 2 for large `vec') in my
test cases, but the real speed of findInterval() comes into play
when `x' is sorted -- something which is very typical, e.g., for
evaluation of piecewise functions (splines etc.).
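
The piecewise-function use case can be sketched as follows. The knots and
values are invented for illustration, and the helper name pw_linear is
hypothetical; approx() is used only as a cross-check:

```r
knots <- c(0, 1, 2, 4)          # sorted breakpoints (made up)
yk    <- c(0, 2, 1, 5)          # function values at the knots (made up)

pw_linear <- function(x, knots, yk) {
  # all.inside = TRUE keeps the index in 1..(length(knots) - 1), so
  # knots[i + 1] always exists, even for x equal to the last knot
  i <- findInterval(x, knots, all.inside = TRUE)
  slope <- (yk[i + 1] - yk[i]) / (knots[i + 1] - knots[i])
  yk[i] + slope * (x - knots[i])
}

x <- seq(0, 4, by = 0.5)        # sorted evaluation points
y <- pw_linear(x, knots, yk)
all.equal(y, approx(knots, yk, xout = x)$y)
```

Here findInterval() only locates the segment; the interpolation itself is
ordinary vectorized arithmetic on the indices it returns.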

    DavidB> R> xx <- c(-2.0, 1.4, -1.2, -2.2, 0.4, 1.5, -2.2, 0.2, -0.4, -0.9)
    DavidB> R> xx.y <- c(-2.2000000, -0.9666667, 0.2666667, 1.5000000)
    DavidB> R> findInterval(xx, xx.y)
    DavidB> [1] 1 3 1 1 3 4 1 2 2 2
    DavidB> R> cut(xx, c(-Inf,xx.y,Inf), labels=FALSE, right=FALSE) - 1
    DavidB> [1] 1 3 1 1 3 4 1 2 2 2

cut() is still considerably slower than the ecdf() / approx() version
for long `vec'  ...
I really should write a small article about this for "R News",
where I'd also show the simulation results...
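
The simulation results themselves are not part of this thread. A rough
timing sketch along the same lines might look like the following; the
sizes, the seed, and the use of system.time() are all assumptions, and
the timings will depend on the machine and the R version:

```r
set.seed(1)
n.vec <- 1e4
vec <- sort(runif(n.vec))
# evaluation points restricted to [min(vec), max(vec)] so that all
# three approaches return the same class numbers
x  <- runif(1e5, min = vec[1], max = vec[n.vec])
xs <- sort(x)                     # sorted copy, sorted outside the timing

t.fi  <- system.time(r.fi  <- findInterval(x, vec))
t.ap  <- system.time(r.ap  <- approx(vec, seq_along(vec), xout = x,
                                     method = "constant")$y)
t.cut <- system.time(r.cut <- cut(x, c(-Inf, vec, Inf),
                                  labels = FALSE, right = FALSE) - 1)
t.srt <- system.time(r.srt <- findInterval(xs, vec))

# sanity check: identical class numbers from all three approaches
stopifnot(all(r.fi == r.ap), all(r.fi == r.cut))

# elapsed seconds for each variant
c(findInterval        = t.fi[["elapsed"]],
  approx              = t.ap[["elapsed"]],
  cut                 = t.cut[["elapsed"]],
  findInterval.sorted = t.srt[["elapsed"]])
```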

Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
phone: x-41-1-632-3408		fax: ...-1228			<><