Dear R People:
I have a question about a "sorting" problem, please.
I have a vector xx:
xx
[1] -2.0 1.4 -1.2 -2.2 0.4 1.5 -2.2 0.2 -0.4 -0.9
and a vector of breaks:
xx.y
[1] -2.2000000 -0.9666667 0.2666667 1.5000000
I want to produce another vector z which contains the number of the class
that each data point is in.
For instance, xx[1] is between xx.y[1] and xx.y[2], so z[1] == 1.
This can be accomplished via loops, but I was wondering if there is a more
efficient method, please.
By the way, eventually, there will be many more data points and more
classes.
Thank you for any help!
Sincerely,
Erin Hodgess
mailto: hodgesse at uhd.edu
Version 1.7.0 R for Windows
Erin, even though you've already summarized,
I think the optimal answer to your question is
findInterval()
{there's also an R-C API you can use from your C/C++ code}
Martin
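A minimal sketch of this suggestion applied to the data in the question, assuming the breaks are sorted in increasing order. With the defaults, the maximum value 1.5 lands in a fourth class because the intervals are closed on the left; `rightmost.closed = TRUE` folds it into the last class, which may or may not be what is wanted here. The names `z` and `z.closed` are illustrative only.

```r
xx   <- c(-2.0, 1.4, -1.2, -2.2, 0.4, 1.5, -2.2, 0.2, -0.4, -0.9)
xx.y <- c(-2.2000000, -0.9666667, 0.2666667, 1.5000000)

# Default: z[i] == k means xx.y[k] <= xx[i] < xx.y[k + 1],
# so a point equal to the last break gets class length(xx.y).
z <- findInterval(xx, xx.y)

# Treat the last interval as closed on the right, so 1.5 joins class 3.
z.closed <- findInterval(xx, xx.y, rightmost.closed = TRUE)
```

If points outside the range of the breaks should also be forced into the first or last class, the `all.inside` argument of findInterval() may be worth a look.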
"Erin" == Erin Hodgess <hodgess at uhddx01.dt.uh.edu>
on Thu, 12 Jun 2003 13:33:52 -0500 (CDT) writes:
Erin> Dear R People: I have a question about a "sorting"
Erin> problem, please.
Erin> I have a vector xx:
>> xx
Erin> [1] -2.0 1.4 -1.2 -2.2 0.4 1.5 -2.2 0.2 -0.4 -0.9
Erin> and a vector of breaks:
>> xx.y
Erin> [1] -2.2000000 -0.9666667 0.2666667 1.5000000
Erin> I want to produce another vector z which contains the
Erin> number of the class that each data point is in.
Erin> for instance, xx[1] is between xx.y[1] and xx.y[2], so
Erin> z[1] == 1
Erin> this can be accomplished via loops, but I was
Erin> wondering if there is a more efficient method, please.
Erin> By the way, eventually, there will be many more data
Erin> points and more classes.
Erin> thank you for any help!
Erin> sincerely, Erin Hodgess mailto: hodgesse at uhd.edu
Erin> Version 1.7.0 R for Windows
Erin> ______________________________________________
Erin> R-help at stat.math.ethz.ch mailing list
Erin> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Martin Maechler <maechler at stat.math.ethz.ch> wrote:
findInterval()
Hi, Martin. I wasn't aware of findInterval(). findInterval(x, vec) looks to
me very similar to:
R> cut(x, c(-Inf,vec,Inf), labels=FALSE, right=FALSE) - 1
so I'm curious what the differences are (e.g. speed, duplicates in vec?). In
any case, findInterval() and cut() ought to be in each other's "See Also",
don't you think?
R> xx <- c(-2.0, 1.4, -1.2, -2.2, 0.4, 1.5, -2.2, 0.2, -0.4, -0.9)
R> xx.y <- c(-2.2000000, -0.9666667, 0.2666667, 1.5000000)
R> findInterval(xx, xx.y)
[1] 1 3 1 1 3 4 1 2 2 2
R> cut(xx, c(-Inf,xx.y,Inf), labels=FALSE, right=FALSE) - 1
[1] 1 3 1 1 3 4 1 2 2 2
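On the duplicates question, one difference can be checked directly: cut() stops with an error when its breaks are not unique, while findInterval() only needs `vec` to be non-decreasing. The sketch below uses made-up values (`x`, `vec`, `fi` and `ct` are illustrative names, not from the thread).

```r
x   <- c(0.5, 1, 1.5)
vec <- c(0, 1, 1, 2)          # note the repeated break at 1

# findInterval() returns the largest i with vec[i] <= x
fi <- findInterval(x, vec)

# cut() refuses non-unique breaks; capture the error message instead of stopping
ct <- tryCatch(cut(x, c(-Inf, vec, Inf), labels = FALSE, right = FALSE) - 1,
               error = function(e) conditionMessage(e))
```

Here `fi` holds interval numbers, whereas `ct` ends up holding the error message from cut().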
"DavidB" == David Brahm <brahm at alum.mit.edu>
on Fri, 13 Jun 2003 10:56:29 -0400 writes:
DavidB> Martin Maechler <maechler at stat.math.ethz.ch> wrote:
>> findInterval()
DavidB> Hi, Martin. I wasn't aware of findInterval(). findInterval(x, vec) looks to
DavidB> me very similar to:
DavidB> R> cut(x, c(-Inf,vec,Inf), labels=FALSE, right=FALSE) - 1
DavidB> so I'm curious what the differences are (e.g. speed,
DavidB> duplicates in vec?). In any case, findInterval()
DavidB> and cut() ought to be in each other's "See Also",
DavidB> don't you think?
When I wrote the precursor of findInterval() about 10 years ago (to be
dyn.load()ed into S-plus), I wasn't yet aware of the
several alternatives.
However, when I added it to R, I knew about the N*ecdf()
alternative, i.e., ecdf() from package:stepfun which relies on
approx(....., method = "constant").
I found that findInterval() was slightly faster than approx()
even for unsorted `x' (by about a factor of 2 for large `vec') in my
test cases, but the real speed of findInterval() comes into play
when `x' is sorted -- something which is very typical, e.g., for
the evaluation of piecewise functions (splines etc).
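To make the N*ecdf() alternative concrete, here is a hedged sketch on simulated data. It assumes a current R in which ecdf() is available without loading an extra package (it now lives in stats), and all variable names are illustrative. For sorted `vec', N * ecdf(vec)(x) counts how many breaks are <= x, which is the same number findInterval() returns.

```r
set.seed(1)
vec <- sort(runif(20))         # sorted break points (illustrative)
x   <- runif(1000, -0.2, 1.2)  # points to classify, some outside range(vec)
N   <- length(vec)

z.fi <- findInterval(x, vec)

# N * ecdf(vec)(x) is the number of breaks <= x; round() guards
# against floating-point error in the product.
z.ecdf <- as.integer(round(N * ecdf(vec)(x)))

# The same step function via approx(), with explicit values outside the range.
z.appr <- as.integer(approx(vec, seq_len(N), xout = x, method = "constant",
                            yleft = 0, yright = N)$y)

stopifnot(identical(z.fi, z.ecdf), identical(z.fi, z.appr))
```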
DavidB> R> xx <- c(-2.0, 1.4, -1.2, -2.2, 0.4, 1.5, -2.2, 0.2, -0.4, -0.9)
DavidB> R> xx.y <- c(-2.2000000, -0.9666667, 0.2666667, 1.5000000)
DavidB> R> findInterval(xx, xx.y)
DavidB> [1] 1 3 1 1 3 4 1 2 2 2
DavidB> R> cut(xx, c(-Inf,xx.y,Inf), labels=FALSE, right=FALSE) - 1
DavidB> [1] 1 3 1 1 3 4 1 2 2 2
cut() is still considerably slower than the ecdf() / approx() version
for long `vec' ...
I really should write a small article about this for "R News",
where I'd also show the simulation results...
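For anyone who wants to try such a comparison, the following is only a rough sketch, not the simulation referred to above. The sizes, the seed and the helper names (`f.fi`, `f.cut`, `f.app`) are arbitrary assumptions, and the timings will depend on the machine and the R version, so no results are quoted.

```r
set.seed(42)
vec <- sort(rnorm(1e4))   # long vector of breaks (size chosen arbitrarily)
x   <- rnorm(1e5)         # unsorted points
xs  <- sort(x)            # the sorted case

f.fi  <- function(x) findInterval(x, vec)
f.cut <- function(x) cut(x, c(-Inf, vec, Inf), labels = FALSE, right = FALSE) - 1L
f.app <- function(x) approx(vec, seq_along(vec), xout = x, method = "constant",
                            yleft = 0, yright = length(vec))$y

# All three should agree before their speed is compared
stopifnot(all(f.fi(x) == f.cut(x)), all(f.fi(x) == f.app(x)))

timings <- rbind(
  findInterval.unsorted = system.time(f.fi(x)),
  findInterval.sorted   = system.time(f.fi(xs)),
  approx.unsorted       = system.time(f.app(x)),
  cut.unsorted          = system.time(f.cut(x))
)
print(timings[, "elapsed"])
```

A single system.time() call is noisy; repeating each call several times would give steadier numbers.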
Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
phone: x-41-1-632-3408 fax: ...-1228 <><