An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110404/b8024964/attachment.pl>
General binary search?
5 messages · William Dunlap, Matt Dowle, Martin Morgan +1 more
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Stavros Macrakis Sent: Monday, April 04, 2011 1:15 PM To: r-help Subject: [R] General binary search? Is there a generic binary search routine in a standard library which a) works for character vectors b) runs in O(log(N)) time? I'm aware of findInterval(x,vec), but it is restricted to numeric vectors.
xtfrm(x) will convert a character (or other) vector to
a numeric vector with the same ordering. findInterval
can work on that. E.g.,
> f0 <- function(x, vec) {
tmp <- xtfrm(c(x, vec))
findInterval(tmp[seq_along(x)], tmp[-seq_along(x)])
}
> f0(c("Baby", "Aunt", "Dog"), LETTERS)
[1] 2 1 4
I've never looked at its speed.
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
I'm also aware of various hashing solutions (e.g. new.env(hash=TRUE) and fastmatch), but I need the greatest-lower-bound match in my application. findInterval is also slow for large N=length(vec) because of the O(N) checking it does, as Duncan Murdoch has pointed out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>: though its documentation says it runs in O(n * log(N)), it actually runs in O(n * log(N) + N), which is quite noticeable for largish N. But that is easy enough to work around by writing a variant of findInterval which calls find_interv_vec without checking. -s PS Yes, binary search is a one-liner in R, but I always prefer to use standard, fast native libraries when possible.... binarysearch <- function(val,tab,L,H) {while (H>=L) { M=L+(H-L) %/% 2; if (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)}; return(L-1)} [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Try data.table:::sortedmatch, which is implemented in C. It requires it's input to be sorted (and doesn't check) "Stavros Macrakis" <macrakis at alum.mit.edu> wrote in message news:BANLkTi=j2LF5SyXYTV1Dd4k9wr0ZGK88ig at mail.gmail.com...
Is there a generic binary search routine in a standard library which a) works for character vectors b) runs in O(log(N)) time? I'm aware of findInterval(x,vec), but it is restricted to numeric vectors. I'm also aware of various hashing solutions (e.g. new.env(hash=TRUE) and fastmatch), but I need the greatest-lower-bound match in my application. findInterval is also slow for large N=length(vec) because of the O(N) checking it does, as Duncan Murdoch has pointed out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>: though its documentation says it runs in O(n * log(N)), it actually runs in O(n * log(N) + N), which is quite noticeable for largish N. But that is easy enough to work around by writing a variant of findInterval which calls find_interv_vec without checking. -s PS Yes, binary search is a one-liner in R, but I always prefer to use standard, fast native libraries when possible.... binarysearch <- function(val,tab,L,H) {while (H>=L) { M=L+(H-L) %/% 2; if (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)}; return(L-1)} [[alternative HTML version deleted]]
1 day later
On 04/04/2011 01:50 PM, William Dunlap wrote:
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Stavros Macrakis
Sent: Monday, April 04, 2011 1:15 PM
To: r-help
Subject: [R] General binary search?
Is there a generic binary search routine in a standard library which
a) works for character vectors
b) runs in O(log(N)) time?
I'm aware of findInterval(x,vec), but it is restricted to
numeric vectors.
xtfrm(x) will convert a character (or other) vector to a numeric vector with the same ordering. findInterval can work on that. E.g.,
> f0<- function(x, vec) {
tmp<- xtfrm(c(x, vec))
findInterval(tmp[seq_along(x)], tmp[-seq_along(x)])
}
> f0(c("Baby", "Aunt", "Dog"), LETTERS)
[1] 2 1 4 I've never looked at its speed.
For a little progress (though no 'generic binary searchin a standard
library'), here's the 'one-liner'
bsearch1 <-
function(val, tab, L=1L, H=length(tab))
{
while (H >= L) {
M <- L + (H - L) %/% 2L
if (tab[M] > val) H <- M - 1L
else if (tab[M] < val) L <- M + 1L
else return(M)
}
return(L - 1L)
}
It seems like a good candidate for the new (R-2.13) 'compiler' package, so
library(compiler)
bsearch2 <- cmpfun(bsearch1)
And Bill's suggestion
bsearch3 <- function(val, tab) {
tmp <- xtfrm(c(val, tab))
findInterval(tmp[seq_along(val)], tmp[-seq_along(val)])
}
which will work best when 'val' is a vector to be looked up.
A quick look at data.table:::sortedmatch seemed to return matches,
whereas Stavros is looking for lower bounds.
It seems that one could shift the weight more to C code by 'vectorizing'
the one-liner, first as
bsearch5 <-
function(val, tab, L=1L, H=length(tab))
{
b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
i0 <- seq_along(val)
repeat {
M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
i <- tab[M] > val[i0]
b[i0 + i * length(val)] <-
ifelse(i, M - 1L, ifelse(tab[M] < val[i0], M + 1L, M))
i0 <- which(b[i0, "H"] >= b[i0, "L"])
if (!length(i0)) break;
}
b[,"L"] - 1L
}
and then a little more thoughtfully (though more room for improvement) as
bsearch7 <-
function(val, tab, L=1L, H=length(tab))
{
b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
i0 <- seq_along(val)
repeat {
updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
tabM <- tab[M]
val0 <- val[i0]
i <- tabM < val0
updt[i] <- M[i] + 1L
i <- tabM > val0
updt[i] <- M[i] - 1L
b[i0 + i * length(val)] <- updt
i0 <- which(b[i0, "H"] >= b[i0, "L"])
if (!length(i0)) break;
}
b[,"L"] - 1L
}
none of bsearch 3, 5, or 7 is likely to benefit substantially from
compilation.
Here's a little test data set converting numeric to character as an easy
cheat.
set.seed(123L)
x <- sort(as.character(rnorm(1e6)))
y <- as.character(rnorm(1e4))
There seems to be some significant initial overhead, so we warm things
up (and also introduce the paradigm for multiple look-ups in bsearch 1, 2)
warmup <- function(y, x) {
lapply(y, bsearch1, x)
lapply(y, bsearch2, x)
bsearch3(y, x)
bsearch5(y, x)
bsearch7(y, x)
}
replicate(3, warmup(y, x))
and then time
> system.time(res1 <- unlist(lapply(y, bsearch1, x), use.names=FALSE))
user system elapsed
2.692 0.000 2.696
> system.time(res2 <- unlist(lapply(y, bsearch2, x), use.names=FALSE))
user system elapsed
1.379 0.000 1.380
> identical(res1, res2)
[1] TRUE
> system.time(res3 <- bsearch3(y, x)); identical(res1, res3)
user system elapsed
8.339 0.001 8.350
[1] TRUE
> system.time(res5 <- bsearch5(y, x)); identical(res1, res5)
user system elapsed
0.700 0.000 0.702
[1] TRUE
> system.time(res7 <- bsearch7(y, x)); identical(res1, res7)
user system elapsed
0.222 0.000 0.222
[1] TRUE
Martin
Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com
I'm also aware of various hashing solutions (e.g. new.env(hash=TRUE) and fastmatch), but I need the greatest-lower-bound match in my application. findInterval is also slow for large N=length(vec) because of the O(N) checking it does, as Duncan Murdoch has pointed out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>: though its documentation says it runs in O(n * log(N)), it actually runs in O(n * log(N) + N), which is quite noticeable for largish N. But that is easy enough to work around by writing a variant of findInterval which calls find_interv_vec without checking. -s PS Yes, binary search is a one-liner in R, but I always prefer to use standard, fast native libraries when possible.... binarysearch<- function(val,tab,L,H) {while (H>=L) { M=L+(H-L) %/% 2; if (tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)}; return(L-1)} [[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Computational Biology Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: M1-B861 Telephone: 206 667-2793
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110407/022220b5/attachment.pl>