General binary search?

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110404/b8024964/attachment.pl>
-----Original Message-----
From: r-help-bounces at r-project.org 
[mailto:r-help-bounces at r-project.org] On Behalf Of Stavros Macrakis
Sent: Monday, April 04, 2011 1:15 PM
To: r-help
Subject: [R] General binary search?

Is there a generic binary search routine in a standard library which

   a) works for character vectors
   b) runs in O(log(N)) time?

I'm aware of findInterval(x,vec), but it is restricted to 
numeric vectors.
xtfrm(x) will convert a character (or other) vector to
a numeric vector with the same ordering.  findInterval
can work on that.  E.g., 
   > f0 <- function(x, vec) {
       tmp <- xtfrm(c(x, vec))
       findInterval(tmp[seq_along(x)], tmp[-seq_along(x)])
     }
   > f0(c("Baby", "Aunt", "Dog"), LETTERS)
   [1] 2 1 4
I've never looked at its speed.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
I'm also aware of various hashing solutions (e.g. 
new.env(hash=TRUE) and
fastmatch), but I need the greatest-lower-bound match in my 
application.

findInterval is also slow for large N=length(vec) because of the O(N)
checking it does, as Duncan Murdoch has pointed
out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>:
though
its documentation says it runs in O(n * log(N)), it actually 
runs in O(n *
log(N) + N), which is quite noticeable for largish N.  But 
that is easy
enough to work around by writing a variant of findInterval which calls
find_interv_vec without checking.

            -s

PS Yes, binary search is a one-liner in R, but I always prefer to use
standard, fast native libraries when possible....

binarysearch <- function(val,tab,L,H) {while (H>=L) { 
M=L+(H-L) %/% 2; if
(tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)};
return(L-1)}

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide 
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Try data.table:::sortedmatch, which is implemented in C.
It requires it's input to be sorted (and doesn't check)

"Stavros Macrakis" <macrakis at alum.mit.edu> wrote in message 
news:BANLkTi=j2LF5SyXYTV1Dd4k9wr0ZGK88ig at mail.gmail.com...
Is there a generic binary search routine in a standard library which

  a) works for character vectors
  b) runs in O(log(N)) time?

I'm aware of findInterval(x,vec), but it is restricted to numeric vectors.

I'm also aware of various hashing solutions (e.g. new.env(hash=TRUE) and
fastmatch), but I need the greatest-lower-bound match in my application.

findInterval is also slow for large N=length(vec) because of the O(N)
checking it does, as Duncan Murdoch has pointed
out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>:
though
its documentation says it runs in O(n * log(N)), it actually runs in O(n *
log(N) + N), which is quite noticeable for largish N.  But that is easy
enough to work around by writing a variant of findInterval which calls
find_interv_vec without checking.

           -s

PS Yes, binary search is a one-liner in R, but I always prefer to use
standard, fast native libraries when possible....

binarysearch <- function(val,tab,L,H) {while (H>=L) { M=L+(H-L) %/% 2; if
(tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)};
return(L-1)}

[[alternative HTML version deleted]]

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Stavros Macrakis
Sent: Monday, April 04, 2011 1:15 PM
To: r-help
Subject: [R] General binary search?

Is there a generic binary search routine in a standard library which

    a) works for character vectors
    b) runs in O(log(N)) time?

I'm aware of findInterval(x,vec), but it is restricted to
numeric vectors.
xtfrm(x) will convert a character (or other) vector to
a numeric vector with the same ordering.  findInterval
can work on that.  E.g.,
    >  f0<- function(x, vec) {
        tmp<- xtfrm(c(x, vec))
        findInterval(tmp[seq_along(x)], tmp[-seq_along(x)])
      }
    >  f0(c("Baby", "Aunt", "Dog"), LETTERS)
    [1] 2 1 4
I've never looked at its speed.
For a little progress (though no 'generic binary searchin a standard 
library'), here's the 'one-liner'

bsearch1 <-
     function(val, tab, L=1L, H=length(tab))
{
     while (H >= L) {
         M <- L + (H - L) %/% 2L
         if (tab[M] > val) H <- M - 1L
         else if (tab[M] < val) L <- M + 1L
         else return(M)
     }
     return(L - 1L)
}

It seems like a good candidate for the new (R-2.13) 'compiler' package, so

library(compiler)
bsearch2 <- cmpfun(bsearch1)

And Bill's suggestion

bsearch3 <- function(val, tab) {
     tmp <- xtfrm(c(val, tab))
     findInterval(tmp[seq_along(val)], tmp[-seq_along(val)])
}

which will work best when 'val' is a vector to be looked up.

A quick look at data.table:::sortedmatch seemed to return matches, 
whereas Stavros is looking for lower bounds.

It seems that one could shift the weight more to C code by 'vectorizing' 
the one-liner, first as

bsearch5 <-
     function(val, tab, L=1L, H=length(tab))
{
     b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
     i0 <- seq_along(val)
     repeat {
         M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
         i <- tab[M] > val[i0]
         b[i0 + i * length(val)] <-
             ifelse(i, M - 1L, ifelse(tab[M] < val[i0], M + 1L, M))
         i0 <- which(b[i0, "H"] >= b[i0, "L"])
         if (!length(i0)) break;
     }
     b[,"L"] - 1L
}

and then a little more thoughtfully (though more room for improvement) as

bsearch7 <-
     function(val, tab, L=1L, H=length(tab))
{
     b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
     i0 <- seq_along(val)
     repeat {
         updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
         tabM <- tab[M]
         val0 <- val[i0]
         i <- tabM < val0
         updt[i] <- M[i] + 1L
         i <- tabM > val0
         updt[i] <- M[i] - 1L
         b[i0 + i * length(val)] <- updt
         i0 <- which(b[i0, "H"] >= b[i0, "L"])
         if (!length(i0)) break;
     }
     b[,"L"] - 1L
}

none of bsearch 3, 5, or 7 is likely to benefit substantially from 
compilation.

Here's a little test data set converting numeric to character as an easy 
cheat.

set.seed(123L)
x <- sort(as.character(rnorm(1e6)))
y <- as.character(rnorm(1e4))

There seems to be some significant initial overhead, so we warm things 
up (and also introduce the paradigm for multiple look-ups in bsearch 1, 2)

warmup <- function(y, x) {
     lapply(y, bsearch1, x)
     lapply(y, bsearch2, x)
     bsearch3(y, x)
     bsearch5(y, x)
     bsearch7(y, x)
}
replicate(3, warmup(y, x))

and then time

 > system.time(res1 <- unlist(lapply(y, bsearch1, x), use.names=FALSE))
    user  system elapsed
   2.692   0.000   2.696
 > system.time(res2 <- unlist(lapply(y, bsearch2, x), use.names=FALSE))
    user  system elapsed
   1.379   0.000   1.380
 > identical(res1, res2)
[1] TRUE
 > system.time(res3 <- bsearch3(y, x)); identical(res1, res3)
    user  system elapsed
   8.339   0.001   8.350
[1] TRUE
 > system.time(res5 <- bsearch5(y, x)); identical(res1, res5)
    user  system elapsed
   0.700   0.000   0.702
[1] TRUE
 > system.time(res7 <- bsearch7(y, x)); identical(res1, res7)
    user  system elapsed
   0.222   0.000   0.222
[1] TRUE

Martin
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

I'm also aware of various hashing solutions (e.g.
new.env(hash=TRUE) and
fastmatch), but I need the greatest-lower-bound match in my
application.

findInterval is also slow for large N=length(vec) because of the O(N)
checking it does, as Duncan Murdoch has pointed
out<https://stat.ethz.ch/pipermail/r-help/2008-September/174584.html>:
though
its documentation says it runs in O(n * log(N)), it actually
runs in O(n *
log(N) + N), which is quite noticeable for largish N.  But
that is easy
enough to work around by writing a variant of findInterval which calls
find_interv_vec without checking.

             -s

PS Yes, binary search is a one-liner in R, but I always prefer to use
standard, fast native libraries when possible....

binarysearch<- function(val,tab,L,H) {while (H>=L) {
M=L+(H-L) %/% 2; if
(tab[M]>val) H<-M-1 else if (tab[M]<val) L<-M+1 else return(M)};
return(L-1)}

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110407/022220b5/attachment.pl>