Skip to content
Prev 255967 / 398506 Next

General binary search?

On 04/04/2011 01:50 PM, William Dunlap wrote:
For a little progress (though no 'generic binary searchin a standard 
library'), here's the 'one-liner'

bsearch1 <-
     function(val, tab, L=1L, H=length(tab))
{
     while (H >= L) {
         M <- L + (H - L) %/% 2L
         if (tab[M] > val) H <- M - 1L
         else if (tab[M] < val) L <- M + 1L
         else return(M)
     }
     return(L - 1L)
}

It seems like a good candidate for the new (R-2.13) 'compiler' package, so

library(compiler)
bsearch2 <- cmpfun(bsearch1)

And Bill's suggestion

bsearch3 <- function(val, tab) {
     tmp <- xtfrm(c(val, tab))
     findInterval(tmp[seq_along(val)], tmp[-seq_along(val)])
}

which will work best when 'val' is a vector to be looked up.

A quick look at data.table:::sortedmatch seemed to return matches, 
whereas Stavros is looking for lower bounds.

It seems that one could shift the weight more to C code by 'vectorizing' 
the one-liner, first as

bsearch5 <-
     function(val, tab, L=1L, H=length(tab))
{
     b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
     i0 <- seq_along(val)
     repeat {
         M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
         i <- tab[M] > val[i0]
         b[i0 + i * length(val)] <-
             ifelse(i, M - 1L, ifelse(tab[M] < val[i0], M + 1L, M))
         i0 <- which(b[i0, "H"] >= b[i0, "L"])
         if (!length(i0)) break;
     }
     b[,"L"] - 1L
}

and then a little more thoughtfully (though more room for improvement) as

bsearch7 <-
     function(val, tab, L=1L, H=length(tab))
{
     b <- cbind(L=rep(L, length(val)), H=rep(H, length(val)))
     i0 <- seq_along(val)
     repeat {
         updt <- M <- b[i0,"L"] + (b[i0,"H"] - b[i0,"L"]) %/% 2L
         tabM <- tab[M]
         val0 <- val[i0]
         i <- tabM < val0
         updt[i] <- M[i] + 1L
         i <- tabM > val0
         updt[i] <- M[i] - 1L
         b[i0 + i * length(val)] <- updt
         i0 <- which(b[i0, "H"] >= b[i0, "L"])
         if (!length(i0)) break;
     }
     b[,"L"] - 1L
}

none of bsearch 3, 5, or 7 is likely to benefit substantially from 
compilation.

Here's a little test data set converting numeric to character as an easy 
cheat.

set.seed(123L)
x <- sort(as.character(rnorm(1e6)))
y <- as.character(rnorm(1e4))

There seems to be some significant initial overhead, so we warm things 
up (and also introduce the paradigm for multiple look-ups in bsearch 1, 2)

warmup <- function(y, x) {
     lapply(y, bsearch1, x)
     lapply(y, bsearch2, x)
     bsearch3(y, x)
     bsearch5(y, x)
     bsearch7(y, x)
}
replicate(3, warmup(y, x))

and then time

 > system.time(res1 <- unlist(lapply(y, bsearch1, x), use.names=FALSE))
    user  system elapsed
   2.692   0.000   2.696
 > system.time(res2 <- unlist(lapply(y, bsearch2, x), use.names=FALSE))
    user  system elapsed
   1.379   0.000   1.380
 > identical(res1, res2)
[1] TRUE
 > system.time(res3 <- bsearch3(y, x)); identical(res1, res3)
    user  system elapsed
   8.339   0.001   8.350
[1] TRUE
 > system.time(res5 <- bsearch5(y, x)); identical(res1, res5)
    user  system elapsed
   0.700   0.000   0.702
[1] TRUE
 > system.time(res7 <- bsearch7(y, x)); identical(res1, res7)
    user  system elapsed
   0.222   0.000   0.222
[1] TRUE

Martin