Skip to content

Search for locations of subsequences?

6 messages · Marc Schwartz, Rui Barradas, Duncan Murdoch

#
Is there a function to efficiently search for a subsequence within a vector?

For example, with

x <- 1:100

I'd like to search for the sequence c(49,50,51), and be told that it 
occurs exactly once, starting at location 49.  (The items in the vectors 
might be numeric or character, and there might be repetitions within the 
search pattern or within the vector I'm searching.)

Duncan Murdoch
#
On Aug 28, 2012, at 4:05 PM, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:

            
There is a thread here:

  http://tolstoy.newcastle.edu.au/R/e17/help/12/02/4201.html

that has various solutions and towards the end has a post with some timings.

Regards,

Marc Schwartz
#
Hello,

Try the following.

occur1 <- function(pat, vec){
     m <- length(pat)
     n <- length(vec)
     candidate <- seq.int(length=n-m+1)
     for (i in seq.int(length=m))
         candidate <- candidate[pat[i] == vec[candidate + i - 1]]
     candidate
}

patrn <- c(1,2,3,4)
exmpl <- c(3,3,4,2,3,1,2,3,4,8,8,23,1,2,3,4,4,34,4,3,2,1,1,2,3,4)

occur1(patrn, exmpl)
[1]  6 13 23

This was posted some time ago. And seems to be the fastest of the 7 
solutions.
https://stat.ethz.ch/pipermail/r-help/2012-February/303756.html

Hope this helps,

Rui Barradas

Em 28-08-2012 22:05, Duncan Murdoch escreveu:
#
On 12-08-28 5:32 PM, Marc Schwartz wrote:
Thanks, that's helpful.  I wonder if we should have a function based on 
the fast solution in that thread (or something even faster written all 
in compiled code; I suspect Boyer-Moore could do well, but not coded in R).

Duncan Murdoch
#
On 12-08-28 5:45 PM, Rui Barradas wrote:
Thanks!

Duncan Murdoch
#
Hello,

Function Biostrings::matchPattern can be called with an algorithm = 
"boyer-moore" argument.
I've never used it, this is the return value of

library(sos)
r1 <- findFn('boyer')
r2 <- findFn('moore')
r1 & r2

I have implemented the Boyer-Moore algorithm a couple of times, the 
first(!) of all in 8086 assembly, but I'm seeing a difficulty regarding 
your original request.
A Boyer-Moore algorithm to search for subsequences of character vectors 
all of which such that nchar(x) is 1 should be very easy to implement 
using the .Call interface, but for integer vectors I am not seeing how 
to implement the bad character shift table. What would be the alphabet? 
The set of 32-bit integers? In this case the table length would be 
prohibitive...

Ideas anyone?

Rui Barradas

Em 28-08-2012 22:05, Duncan Murdoch escreveu: