Skip to content

matching a sequence in a vector?

16 messages · Redding, Matthew, chuck.01, Rainer M Krug +5 more

#
this is ugly, but...

l <-length(patrn)
l2 <-length(exmpl)

out <- vector("list")
for(i in 1:(l2-l+1))
     {
exmpl[i:(i+l-1)]
patrn==exmpl[i:(i+l-1)]
if(all(patrn==exmpl[i:(i+l-1)]))
{ out[[i]] <- i  } else { out[[i]] <- "NA"}
      }

out <- do.call(c, out)
as.numeric(out[which(out!="NA")])


## Cheers and HTH



Redding, Matthew-2 wrote
--
View this message in context: http://r.789695.n4.nabble.com/matching-a-sequence-in-a-vector-tp4389523p4389560.html
Sent from the R help mailing list archive at Nabble.com.
#
On Wed, Feb 15, 2012 at 02:17:35PM +1000, Redding, Matthew wrote:
Hi.

If the pattern is not too long, try

  m <- length(patrn)
  n <- length(exmpl)
  ind <- seq.int(length=n-m+1)
  occur <- rep(TRUE, times=n-m+1)
  for (i in seq.int(length=m)) {
      occur <- occur & (patrn[i] == exmpl[ind + i - 1])
  }
  which(occur)

  [1]  6 13 23

Hope this helps.

Petr Savicky.
#
On Wed, Feb 15, 2012 at 02:17:35PM +1000, Redding, Matthew wrote:
Hi.

A more efficient version of the previous suggestion
is as follows.

  m <- length(patrn)
  n <- length(exmpl)
  candidate <- seq.int(length=n-m+1)
  for (i in seq.int(length=m)) {
      candidate <- candidate[patrn[i] == exmpl[candidate + i - 1]]
  }
  candidate

  [1]  6 13 23

In this solution, the set of candidate indices decreases. If
the prefixes of the searched pattern are rare, the set of
candidates is reduced in a few iterations and the remaining
iterations become faster.

Hope this helps.

Petr Savicky.
#
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 15/02/12 05:17, Redding, Matthew wrote:
No actual solution - but this sounds to me like a "moving window"
statistic. I googled for "moving window R" and found, among others,
the following:

http://tolstoy.newcastle.edu.au/R/help/04/10/5161.html

Maybe this can give you some additional ideas - for your question, you
would not calculate the e.g. mean of the moving window, but check if
the sequence in the window is equal to the one you are looking for.

Cheers,

Rainer
- -- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

Tel :       +33 - (0)9 53 10 27 44
Cell:       +33 - (0)6 85 62 59 98
Fax :       +33 - (0)9 58 10 27 44

Fax (D):    +49 - (0)3 21 21 25 22 44

email:      Rainer at krugs.de

Skype:      RMkrug
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk87b3wACgkQoYgNqgF2egpqxACeNIMFFIDM6oqyejLR5yewNz2W
R2AAn1elVRr0zqbADRFyZupWnMirAuZy
=BXd9
-----END PGP SIGNATURE-----
#
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
On 15/02/12 05:17, Redding, Matthew wrote:
Just another idea: what about converting the vector to a character vector

exmplstr <- paste(exmpl, collapse="")
patrnstr <- paste(patrn, collapse="")

and then search for patrnstr in exmplstr?

Rainer
- -- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Stellenbosch University
South Africa

Tel :       +33 - (0)9 53 10 27 44
Cell:       +33 - (0)6 85 62 59 98
Fax :       +33 - (0)9 58 10 27 44

Fax (D):    +49 - (0)3 21 21 25 22 44

email:      Rainer at krugs.de

Skype:      RMkrug
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk87dJ4ACgkQoYgNqgF2egojXACcDPXGfqB+8+Cmz80z+plX376y
FX4An1+PWTr3OJceYVCHYxz4y02FZ/ei
=ru9x
-----END PGP SIGNATURE-----
#
On 15-02-2012, at 05:17, Redding, Matthew wrote:

            
patrn.rev <- rev(patrn)
w <- embed(exmpl,length(patrn))
w.pos <- apply(w,1,function(r) all(r == patrn.rev))
which(w.pos)

You can substitute the last three lines to get a oneliner.

Berend
#
On Wed, Feb 15, 2012 at 10:26:44AM +0100, Berend Hasselman wrote:
Hi.

If the speed is an issue and exmpl is long, the
following modification may be faster.

  patrn.rev <- rev(patrn)
  w <- embed(exmpl,length(patrn))
  which(rowSums(w == rep(patrn.rev, each=nrow(w))) == ncol(w))

  [1]  6 13 23

For length(patrn) = 11 and length(exmpl) = 10000, i obtained
a speed up by a factor of 10.

Hope this helps.

How large are the vectors "patrn" and "exmpl" in your application?

Petr Savicky.
#
On 02/14/2012 11:45 PM, Petr Savicky wrote:
match(exmpl, patrn) returns indexes that differ by 1 if the sequence 
patrn occurs

   n = length(patrn)
   r = rle(diff(match(exmpl, patrn)) == 1)

we're looking for a run of TRUE's of length 3, and can find their ends 
(of the runs of diffs) as cumsum(r$length)

   cumsum(r$length)[r$values & r$length == (n - 1)] - (n - 2)

Seems like there could be edge cases that I'm missing...

Martin

  
    
#
On Wed, Feb 15, 2012 at 06:27:01AM -0800, Martin Morgan wrote:
Hi Martin:

This is a nice solution. In my opinion, it works, whenever "patrn"
does not contain duplicates.

Petr.
#
On 15-02-2012, at 15:27, Martin Morgan wrote:

            
Clever.
However it is quite slow.

Berend
#
Thankyou all for your great, and creative solutions.

There is definately more than one way to skin a cat.

A colleague alerted me to another solution:

seq.strt <- which( sapply( 1:(length(exmpl)-length(patrn)+1), function(i) isTRUE( all.equal( patrn, exmpl[ i + 0:(length(patrn)-1) ] ) ) ) ) 

Apparently this came from a post in the help archive that my searches missed.

I think the solutions you have put up are more readable.

Kind regards

Matt
********************************DISCLAIMER****************************
The information contained in the above e-mail message or messages 
(which includes any attachments) is confidential and may be legally 
privileged.  It is intended only for the use of the person or entity 
to which it is addressed.  If you are not the addressee any form of 
disclosure, copying, modification, distribution or any action taken 
or omitted in reliance on the information is unauthorised.  Opinions 
contained in the message(s) do not necessarily reflect the opinions 
of the Queensland Government and its authorities.  If you received 
this communication in error, please notify the sender immediately 
and delete it from your computer system network.
#
yet another solution (I think)
+     all(exmpl[.indx] == patrn)
+ })
[1] 23 13  6
On Wed, Feb 15, 2012 at 6:32 PM, Redding, Matthew
<Matthew.Redding at deedi.qld.gov.au> wrote:

  
    
#
On Tue, Feb 14, 2012 at 11:17 PM, Redding, Matthew
<Matthew.Redding at deedi.qld.gov.au> wrote:
Here is a one-liner:

library(zoo)
which(rollapply(exmpl, 4, identical, patrn, fill = FALSE, align = "left"))
#
On Wed, Feb 15, 2012 at 08:12:32PM -0500, Gabor Grothendieck wrote:
Hi.

There were several solutions in this thread. Their speed differs
quite significantly. Here is a comparison.

  patrn <- 1:4
  exmpl <- sample(1:4, 10000, replace=TRUE)
  
  occur1 <- function(patrn, exmpl)
  {
    m <- length(patrn)
    n <- length(exmpl)
    candidate <- seq.int(length=n-m+1)
    for (i in seq.int(length=m)) {
        candidate <- candidate[patrn[i] == exmpl[candidate + i - 1]]
    }
    candidate
  }
  
  occur2 <- function(patrn, exmpl)
  {
    patrn.rev <- rev(patrn)
    w <- embed(exmpl,length(patrn))
    which(apply(w,1,function(r) all(r == patrn.rev)))
  }
  
  occur3 <- function(patrn, exmpl)
  {
    patrn.rev <- rev(patrn)
    w <- embed(exmpl,length(patrn))
    which(rowSums(w == rep(rev(patrn), each=nrow(w))) == ncol(w))
  }
  
  occur4 <- function(patrn, exmpl)
  {
    # requires patrn without duplicates
    n = length(patrn)
    r = rle(diff(match(exmpl, patrn)) == 1L)
    cumsum(r$length)[r$values & r$length == (n - 1L)] - (n - 2L)
  }
  
  occur5 <- function(patrn, exmpl)
  {
    which( sapply( 1:(length(exmpl)-length(patrn)+1), function(i) isTRUE( all.equal( patrn, exmpl[i + 0:(length(patrn)-1) ] ) ) ) )
  }
  
  occur6 <- function(patrn, exmpl)
  {
    indx <- embed(rev(seq_along(exmpl)), length(patrn))
    matches <- apply(indx, 1, function(.indx){
        all(exmpl[.indx] == patrn)
    })
    rev(indx[matches, 1L])
  }
  
  occur7 <- function(patrn, exmpl)
  {
    which(rollapply(exmpl, length(patrn), identical, patrn, fill = FALSE, align = "left"))
  }
  
  library(zoo)
  
  t1 <- system.time( out1 <- occur1(patrn, exmpl) )
  t2 <- system.time( out2 <- occur2(patrn, exmpl) )
  t3 <- system.time( out3 <- occur3(patrn, exmpl) )
  t4 <- system.time( out4 <- occur4(patrn, exmpl) )
  t5 <- system.time( out5 <- occur5(patrn, exmpl) )
  t6 <- system.time( out6 <- occur6(patrn, exmpl) )
  t7 <- system.time( out7 <- occur7(patrn, exmpl) )
  
  print(identical(out1, out2))
  print(identical(out1, out3))
  print(identical(out1, out4))
  print(identical(out1, out5))
  print(identical(out1, out6))
  print(identical(out1, out7))
  print(rbind(t1, t2, t3, t4, t5, t6, t7))

The output was

  [1] TRUE
  [1] TRUE
  [1] TRUE
  [1] TRUE
  [1] TRUE
  [1] TRUE
     user.self sys.self elapsed user.child sys.child
  t1     0.001        0   0.001          0         0
  t2     0.062        0   0.061          0         0
  t3     0.002        0   0.002          0         0
  t4     0.001        0   0.001          0         0
  t5     1.749        0   1.749          0         0
  t6     0.068        0   0.068          0         0
  t7     0.172        0   0.172          0         0

Petr Savicky.
#
On 16-02-2012, at 09:01, Petr Savicky wrote:

            
And by modifying occur5 to this

 occur5 <- function(patrn, exmpl)
 {
   which( sapply( 1:(length(exmpl)-length(patrn)+1), 
          function(i) identical( patrn, exmpl[i + 0:(length(patrn)-1) ] ) ) )
 }


occur5 can be made a lot faster.
On my computer instead of

   user.self sys.self elapsed user.child sys.child
t1     0.001    0.000   0.001          0         0
t2     0.061    0.007   0.068          0         0
t3     0.002    0.001   0.002          0         0
t4     0.001    0.000   0.002          0         0
t5     1.640    0.037   1.677          0         0
t6     0.079    0.004   0.084          0         0
t7     0.256    0.004   0.260          0         0

I got

   user.self sys.self elapsed user.child sys.child
t1     0.000    0.000   0.001          0         0
t2     0.060    0.004   0.065          0         0
t3     0.002    0.001   0.003          0         0
t4     0.001    0.000   0.002          0         0
t5     0.070    0.002   0.071          0         0
t6     0.076    0.000   0.077          0         0
t7     0.246    0.006   0.252          0         0


Berend