How to find frequent sequences.

4 messages · Vineet Shukla, Petr Savicky

#
On Thu, Jul 12, 2012 at 03:51:54PM -0500, Vineet Shukla wrote:
Hi.

If I understand correctly, the first sequence contains the patterns

  AB, BC, CD.

Using this interpretation, AB occurs at lines 1,3,4 and not 1,3,5.
Is this correct?

If some sequence contains several occurrences of a pattern, for example,
the sequence

   A, B, A, B

contains AB twice, is the pattern then counted only once for that sequence?

If this is correct, then try the following

  # your input list
  lst <- list(
  c("A", "B", "C", "D"),
  c("A", "C", "B"),
  c("D", "A", "B", "X", "Y", "Z"),
  c("C", "A", "A", "B"),
  c("B", "A", "D"))

  # extract unique patterns from a single sequence as rows of a matrix 
  # lpattern is the length of the patterns
  singleSeq <- function(x, lpattern)
  {
      unique(embed(rev(x), lpattern))
  }
 
  lst1 <- lapply(lst, singleSeq, lpattern=2)
  # combine the matrices to a single matrix
  mat <- do.call(rbind, lst1)
  # convert the patterns to strings
  pat <- do.call(paste, c(data.frame(mat), sep=""))
  out <- table(pat)
  out

  pat
  AA AB AC AD BA BC BX CA CB CD DA XY YZ 
   1  3  1  1  1  1  1  1  1  1  1  1  1 

  names(out)[which.max(out)]

  [1] "AB"
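
The steps above can be wrapped into one function so that other pattern lengths are easy to try. This is only a sketch: the name countPatterns is made up for illustration, and dropping sequences shorter than lpattern is my assumption about what is wanted (embed() stops with an error when the requested dimension exceeds the length of the vector).

```r
# Hypothetical helper, not part of any package: for each pattern of
# length lpattern, count the number of sequences that contain it.
countPatterns <- function(lst, lpattern)
{
    # embed() fails if lpattern > length(x), so drop short sequences
    # (assumption: a too-short sequence simply contributes no patterns)
    lst <- Filter(function(x) length(x) >= lpattern, lst)
    # unique patterns of each sequence as rows of a matrix
    rows <- lapply(lst, function(x) unique(embed(rev(x), lpattern)))
    mat <- do.call(rbind, rows)
    # convert the patterns to strings and tabulate them
    pat <- do.call(paste, c(data.frame(mat), sep=""))
    table(pat)
}

lst <- list(
c("A", "B", "C", "D"),
c("A", "C", "B"),
c("D", "A", "B", "X", "Y", "Z"),
c("C", "A", "A", "B"),
c("B", "A", "D"))

out2 <- countPatterns(lst, 2)
# the extra sequence "A" is too short for patterns of length 3 and is skipped
out3 <- countPatterns(c(lst, list("A")), 3)

# which.max() reports only the first maximum; this keeps all ties
top2 <- names(out2)[out2 == max(out2)]
top3 <- names(out3)[out3 == max(out3)]
```

Note that which.max() returns only the first position of the maximum, so if several patterns share the highest count, the comparison with max(out) is the safer way to list all of them.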

Hope this helps.

Petr Savicky.
#
On Fri, Jul 13, 2012 at 02:38:57PM -0500, Vineet Shukla wrote:
Hi.

Yes. Without unique() the matrix embed(rev(x), lpattern) contains all occurrences
in one sequence and the final result will be the sum of the numbers of
occurrences in all sequences.
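
A small illustration of the difference, using the sequence A, B, A, B from the earlier message. The helper name countRows is made up for this example.

```r
# sequence that contains the pattern AB twice
x <- c("A", "B", "A", "B")

# with unique(): each distinct pattern is kept once per sequence
withUnique <- unique(embed(rev(x), 2))
# without unique(): every occurrence is kept
withoutUnique <- embed(rev(x), 2)

# hypothetical helper: paste the two columns together and tabulate
countRows <- function(m) table(paste(m[, 1], m[, 2], sep=""))

countRows(withUnique)
countRows(withoutUnique)
```

With unique(), AB contributes 1 for this sequence. Without it, AB contributes 2.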

Petr.