How to find frequent sequences.
4 messages · Vineet Shukla, Petr Savicky
On Thu, Jul 12, 2012 at 03:51:54PM -0500, Vineet Shukla wrote:
I have independent event sequences, for example as follows:
Independent event sequence 1: A, B, C, D
Independent event sequence 2: A, C, B
Independent event sequence 3: D, A, B, X, Y, Z
Independent event sequence 4: C, A, A, B
Independent event sequence 5: B, A, D
I want to be able to find the most common sequence patterns, such as
{A, B} => 3
from lines 1, 3, 5.
Please note that A, C, B must not be counted, because C comes in between,
and line 5 must not be counted either, because the order of A and B is reversed.
Hi.
If I understand correctly, the first sequence contains the patterns
AB, BC, CD.
Using this interpretation, AB occurs at lines 1, 3, 4 and not 1, 3, 5.
Is this correct?
If a sequence contains several occurrences of a pattern, for example,
the sequence
A, B, A, B
contains AB twice, is the pattern then counted only once?
If this is correct, then try the following:
# your input list
lst <- list(
    c("A", "B", "C", "D"),
    c("A", "C", "B"),
    c("D", "A", "B", "X", "Y", "Z"),
    c("C", "A", "A", "B"),
    c("B", "A", "D"))
# extract unique patterns from a single sequence as rows of a matrix
# lpattern is the length of the patterns
singleSeq <- function(x, lpattern)
{
    unique(embed(rev(x), lpattern))
}
lst1 <- lapply(lst, singleSeq, lpattern=2)
# combine the matrices to a single matrix
mat <- do.call(rbind, lst1)
# convert the patterns to strings
pat <- do.call(paste, c(data.frame(mat), sep=""))
out <- table(pat)
out
pat
AA AB AC AD BA BC BX CA CB CD DA XY YZ
 1  3  1  1  1  1  1  1  1  1  1  1  1
names(out)[which.max(out)]
[1] "AB"
Hope this helps.
Petr Savicky.
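The original question also asked which lines contain the most frequent pattern. A minimal sketch of one way to recover them is given below; it reuses the embed(rev(x), k) idea from the code above. The helper name hasPattern is hypothetical and not part of the original reply, and the guard for sequences shorter than the pattern is an added assumption, since embed() stops with an error in that case.

```r
# input sequences, as in the message above
lst <- list(
    c("A", "B", "C", "D"),
    c("A", "C", "B"),
    c("D", "A", "B", "X", "Y", "Z"),
    c("C", "A", "A", "B"),
    c("B", "A", "D"))

# hypothetical helper: TRUE if x contains 'pattern' as adjacent elements
hasPattern <- function(x, pattern)
{
    k <- length(pattern)
    # embed() fails if the sequence is shorter than the pattern
    if (length(x) < k) return(FALSE)
    # rows of embed(rev(x), k) are the length-k windows of x in original order
    windows <- embed(rev(x), k)
    any(apply(windows, 1, function(w) all(w == pattern)))
}

linesWithAB <- which(sapply(lst, hasPattern, pattern = c("A", "B")))
linesWithAB
# [1] 1 3 4
```

This agrees with the interpretation discussed above: AB occurs at lines 1, 3, 4, while line 2 (A, C, B) and line 5 (B, A, D) are excluded.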
On Fri, Jul 13, 2012 at 02:38:57PM -0500, Vineet Shukla wrote:
Hi Petr,
Yes, that's really very helpful.
Petr: Using this interpretation, AB occurs at lines 1, 3, 4 and not 1, 3, 5. Is this correct?
Vineet: Yes, that's right, sorry for the typo.
Petr: If a sequence contains several occurrences of a pattern, for example, the sequence A, B, A, B contains AB twice, is the pattern then counted only once?
Vineet: What needs to be done if I would like to count it as many times as it occurred? Should I just remove the call to unique() from "unique(embed(rev(x), lpattern))"?
Hi.
Yes. Without unique(), the matrix embed(rev(x), lpattern) contains all occurrences in one sequence, and the final result will be the sum of the numbers of occurrences over all sequences.
Petr.
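To make the difference concrete, here is a small sketch comparing the two variants on the sequence A, B, A, B from the thread. The names allSeq and countPatterns and the argument extract are hypothetical and introduced only for this illustration. The filter that skips sequences shorter than the pattern is an added assumption, not something the original code does.

```r
# original extractor: each pattern is counted at most once per sequence
singleSeq <- function(x, lpattern) unique(embed(rev(x), lpattern))

# variant without unique(): every occurrence is kept
allSeq <- function(x, lpattern) embed(rev(x), lpattern)

# hypothetical wrapper: tabulate patterns over a list of sequences
countPatterns <- function(lst, lpattern, extract)
{
    # skip sequences shorter than the pattern (embed() would stop with an error)
    lst <- lst[sapply(lst, length) >= lpattern]
    mat <- do.call(rbind, lapply(lst, extract, lpattern = lpattern))
    table(do.call(paste, c(data.frame(mat), sep = "")))
}

# A, B, A, B contains the windows AB, BA, AB
ex <- list(c("A", "B", "A", "B"))
once  <- countPatterns(ex, 2, singleSeq)
every <- countPatterns(ex, 2, allSeq)
as.vector(once["AB"])   # 1
as.vector(every["AB"])  # 2
```

For the five example sequences in this thread, no sequence repeats a pair, so both variants should give the same table there.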