Skip to content
Prev 44118 / 398502 Next

AGREP

One could shorten it slightly with these minor improvements.  Unfortunately, the key performance problem, the double loop 
at the end which implements the dynamic programming calculation, 
is still there.

levenshtein<-function(s1,s2) {
     # Make sure args are strings
     a <- as.character(s1); an <- nchar(s1)+1
     b <- as.character(s2); bn <- nchar(s2)+1

     # If one arg is an empty string, returns the length of the other
     if (nchar(a)==0) return(nchar(b))
     if (nchar(b)==0) return(nchar(a))

     # Initialize matrix for calculations
     m <- matrix(0, nrow=an, ncol=bn)
     m[1,] <- 1:bn
     m[,1] <- 1:an 

     # Cost calculation - line beginning (substr... is 0-1 cost f'n
     for (i in 2:an) 
          for (j in 2:bn) 
		  m[i,j] <- min( m[i-1,j]+1, m[i,j-1]+1, m[i-1,j-1]+
		       (substr(a,i-1,i-1)!=substr(b,j-1,j-1)) ) 

     # Returns the distance
     m[an,bn]-1
}

---
Date:   Thu, 12 Feb 2004 11:01:19 -0300 
From:   Rodrigo Abt <rodrigo.abt at sii.cl>
To:   'Lista de Correo de R' <r-help at stat.math.ethz.ch> 
Subject:   Re: [R] AGREP 

 
"Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br> writes:
I guess you're looking for Levenshtein distance, so try this:

levenshtein<-function(s1,s2) {
     # Make sure args are strings
     a<-as.character(s1);an=nchar(s1)+1
     b<-as.character(s2);bn=nchar(s2)+1

     # Initialize matrix for calculations
     m<-matrix(nrow=an,ncol=bn)

     # If one arg is an empty string, returns the length of the other
     if (nchar(a)==0)
          return(nchar(b))
     if (nchar(b)==0)
          return(nchar(a))

     # Matrix initialization
     for (i in 1:an) {
          for (j in 1:bn) {
               m[i,j]<-0
               m[1,j]<-j
          }
          m[i,1]<-i
     }

     # Cost calculation
     for (i in 2:an) {
          for (j in 2:bn) {
               if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
                    cost<-0
               else
                    cost<-1
          m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost)
          }
     }
     # Returns the distance
     m[an,bn]-1
}

Examples:
[1] 1
and one substitution
[1] 3

Note that this function IS case sensitive. If you want to apply this on
vectors of strings you'll have to write the
corresponding wrapper function.

Hope that helps,

Rodrigo Abt B,
Analyst,
Dept. Economic Studies,
SII, Chile.