Hi Henrik,
Your function is really faster, but I tested it to solve my
problem. And I found it is too time consuming yet for me. This happens
because I need to compare strings from two very large vectors. Bellow I
wrote the syntax I have to use:
##########################
Ls1<-length(s1)
Ls2<-length(s2)
for ( p in 1:ls1){
for (q in 1:ls2){
t1<-levenshteinFast(s1[p],s2[q])
if (t1<s12[p]){
s12[p]<-s2[q]
n12[q]<-t1}
}
}
#############################
If you want to have na idea, the size of my loop are:
Ls1=42000
Ls2=70000
I think I will wait for months untill this program ends. Do you have any
sugestion to increase the speed? The AGREP function is much faster, and
I am using it, but I don't have a efficient comparation because
AGREP(a,b) is not equal AGREP(b,a).
Thanks,
Marcos
-----Mensagem original-----
De: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] Em nome de Henrik Bengtsson
Enviada em: quinta-feira, 12 de fevereiro de 2004 13:12
Para: ggrothendieck at myway.com; rodrigo.abt at sii.cl;
r-help at stat.math.ethz.ch
Assunto: RE: [R] AGREP
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Gabor
Grothendieck
Sent: den 12 februari 2004 16:07
To: rodrigo.abt at sii.cl; r-help at stat.math.ethz.ch
Subject: Re: [R] AGREP
One could shorten it slightly with these minor improvements.
Unfortunately, the key performance problem, the double loop
at the end which implements the dynamic programming calculation,
is still there.
levenshtein<-function(s1,s2) {
# Make sure args are strings
a <- as.character(s1); an <- nchar(s1)+1
b <- as.character(s2); bn <- nchar(s2)+1
# If one arg is an empty string, returns the length of the
other
if (nchar(a)==0) return(nchar(b))
if (nchar(b)==0) return(nchar(a))
# Initialize matrix for calculations
m <- matrix(0, nrow=an, ncol=bn)
m[1,] <- 1:bn
m[,1] <- 1:an
# Cost calculation - line beginning (substr... is 0-1 cost f'n
for (i in 2:an)
for (j in 2:bn)
m[i,j] <- min( m[i-1,j]+1, m[i,j-1]+1, m[i-1,j-1]+
(substr(a,i-1,i-1)!=substr(b,j-1,j-1)) )
# Returns the distance
m[an,bn]-1
}
But a very expensive part of the code though is the substr() calls.
Instead of doing this nchar(a)*nchar(b) times it's enough to do it
nchar(a)+nchar(b). Even better is to use strsplit() first as below:
levenshteinFast <- function(s1,s2) {
# Make sure args are strings
a <- as.character(s1)
b <- as.character(s2)
# Split strings into vectors
a <- strsplit(a, split="")[[1]]
b <- strsplit(b, split="")[[1]]
# If one arg is an empty string, returns the length of the other
an <- length(a)
bn <- length(b)
if (an==0) return(bn)
if (bn==0) return(an)
# Initialize matrix for calculations
m <- matrix(0, nrow=an+1, ncol=bn+1)
m[1,] <- 1:(bn+1)
m[,1] <- 1:(an+1)
# Cost calculation - line beginning (substr... is 0-1 cost f'n
for (i in 2:(an+1)) {
for (j in 2:(bn+1)) {
m[i,j] <- min(m[i-1,j ] + 1,
m[i ,j-1] + 1,
m[i-1,j-1] + !identical(a[i-1],b[j-1]))
}
}
# Returns the distance
m[an+1,bn+1]-1;
} # levenshteinFast()
# Example
N <- 500
s1 <- sample(letters, size=N, replace=TRUE)
s1 <- paste(s1, collapse="")
s2 <- sample(letters, size=N, replace=TRUE)
s2 <- paste(s2, collapse="")
t1 <- system.time(dist1 <- levenshtein(s1,s2))
print(c(t1,dist1))
# [1] 46.83 0.23 54.24 NA NA 443.00
t2 <- system.time(dist2 <- levenshteinFast(s1,s2))
print(c(t2,dist2))
# [1] 18.82 0.07 20.90 NA NA 443.00
/Henrik
---
Date: Thu, 12 Feb 2004 11:01:19 -0300
From: Rodrigo Abt <rodrigo.abt at sii.cl>
To: 'Lista de Correo de R' <r-help at stat.math.ethz.ch>
Subject: Re: [R] AGREP
"Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br> writes:
Hi listers
If you don't know what is the Edit Distance beetwen two
strings, I will
try to explain, in fact it is very simple to understund but not to
calculate througth a program. It is simplilly the minimum number of
operations you must perform to transform string A on string B, by
operations I mean delete letters, insert letters or
substitute letter.
If you need to do few operations, it means string A is
almost the same
as string B. Otherwise they are more differente as the number of
operations increase.
If you have a idea of how to make a function to calculate this
distance, it would be very important for me.
Thanks very much,
Marcos
I guess you're looking for Levenshtein distance, so try this:
levenshtein<-function(s1,s2) {
# Make sure args are strings
a<-as.character(s1);an=nchar(s1)+1
b<-as.character(s2);bn=nchar(s2)+1
# Initialize matrix for calculations
m<-matrix(nrow=an,ncol=bn)
# If one arg is an empty string, returns the length of the
other
if (nchar(a)==0)
return(nchar(b))
if (nchar(b)==0)
return(nchar(a))
# Matrix initialization
for (i in 1:an) {
for (j in 1:bn) {
m[i,j]<-0
m[1,j]<-j
}
m[i,1]<-i
}
# Cost calculation
for (i in 2:an) {
for (j in 2:bn) {
if (substr(a,i-1,i-1)==substr(b,j-1,j-1))
cost<-0
else
cost<-1
m[i,j]=min(m[i-1,j]+1,m[i,j-1]+1,m[i-1,j-1]+cost)
}
}
# Returns the distance
m[an,bn]-1
}
Examples:
levenshtein("Great","Grreat") <-- One addition
[1] 1
levenshtein("mahrcoz","Marcos") <-- One substitution,one deletion
and one substitution
[1] 3
Note that this function IS case sensitive. If you want to
apply this on vectors of strings you'll have to write the
corresponding wrapper function.
Hope that helps,
Rodrigo Abt B,
Analyst,
Dept. Economic Studies,
SII, Chile.