Skip to content
Prev 44145 / 398502 Next

RES: [R] AGREP

Hi Henrik,

	Your function is really faster, but I tested it to solve my
problem. And I found it is too time consuming yet for me. This happens
because I need to compare strings from two very large vectors. Bellow I
wrote the syntax I have to use:

##########################

Ls1<-length(s1)
Ls2<-length(s2) 
for ( p in 1:ls1){
   for (q in 1:ls2){
     t1<-levenshteinFast(s1[p],s2[q])
     if (t1<s12[p]){
       s12[p]<-s2[q]
       n12[q]<-t1}
   }
}

#############################

If you want to have na idea, the size of my loop are:

Ls1=42000
Ls2=70000

I think I will wait for months untill this program ends. Do you have any
sugestion to increase the speed? The AGREP function is much faster, and
I am using it, but I don't have a efficient comparation because
AGREP(a,b) is not equal AGREP(b,a). 

Thanks,

Marcos



-----Mensagem original-----
De: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] Em nome de Henrik Bengtsson
Enviada em: quinta-feira, 12 de fevereiro de 2004 13:12
Para: ggrothendieck at myway.com; rodrigo.abt at sii.cl;
r-help at stat.math.ethz.ch
Assunto: RE: [R] AGREP
other
But a very expensive part of the code though is the substr() calls.
Instead of doing this nchar(a)*nchar(b) times it's enough to do it
nchar(a)+nchar(b). Even better is to use strsplit() first as below:

levenshteinFast <- function(s1,s2) {
  # Make sure args are strings
  a <- as.character(s1)
  b <- as.character(s2)

  # Split strings into vectors
  a <- strsplit(a, split="")[[1]]
  b <- strsplit(b, split="")[[1]]
  
  # If one arg is an empty string, returns the length of the other
  an <- length(a)
  bn <- length(b)
  if (an==0) return(bn)
  if (bn==0) return(an)

  # Initialize matrix for calculations
  m <- matrix(0, nrow=an+1, ncol=bn+1)
  m[1,] <- 1:(bn+1)
  m[,1] <- 1:(an+1)

  # Cost calculation - line beginning (substr... is 0-1 cost f'n
  for (i in 2:(an+1)) {
    for (j in 2:(bn+1)) {
      m[i,j] <- min(m[i-1,j  ] + 1,
                    m[i  ,j-1] + 1,
                    m[i-1,j-1] + !identical(a[i-1],b[j-1]))
    }
  }

  # Returns the distance
  m[an+1,bn+1]-1;
} # levenshteinFast()


# Example
N <- 500
s1 <- sample(letters, size=N, replace=TRUE)
s1 <- paste(s1, collapse="")
s2 <- sample(letters, size=N, replace=TRUE)
s2 <- paste(s2, collapse="")

t1 <- system.time(dist1 <- levenshtein(s1,s2))
print(c(t1,dist1))
# [1]  46.83   0.23  54.24     NA     NA 443.00

t2 <- system.time(dist2 <- levenshteinFast(s1,s2))
print(c(t2,dist2))
# [1]  18.82   0.07  20.90     NA     NA 443.00

/Henrik
other
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html