please comment on my function

4 messages · Sam Steingold, jim holtman

#
this function is supposed to canonicalize the language:

--8<---------------cut here---------------start------------->8---
canonicalize.language <- function (s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$","\\1",s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}
canonicalize.language(c("aa","bb-cc","DD-abc","eee","ff_FF","C"))
[1] "aa"      "bb"      "unknown" "unknown" "ff"      "c"  
--8<---------------cut here---------------end--------------->8---

it does what I want it to do, but it takes 4.5 seconds on a vector of
length 10,256,341 - I wonder if I might be doing something awfully stupid.
I thought that sub() was slow, but my second attempt:
--8<---------------cut here---------------start------------->8---
canonicalize.language <- function (s) {
  s <- tolower(s)
  good <- nchar(s) == 5 & substr(s,3,3) %in% c("_","-")
  s[good] <- substr(s[good],1,2)
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}
--8<---------------cut here---------------end--------------->8---
was even slower (6.4 sec).
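
For anyone wanting to reproduce this kind of comparison, here is a minimal sketch using system.time(). The names canon.sub and canon.substr are hypothetical labels for the two variants above, and the test vector is just the six-element example recycled to 100,000 elements, so the absolute timings will not match the figures quoted in this message.

```r
# Hypothetical names for the two variants quoted above.
canon.sub <- function(s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}
canon.substr <- function(s) {
  s <- tolower(s)
  good <- nchar(s) == 5 & substr(s, 3, 3) %in% c("_", "-")
  s[good] <- substr(s[good], 1, 2)
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

# Assumed test data: the example vector recycled to 1e5 elements.
x <- rep(c("aa", "bb-cc", "DD-abc", "eee", "ff_FF", "C"), length.out = 1e5)

# Time each variant once and keep the results for comparison.
t.sub    <- system.time(r.sub    <- canon.sub(x))
t.substr <- system.time(r.substr <- canon.substr(x))

# Elapsed seconds for each variant (values depend on the machine).
print(c(sub = t.sub[["elapsed"]], substr = t.substr[["elapsed"]]))

# Both variants agree on this particular input.
print(identical(r.sub, r.substr))
```

Note that the two variants are not equivalent in general: the substr() version only checks the separator in position 3, while the sub() version also requires letters on both sides of it.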

My two concerns are:

1. avoid allocating many small objects which are never collected
2. run fast

Which would be the best implementation?

Thanks a lot for your insight!
#
First thing to do is to run Rprof and see where the time is going;
here it is from my computer:

                      self.time self.pct total.time total.pct
tolower                    4.42    39.46       4.42     39.46
sub                        3.56    31.79       3.56     31.79
nchar                      1.54    13.75       1.54     13.75
canonicalize.language      0.62     5.54      11.14     99.46
!=                         0.52     4.64       0.52      4.64
==                         0.26     2.32       0.26      2.32
&                          0.22     1.96       0.22      1.96
gc                         0.06     0.54       0.06      0.54

More than half the time (about 53%) is in 'tolower' and 'nchar', so it
is not all 'sub's problem.
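
A table like the one above can be produced roughly as follows. This is only a sketch: the input vector is an assumption (the six-element example recycled to a million elements), and the numbers will differ from machine to machine.

```r
# The original function from the first message.
canonicalize.language <- function(s) {
  s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([a-z]{2})[-_][a-z]{2}$", "\\1", s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}

# Assumed test data, large enough for the sampling profiler to see something.
x <- rep(c("aa", "bb-cc", "DD-abc", "eee", "ff_FF", "C"), length.out = 1e6)

prof.file <- tempfile()            # where the profiler writes its samples
Rprof(prof.file, interval = 0.01)  # start profiling, sample every 10 ms
res <- canonicalize.language(x)
Rprof(NULL)                        # stop profiling

# Per-function breakdown, sorted by self time.
print(summaryRprof(prof.file)$by.self)
```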

This version runs a little faster since it does not need the 'tolower':

canonicalize.language <- function (s) {
  # s <- tolower(s)
  long <- nchar(s) == 5
  s[long] <- sub("^([[:alpha:]]{2})[-_][[:alpha:]]{2}$","\\1",s[long])
  s[nchar(s) != 2 & s != "c"] <- "unknown"
  s
}
On Fri, Sep 14, 2012 at 12:30 PM, Sam Steingold <sds at gnu.org> wrote:

  
    
#
aha, thanks!
but it does not convert "EN" to "en", so it is not good for my purposes.
#
You can always convert to lower case afterwards, probably on a shorter
vector.  You did not indicate that you needed that conversion; it only
looked like you did it for the regular expression.
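
One possible reading of this suggestion, as a sketch only: do the matching case-insensitively with [[:alpha:]], mark the rejects as "unknown" first, and call tolower() only on the entries that survive. The name canonicalize.language.late is hypothetical, and the sketch assumes that an upper-case "C" should still map to "c", as in the original function. Whether this is actually faster depends on how many entries get rejected; it has not been timed here.

```r
# Hypothetical variant: lower-case only the entries that are kept.
canonicalize.language.late <- function(s) {
  long <- nchar(s) == 5
  # [[:alpha:]] matches both cases, so no tolower() is needed before sub()
  s[long] <- sub("^([[:alpha:]]{2})[-_][[:alpha:]]{2}$", "\\1", s[long])
  # keep two-character codes and the special value "C"/"c"
  keep <- nchar(s) == 2 | s %in% c("c", "C")
  s[!keep] <- "unknown"
  # tolower() now runs on the kept subset only
  s[keep] <- tolower(s[keep])
  s
}

res.late <- canonicalize.language.late(
  c("aa", "bb-cc", "DD-abc", "eee", "ff_FF", "C", "EN"))
print(res.late)
```

Unlike [a-z], the class [[:alpha:]] is locale-dependent, so the two patterns may accept slightly different inputs.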
On Fri, Sep 14, 2012 at 3:13 PM, Sam Steingold <sds at gnu.org> wrote: