Converting unique strings to unique numbers

Hi Sarah,
On 05/29/2015 12:04 PM, Sarah Goslee wrote:
That's an unfair comparison, because you already know what the levels
are, so you can supply them to your call to factor(). Most of the time
you don't know what the levels are, so either you just do factor(x) and
let the factor() constructor compute the levels for you, or you compute
them yourself upfront with something like factor(x, levels=unique(x)).

   library(microbenchmark)

   microbenchmark(
     {ids1 <- match(x, x)},
     {ids2 <- as.integer(factor(x, levels=letters))},
     {ids3 <- as.integer(factor(x))},
     {ids4 <- as.integer(factor(x, levels=unique(x)))}
   )
   Unit: microseconds
                                                       expr     min       lq
                                {     ids1 <- match(x, x) } 245.979 262.2390
    {     ids2 <- as.integer(factor(x, levels = letters)) } 214.115 219.2320
                      {     ids3 <- as.integer(factor(x)) } 380.782 388.7295
  {     ids4 <- as.integer(factor(x, levels = unique(x))) } 332.250 342.6630
        mean   median      uq     max neval
    267.3210 264.4845 268.348 293.894   100
    226.9913 220.9870 226.147 314.875   100
    402.2242 394.7165 412.075 481.410   100
    349.7405 345.3090 353.162 383.002   100

I'm not sure why it would matter which particular ID gets assigned to
each string, but maybe I'm missing something. What really matters is that
each distinct string receives its own ID. match(x, x) does that.
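
To make the difference between these approaches concrete, here is a small sketch on a made-up vector (the data and the variable names ids_match, ids_sorted and ids_seen are only for illustration, they are not from the benchmark above). All of them give equal IDs to equal strings; only the numbering differs:

```r
x <- c("b", "a", "b", "c", "a")   # made-up example data

# ID = position of the first occurrence of each string in x
ids_match <- match(x, x)
# 1 2 1 4 2

# ID = rank of the string among the alphabetically sorted levels
ids_sorted <- as.integer(factor(x))
# 2 1 2 3 1

# ID = order in which the string first appears in x
ids_seen <- as.integer(factor(x, levels = unique(x)))
# 1 2 1 3 2
```

Note that match(x, x) does not produce consecutive IDs (there is no 3 above). If that matters, match(x, unique(x)) gives the same result as ids_seen without going through a factor.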

In Kate's problem, where the strings are in more than one column,
and you want the ID to be unique across the columns, you need to do
match(x, x) where 'x' contains the strings from all the columns
that you want to replace:

   m <- matrix(c(
     "X0001", "BYX859",        0,        0,  2,  1, "BYX859",
     "X0001", "BYX894",        0,        0,  1,  1, "BYX894",
     "X0001", "BYX862", "BYX894", "BYX859",  2,  2, "BYX862",
     "X0001", "BYX863", "BYX894", "BYX859",  2,  2, "BYX863",
     "X0001", "BYX864", "BYX894", "BYX859",  2,  2, "BYX864",
     "X0001", "BYX865", "BYX894", "BYX859",  2,  2, "BYX865"
   ), ncol=7, byrow=TRUE)

   x <- m[ , 2:4]
   id <- match(x, x, nomatch=0, incomparables="0")
   m[ , 2:4] <- id

No factor needed. No loop needed. ;-)
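
One thing to keep in mind: 'm' is a character matrix, so the IDs assigned back into it are stored as strings ("1", "2", ...). If integer IDs are wanted, the result of match() can be reshaped into its own matrix instead. A minimal sketch, using a small hypothetical matrix where "0" stands for a missing value (the strings "A1", "B2", "C3" and the name 'ids' are made up for this example):

```r
# Hypothetical 3 x 3 character matrix; "0" marks a missing entry.
x <- matrix(c("A1", "0",  "0",
              "B2", "0",  "0",
              "C3", "B2", "A1"), ncol = 3, byrow = TRUE)

# match() drops the dimensions and walks the matrix column by column.
# Every "0" is treated as incomparable and gets the nomatch value 0.
id <- match(x, x, nomatch = 0L, incomparables = "0")
# 1 2 3 0 0 2 0 0 1

# Restore the original shape to get an integer matrix of IDs.
ids <- matrix(id, nrow = nrow(x))
```

Here "B2" and "A1" in the third row get the same IDs (2 and 1) as in the first column, which is the whole point of matching across all the columns at once.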

Cheers,
H.