Converting unique strings to unique numbers

On Fri, May 29, 2015 at 2:16 PM, Hervé Pagès <hpages at fredhutch.org> wrote:
Hm. I hadn't thought of that approach - I use the
as.numeric(factor(...)) approach.

So I was curious, and compared the two:


set.seed(43)
x <- sample(letters, 10000, replace=TRUE)

system.time({
  for (i in seq_len(20000)) {
    ids1 <- match(x, x)
  }
})

#   user  system elapsed
#  9.657   0.000   9.657

system.time({
  for (i in seq_len(20000)) {
    ids2 <- as.numeric(factor(x, levels=letters))
  }
})

#   user  system elapsed
#   6.16    0.00    6.16

In this test, using factor() is faster (about 6.2 s versus 9.7 s
elapsed). More importantly, using factor() lets you set the order of
the indices in an expected fashion, whereas match() assigns them in
the order of occurrence.

head(data.frame(x, ids1, ids2))

  x ids1 ids2
1 m    1   13
2 x    2   24
3 b    3    2
4 s    4   19
5 i    5    9
6 o    6   15
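
A smaller vector makes the difference easier to see. This is only a
minimal sketch, not part of the timings above; the vector s and the
names first_pos, dense_ids and alpha_ids are made up for illustration.

```r
# Hypothetical toy input, chosen so that one value repeats.
s <- c("b", "a", "b", "c")

# match(s, s) returns the position of the first occurrence of each value,
# so the ids are unique per string but not necessarily consecutive.
first_pos <- match(s, s)

# Matching against unique(s) gives consecutive ids in order of occurrence.
dense_ids <- match(s, unique(s))

# Fixing the levels gives ids that do not depend on the order of the data.
alpha_ids <- as.numeric(factor(s, levels = letters))

print(data.frame(s, first_pos, dense_ids, alpha_ids))
```

Here first_pos is 1 2 1 4, dense_ids is 1 2 1 3, and alpha_ids is
2 1 2 3.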

In a problem like Kate's where there are several columns for which the
same ordering of indices is desired, that becomes really important.

If you take Bill Dunlap's modification of the match() approach, it
resolves both problems: matching against the pooled unique values is
both faster than the factor() version and gives the same result:

On Fri, May 29, 2015 at 1:31 PM, William Dunlap <wdunlap at tibco.com> wrote:
f <- function (data)
{
    uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
    uniqStrings <- setdiff(uniqStrings, "0")
    for (j in 2:4) {
        data[[j]] <- match(data[[j]], uniqStrings, nomatch = 0L)
    }
    data
}

##

y <- data.frame(id = 1:5000,
                v1 = sample(letters, 5000, replace=TRUE),
                v2 = sample(letters, 5000, replace=TRUE),
                v3 = sample(letters, 5000, replace=TRUE),
                stringsAsFactors=FALSE)


system.time({
  for(i in seq_len(20000)) {
    ids3 <- f(data.frame(y))
}})

#   user  system elapsed
# 22.515   0.000  22.518



ff <- function(data)
{
    uniqStrings <- unique(c(data[, 2], data[, 3], data[, 4]))
    uniqStrings <- setdiff(uniqStrings, "0")
    for (j in 2:4) {
        data[[j]] <- as.numeric(factor(data[[j]], levels=uniqStrings))
    }
    data
}

system.time({
  for(i in seq_len(20000)) {
    ids4 <- ff(data.frame(y))
}})

#    user  system elapsed
#  26.083   0.002  26.090

head(ids3)

  id v1 v2 v3
1  1  1  2  8
2  2  2 19 22
3  3  3 21 16
4  4  4 10 17
5  5  1  8 18
6  6  1 12 26

head(ids4)

  id v1 v2 v3
1  1  1  2  8
2  2  2 19 22
3  3  3 21 16
4  4  4 10 17
5  5  1  8 18
6  6  1 12 26
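
The two versions agree here because every value in y is one of the
pooled strings. A minimal sketch of where they would differ, using a
made-up vector v and lookup table lev: match() with nomatch = 0L
returns 0 for a value outside the table, while factor() returns NA,
and match() returns integers where as.numeric() returns doubles.

```r
# Hypothetical values; "0" plays the role of the placeholder that
# setdiff() removes from uniqStrings in f() and ff().
v   <- c("a", "0", "b", "a")
lev <- setdiff(unique(v), "0")   # "a" "b"

# match() maps the unmatched "0" to 0 because of nomatch = 0L.
by_match <- match(v, lev, nomatch = 0L)

# factor() maps values outside the levels to NA instead.
by_factor <- as.numeric(factor(v, levels = lev))

print(by_match)
print(by_factor)

# The storage types differ as well.
print(typeof(by_match))
print(typeof(by_factor))
```

If the data contain the "0" placeholder, the choice between the two
versions therefore also decides whether those cells come back as 0 or
as NA.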

Kate, if you're getting all zeros, check str(yourdataframe) - it's
likely that when you imported your data into R the strings were
already converted to factors, which is not what you want (ask me how I
know this!).
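
A minimal sketch of that check, assuming a made-up data frame d built
with stringsAsFactors = TRUE to mimic an import that converted the
strings to factors (the default before R 4.0.0). The column names are
hypothetical.

```r
# Hypothetical data frame standing in for imported data.
d <- data.frame(id = 1:3,
                v1 = c("x", "y", "x"),
                v2 = c("y", "z", "z"),
                stringsAsFactors = TRUE)

str(d)  # v1 and v2 are reported as Factor, not chr

# Convert any factor columns back to character before matching.
is_fac <- vapply(d, is.factor, logical(1))
d[is_fac] <- lapply(d[is_fac], as.character)

str(d)  # v1 and v2 are now chr
```

Passing stringsAsFactors = FALSE to read.table() or read.csv() at
import time avoids the conversion in the first place.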

Sarah