match and unique

Hi Terry,
On 03/16/2016 08:03 AM, Therneau, Terry M., Ph.D. wrote:
This is assuming that match() and unique() will never disagree on
equality between 2 floating point values. I believe they share some
code internally (same hashing routine?), so maybe it's reliable.

Anyway, it's always preferable to not rely on this kind of assumption.

A safer thing to do is to use rank():

   r <- rank(x, ties.method="min")  # could use "max"

Think of 'r' as a unique ID assigned to each value in 'x'. This ID takes
its values in the (1,length(x)) range but we want it to take its values
in the (1,length(unique(x))) range:

   ID_remapping <- cumsum(tabulate(r, nbins=length(r)) != 0L)
   index <- ID_remapping[r]

'index' will be the same as 'match(x, sort(unique(x)))' but doesn't rely
on the assumption that match() and unique() agree on equality between
2 floating point values.
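
To make the remapping concrete, here is a small worked example of the rank() approach. The toy vector 'x' is made up for illustration; everything else is the same code as above, and the values in the comments were worked out by hand.

```r
x <- c(0.5, 0.2, 0.5, 0.9, 0.2)   # toy input, for illustration only

r <- rank(x, ties.method="min")
r
# [1] 3 1 3 5 1

# tabulate(r, nbins=length(r)) is 2 0 2 0 1: ranks 2 and 4 are unused.
# The cumulative count of used ranks gives the compact IDs.
ID_remapping <- cumsum(tabulate(r, nbins=length(r)) != 0L)
ID_remapping
# [1] 1 1 2 2 3

index <- ID_remapping[r]
index
# [1] 2 1 2 3 1
```

On this input the result agrees with 'match(x, sort(unique(x)))'.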

Unfortunately rank() is very slow, much slower than sort(). Here is a
faster solution based on sort.list(x, na.last=NA, method="quick"):

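   assignID <- function(x)
   {
     ## Permutation that sorts 'x' (method="quick" requires numeric 'x').
     oo <- sort.list(x, na.last=NA, method="quick")
     sorted <- x[oo]
     ## TRUE where a sorted value differs from its predecessor.
     is_unique <- c(TRUE, sorted[-length(sorted)] != sorted[-1L])
     ## Running count of distinct values = ID of each sorted value.
     sorted_ID <- cumsum(is_unique)
     ## Put the IDs back in the original order of 'x'.
     ID <- integer(length(x))
     ID[oo] <- sorted_ID
     ID
   }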
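
For illustration, here are the same steps run one at a time on a small made-up vector (no NAs assumed). 'oo' itself is not shown because method="quick" is not a stable sort, so the order of tied elements in it is not guaranteed; the final IDs do not depend on that order.

```r
x <- c(0.5, 0.2, 0.5, 0.9, 0.2)   # toy input, for illustration only

oo <- sort.list(x, na.last=NA, method="quick")
sorted <- x[oo]
sorted
# [1] 0.2 0.2 0.5 0.5 0.9

# Compare each sorted value with the one before it.
is_unique <- c(TRUE, sorted[-length(sorted)] != sorted[-1L])
is_unique
# [1]  TRUE FALSE  TRUE FALSE  TRUE

sorted_ID <- cumsum(is_unique)
sorted_ID
# [1] 1 1 2 2 3

# Scatter the IDs back to the original positions.
ID <- integer(length(x))
ID[oo] <- sorted_ID
ID
# [1] 2 1 2 3 1
```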

'assignID(x)' is also faster than 'match(x, sort(unique(x)))', by roughly
a factor of 3 in this test:

   x <- runif(5000000)
   system.time(index1 <- match(x, sort(unique(x))))
   #   user  system elapsed
   #  2.170   0.552   2.725

   system.time(index2 <- assignID(x))
   #   user  system elapsed
   #  0.885   0.032   0.917

   identical(index1, index2)
   # [1] TRUE

Cheers,
H.