Skip to content
Prev 332534 / 398502 Next

Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:
Do you mean that if A,B occur together and B,C occur together, then A,B and A,C 
are equivalent?

Here's a function that returns a unique identifier (not well tested!), allowing 
for transitive relations but not circularity.

      uid <- function(x, y)
     {
         i <- seq_along(x)                   # global index
         xy <- paste0(x, y)                  # make unique identifiers
         idx <- match(xy, xy)

         repeat {
             ## transitive look-up
             y_idx <- match(y[idx], x)       # look up 'y' in 'x'
             keep <- !is.na(y_idx)
             if (!any(keep))                 # no transitive relations, done!
                 break
             x[idx[keep]] <- x[y_idx[keep]]
             y[idx[keep]] <- y[y_idx[keep]]

             ## create new index of values
             xy <- paste0(x, y)
             idx <- match(xy, xy)
         }
         idx
     }

Values with the same index are identical. Some tests

     > x <- c(1, 2, 3, 4)
     > y <- c(2, 3, 5, 6)
     > uid(x, y)
     [1] 1 1 1 4
     > i <- sample(x); uid(x[i], y[i])
     [1] 1 1 3 1
     > uid(as.character(x), as.character(y))  ## character() ok
     [1] 1 1 1 4
     > uid(1:10, 1 + 1:10)
      [1] 1 1 1 1 1 1 1 1 1 1
     > uid(integer(), integer())
     integer(0)
     > x <- c(1, 2, 3)
     > y <- c(2, 3, 1)
     > uid(x, y)                              ## circular!
       C-c C-c

I think this will scale well enough, but the worst-case scenario can be made to 
be log(longest chain) and copying can be reduced by using an index i and 
subsetting the original vector on each iteration. I think you could test for 
circularity by checking that the updated x are not a permutation of the kept x, 
all(x[y_idx[keep]] %in% x[keep]))

Martin