Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Fri, Nov 8, 2013 1:34 AM

Thanks to Thomas, Martin, Jim and William,

Your input was very informative, and thanks for the reference to Sedgwick.

In the end, it does seem to me that all these algorithms require fast 
lookup by ID of nodes to access data, and that conditional on such fast 
lookup, algorithms are possible with efficiency O(n) or O(n*log(n)) 
(depending on whether lookup time is constant or logarithmic). I believe 
my original algorithm achieves that.

We come back to the fact that I assumed that R environments, implemented 
as hash tables, would give me that fast lookup. But on my systems, their 
efficiency (for insert and lookup) seems to degrade fast at several 
million entries. Certainly much faster than either O(1) or O(log(n)). I 
believe this does not have to do with disk access time. For example, I 
tested this on my desktop computer, running a pure hash insert loop. I 
observe 100% processor use but no disk access, as the size of the hash 
table approaches millions of entries.

I have tested this on two systems, but have not gone into the 
implementation of the hashed environments to look at this in details. If 
others have the same (or different) experiences with using hashed 
environments with millions of entries, it would be very useful to know.

Barring a solution to the hashed environment speed, it seems the way to 
speed this algorithm up (within the confines of R) would be to move away 
from hash tables and towards a numerically indexed array.

Thanks again for all of the help,
Magnus

On 11/4/2013 8:20 PM, Thomas Lumley wrote:

On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmorgan at fhcrc.org
<mailto:mtmorgan at fhcrc.org>> wrote:

    On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:

        Sure,

        I was attempting to be concise and boiling it down to what I saw
        as the root
        issue, but you are right, I could have taken it a step further.
        So here goes.

        I have a set of around around 20M string pairs. A given string
        (say, A) can
        either be equivalent to another string (B) or not. If A and B
        occur together in
        the same pair, they are equivalent. But equivalence is
        transitive, so if A and B
        occur together in one pair, and A and C occur together in
        another pair, then A
        and C are also equivalent. I need a way to quickly determine if
        any two strings
        from my data set are equivalent or not.


    Do you mean that if A,B occur together and B,C occur together, then
    A,B and A,C are equivalent?

    Here's a function that returns a unique identifier (not well
    tested!), allowing for transitive relations but not circularity.

          uid <- function(x, y)
         {
             i <- seq_along(x)                   # global index
             xy <- paste0(x, y)                  # make unique identifiers
             idx <- match(xy, xy)

             repeat {
                 ## transitive look-up
                 y_idx <- match(y[idx], x)       # look up 'y' in 'x'
                 keep <- !is.na <http://is.na>(y_idx)
                 if (!any(keep))                 # no transitive
    relations, done!
                     break
                 x[idx[keep]] <- x[y_idx[keep]]
                 y[idx[keep]] <- y[y_idx[keep]]

                 ## create new index of values
                 xy <- paste0(x, y)
                 idx <- match(xy, xy)
             }
             idx
         }

    Values with the same index are identical. Some tests

         > x <- c(1, 2, 3, 4)
         > y <- c(2, 3, 5, 6)
         > uid(x, y)

         [1] 1 1 1 4

         > i <- sample(x); uid(x[i], y[i])

         [1] 1 1 3 1

         > uid(as.character(x), as.character(y))  ## character() ok

         [1] 1 1 1 4

         > uid(1:10, 1 + 1:10)

          [1] 1 1 1 1 1 1 1 1 1 1

         > uid(integer(), integer())

         integer(0)

         > x <- c(1, 2, 3)
         > y <- c(2, 3, 1)
         > uid(x, y)                              ## circular!

           C-c C-c

    I think this will scale well enough, but the worst-case scenario can
    be made to be log(longest chain) and copying can be reduced by using
    an index i and subsetting the original vector on each iteration. I
    think you could test for circularity by checking that the updated x
    are not a permutation of the kept x, all(x[y_idx[keep]] %in% x[keep]))

    Martin



This problem (union-find) is discussed in Chapter 1 of Sedgwick's
"Algorithms".  There's an algorithm given that takes linear time to
build the structure, worst-case logarithmic time to query, and
effectively constant average time to query (inverse-Ackerman amortized
complexity).

    -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland

Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Thread (9 messages)