Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days
Thanks to Thomas, Martin, Jim and William, Your input was very informative, and thanks for the reference to Sedgwick. In the end, it does seem to me that all these algorithms require fast lookup by ID of nodes to access data, and that conditional on such fast lookup, algorithms are possible with efficiency O(n) or O(n*log(n)) (depending on whether lookup time is constant or logarithmic). I believe my original algorithm achieves that. We come back to the fact that I assumed that R environments, implemented as hash tables, would give me that fast lookup. But on my systems, their efficiency (for insert and lookup) seems to degrade fast at several million entries. Certainly much faster than either O(1) or O(log(n)). I believe this does not have to do with disk access time. For example, I tested this on my desktop computer, running a pure hash insert loop. I observe 100% processor use but no disk access, as the size of the hash table approaches millions of entries. I have tested this on two systems, but have not gone into the implementation of the hashed environments to look at this in details. If others have the same (or different) experiences with using hashed environments with millions of entries, it would be very useful to know. Barring a solution to the hashed environment speed, it seems the way to speed this algorithm up (within the confines of R) would be to move away from hash tables and towards a numerically indexed array. Thanks again for all of the help, Magnus
On 11/4/2013 8:20 PM, Thomas Lumley wrote:
On Sat, Nov 2, 2013 at 11:12 AM, Martin Morgan <mtmorgan at fhcrc.org
<mailto:mtmorgan at fhcrc.org>> wrote:
On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:
Sure,
I was attempting to be concise and boiling it down to what I saw
as the root
issue, but you are right, I could have taken it a step further.
So here goes.
I have a set of around around 20M string pairs. A given string
(say, A) can
either be equivalent to another string (B) or not. If A and B
occur together in
the same pair, they are equivalent. But equivalence is
transitive, so if A and B
occur together in one pair, and A and C occur together in
another pair, then A
and C are also equivalent. I need a way to quickly determine if
any two strings
from my data set are equivalent or not.
Do you mean that if A,B occur together and B,C occur together, then
A,B and A,C are equivalent?
Here's a function that returns a unique identifier (not well
tested!), allowing for transitive relations but not circularity.
uid <- function(x, y)
{
i <- seq_along(x) # global index
xy <- paste0(x, y) # make unique identifiers
idx <- match(xy, xy)
repeat {
## transitive look-up
y_idx <- match(y[idx], x) # look up 'y' in 'x'
keep <- !is.na <http://is.na>(y_idx)
if (!any(keep)) # no transitive
relations, done!
break
x[idx[keep]] <- x[y_idx[keep]]
y[idx[keep]] <- y[y_idx[keep]]
## create new index of values
xy <- paste0(x, y)
idx <- match(xy, xy)
}
idx
}
Values with the same index are identical. Some tests
> x <- c(1, 2, 3, 4)
> y <- c(2, 3, 5, 6)
> uid(x, y)
[1] 1 1 1 4
> i <- sample(x); uid(x[i], y[i])
[1] 1 1 3 1
> uid(as.character(x), as.character(y)) ## character() ok
[1] 1 1 1 4
> uid(1:10, 1 + 1:10)
[1] 1 1 1 1 1 1 1 1 1 1
> uid(integer(), integer())
integer(0)
> x <- c(1, 2, 3)
> y <- c(2, 3, 1)
> uid(x, y) ## circular!
C-c C-c
I think this will scale well enough, but the worst-case scenario can
be made to be log(longest chain) and copying can be reduced by using
an index i and subsetting the original vector on each iteration. I
think you could test for circularity by checking that the updated x
are not a permutation of the kept x, all(x[y_idx[keep]] %in% x[keep]))
Martin
This problem (union-find) is discussed in Chapter 1 of Sedgwick's
"Algorithms". There's an algorithm given that takes linear time to
build the structure, worst-case logarithmic time to query, and
effectively constant average time to query (inverse-Ackerman amortized
complexity).
-thomas
--
Thomas Lumley
Professor of Biostatistics
University of Auckland