Inserting 17M entries into env took 18h; inserting 34M entries is taking 5+ days
On 11/01/2013 08:22 AM, Magnus Thor Torfason wrote:
Sure, I was attempting to be concise, boiling it down to what I saw as the root issue, but you are right, I could have taken it a step further. So here goes.

I have a set of around 20M string pairs. A given string (say, A) can either be equivalent to another string (B) or not. If A and B occur together in the same pair, they are equivalent. But equivalence is transitive, so if A and B occur together in one pair, and A and C occur together in another pair, then A and C are also equivalent.

I need a way to quickly determine whether any two strings from my data set are equivalent or not.
Do you mean that if A,B occur together and B,C occur together, then A,B and A,C
are equivalent?
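(As an aside, this is the classic union-find / connected-components problem. A minimal base-R sketch, with made-up names and not tuned for 20M pairs, just for comparison:)

```r
## Union-find over string labels: each label starts as its own root, and
## every pair merges the two trees its members belong to.
equiv_classes <- function(x, y)
{
    labs <- sort(unique(c(x, y)))
    parent <- seq_along(labs)            # parent[i] == i marks a root
    find <- function(i) {                # walk up to the root of i's tree
        while (parent[i] != i)
            i <- parent[i]
        i
    }
    for (k in seq_along(x)) {            # union the two roots of each pair
        rx <- find(match(x[k], labs))
        ry <- find(match(y[k], labs))
        if (rx != ry)
            parent[max(rx, ry)] <- min(rx, ry)
    }
    ## resolve every label to its final root
    setNames(vapply(seq_along(labs), find, integer(1)), labs)
}

cls <- equiv_classes(c("A", "B", "D"), c("B", "C", "E"))
cls["A"] == cls["C"]   # TRUE: A and C are equivalent
cls["A"] == cls["D"]   # FALSE
```

Two labels are equivalent exactly when they map to the same root.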
Here's a function that returns a unique identifier (not well tested!), allowing
for transitive relations but not circularity.

uid <- function(x, y)
{
    i <- seq_along(x)               # global index (unused here; see note
                                    # below about reducing copying)
    xy <- paste(x, y, sep = "\r")   # unique pair identifiers; the separator
                                    # avoids collisions such as
                                    # paste0("1", "23") == paste0("12", "3")
    idx <- match(xy, xy)
    repeat {
        ## transitive look-up
        y_idx <- match(y[idx], x)   # look up 'y' in 'x'
        keep <- !is.na(y_idx)
        if (!any(keep))             # no transitive relations, done!
            break
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]
        ## create new index of values
        xy <- paste(x, y, sep = "\r")
        idx <- match(xy, xy)
    }
    idx
}
Values with the same index belong to the same equivalence class. Some tests:
> x <- c(1, 2, 3, 4)
> y <- c(2, 3, 5, 6)
> uid(x, y)
[1] 1 1 1 4
> i <- sample(x); uid(x[i], y[i])
[1] 1 1 3 1
> uid(as.character(x), as.character(y)) ## character() ok
[1] 1 1 1 4
> uid(1:10, 1 + 1:10)
[1] 1 1 1 1 1 1 1 1 1 1
> uid(integer(), integer())
integer(0)
> x <- c(1, 2, 3)
> y <- c(2, 3, 1)
> uid(x, y) ## circular!
C-c C-c
I think this will scale well enough, but the worst case can be reduced to
log(longest chain) iterations, and copying can be reduced by using an index i
and subsetting the original vector on each iteration. I think you could test
for circularity by checking whether the updated x values are just a permutation
of the kept ones:
    all(x[y_idx[keep]] %in% x[keep])
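A sketch of the same function with that circularity guard wired in (lightly tested; it stops with an error instead of hanging on circular input):

```r
## uid() with a circularity guard: same vectorized look-up loop, but it
## errors out when the chains loop back on themselves.
uid_safe <- function(x, y)
{
    xy <- paste(x, y, sep = "\r")        # separator avoids pasted-key collisions
    idx <- match(xy, xy)
    repeat {
        y_idx <- match(y[idx], x)        # look up 'y' in 'x'
        keep <- !is.na(y_idx)
        if (!any(keep))                  # no transitive relations, done!
            break
        ## guard: if the updated x's are just a permutation of the kept
        ## x's, following the chains can never terminate
        if (all(x[y_idx[keep]] %in% x[keep]))
            stop("circular relation detected")
        x[idx[keep]] <- x[y_idx[keep]]
        y[idx[keep]] <- y[y_idx[keep]]
        xy <- paste(x, y, sep = "\r")
        idx <- match(xy, xy)
    }
    idx
}

uid_safe(c(1, 2, 3, 4), c(2, 3, 5, 6))   # [1] 1 1 1 4
## uid_safe(c(1, 2, 3), c(2, 3, 1))      # Error: circular relation detected
```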
Martin
The way I do this currently is to designate the smallest (alphabetically) string
in each known equivalence set as the "main" entry. For each pair, I therefore
insert two entries into the hash table, both pointing at the main value. So
assuming the input data:
A,B
B,C
D,E
I would then have:
A->A
B->A
C->B
D->D
E->D
Except that I also follow each chain until I reach the end (key==value), and
insert pointers to the "main" value for every value I find along the way. After
doing that, I end up with:
A->A
B->A
C->A
D->D
E->D
And I can very quickly check equivalence, either by comparing the hash of two
strings, or simply by transforming each string into its hash, and then I can use
simple comparison from then on. The code for generating the final hash table is
as follows:
h : empty hash table, created with hash.new()
d : input data
hash.deep.get : function that iterates through the hash table until it finds a
    key whose value is equal to itself (until hash.get(h, X) == X), then
    returns all the values found along the way in a vector

h = hash.new()
for ( i in 1:nrow(d) )
{
    deep.a      = hash.deep.get(h, d$a[i])
    deep.b      = hash.deep.get(h, d$b[i])
    equivalents = sort(unique(c(deep.a, deep.b)))
    equiv.id    = min(equivalents)
    for ( equivalent in equivalents )
    {
        hash.put(h, equivalent, equiv.id)
    }
}
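For reference, a self-contained, runnable version of that loop using an environment as the hash table. The hash.new/hash.put/hash.deep.get names above are pseudocode, so the sketch below writes them out explicitly (follow_chain plays the role of hash.deep.get):

```r
## Build the equivalence table described above, with an environment as the
## hash table. follow_chain() walks key -> value until key == value,
## collecting every key it sees along the way.
build_equiv <- function(d)
{
    h <- new.env(hash = TRUE)
    follow_chain <- function(key) {
        seen <- character()
        while (!(key %in% seen)) {       # the %in% check also stops on cycles
            seen <- c(seen, key)
            nxt <- if (exists(key, envir = h, inherits = FALSE))
                       get(key, envir = h, inherits = FALSE)
                   else key
            if (identical(nxt, key))
                break
            key <- nxt
        }
        seen
    }
    for (i in seq_len(nrow(d))) {
        equivalents <- sort(unique(c(follow_chain(d$a[i]),
                                     follow_chain(d$b[i]))))
        equiv.id <- equivalents[1]       # smallest string becomes the "main" entry
        for (e in equivalents)
            assign(e, equiv.id, envir = h)
    }
    h
}

d <- data.frame(a = c("A", "B", "D"), b = c("B", "C", "E"),
                stringsAsFactors = FALSE)
h <- build_equiv(d)
get("C", envir = h)   # "A"
get("E", envir = h)   # "D"
```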
I would so much appreciate it if there were a simpler and faster way to do this.
Keeping my fingers crossed that one of the R-help geniuses who sees this is
sufficiently interested to crack the problem.
Best,
Magnus
On 11/1/2013 1:49 PM, jim holtman wrote:
It would be nice if you followed the posting guidelines and at least showed the
script that is creating your entries now, so that we understand the problem you
are trying to solve. A bit more explanation of why you want this would be
useful. This gets to the second part of my tag line: tell me what you want to
do, not how you want to do it. There may be other solutions to your problem.

Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.

On Fri, Nov 1, 2013 at 9:32 AM, Magnus Thor Torfason <zulutime.net at gmail.com> wrote:
Pretty much what the subject says: I used an env as the basis for a hashtable
in R, based on information that this is in fact the way environments are
implemented under the hood.

I've been experimenting with doubling the number of entries, and so far it has
seemed to be scaling more or less linearly, as expected. But as I went from 17
million entries to 34 million entries, the completion time has gone from 18
hours to 5 days and counting.

The keys and values are in all cases strings of equal length. One might suspect
that the slow-down has to do with the memory being swapped to disk, but from
what I know about my computing environment, that should not be the case.

So my first question: Is anyone familiar with anything in the implementation of
environments that would limit their use, or slow them down faster than
O(n*log(n)), as the number of entries is increased?

And my second question: I realize that this is not strictly what R environments
were designed for, but this is what my algorithm requires: I must go through
these millions of entries, storing them in the hash table and sometimes
retrieving them along the way, in a more or less random manner, which is
contingent on the data I am encountering and on the contents of the hash table
at each moment. Does anyone have a good recommendation for alternatives for
implementing huge, fast, table-like structures in R?

Best,
Magnus
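One thing that may be worth testing, offered as a guess rather than a diagnosis: new.env() takes a 'size' hint for its hash table, and pre-sizing it to roughly the expected number of keys could avoid repeated growth/rehashing as millions of entries are inserted. The semantics are unchanged; size is only a hint:

```r
## Pre-size the environment's hash table (size values here are illustrative)
h <- new.env(hash = TRUE, size = 1000L)   # e.g. size = 40000000L for ~34M keys
assign("someKey", "someValue", envir = h)
get("someKey", envir = h)                       # "someValue"
exists("otherKey", envir = h, inherits = FALSE) # FALSE
```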
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793