
Inserting 17M entries into env took 18h, inserting 34M entries taking 5+ days

Thanks to Thomas, Martin, Jim and William,

Your input was very informative, and thanks for the reference to Sedgewick.

In the end, it does seem to me that all these algorithms require fast 
lookup of nodes by ID to access their data, and that, given such fast 
lookup, the algorithms can run in O(n) or O(n*log(n)) time (depending 
on whether lookup is constant or logarithmic). I believe my original 
algorithm achieves that.

We come back to the fact that I assumed that R environments, implemented 
as hash tables, would give me that fast lookup. But on my systems, their 
efficiency (for insert and lookup) degrades rapidly at several million 
entries -- certainly much faster than either O(1) or O(log(n)) would 
predict. I believe this is not due to disk access time. For example, I 
tested this on my desktop computer, running a pure hash insert loop, and 
observed 100% processor use but no disk access as the size of the hash 
table approached millions of entries.
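For reference, a minimal sketch of the kind of pure insert loop I mean (the entry count, key format, and initial size here are illustrative assumptions, not my exact test setup):

```r
## Sketch: pure insert loop into a hashed environment.
## n and the character keys are illustrative, not my actual benchmark.
n <- 1e6
env <- new.env(hash = TRUE, size = n)

t <- system.time(
  for (i in seq_len(n)) {
    ## assign() with a character key is the environment "hash insert"
    assign(as.character(i), i, envir = env)
  }
)
print(t)
```

If inserts were amortized O(1), elapsed time should scale roughly linearly as n grows; timing this loop at increasing n is one way to see where the degradation sets in.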

I have tested this on two systems, but have not gone into the 
implementation of the hashed environments to look at this in detail. If 
others have the same (or different) experiences with using hashed 
environments with millions of entries, it would be very useful to know.

Barring a solution to the hashed-environment slowdown, it seems the way 
to speed this algorithm up (within the confines of R) would be to move 
away from hash tables and towards a numerically indexed array.
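To sketch what I mean by a numerically indexed array, assuming node IDs are (or can be remapped to) dense integers 1..n -- the names and sizes below are hypothetical:

```r
## Sketch: replace environment lookup by key with direct numeric indexing.
## Assumes node IDs can be mapped to consecutive integers 1..n.
n <- 1e6
values <- numeric(n)            # preallocated; element access is O(1)

## insert:  values[id] <- x     instead of  assign(key, x, envir = env)
for (id in seq_len(n)) {
  values[id] <- id * 2
}

## lookup:  values[id]          instead of  get(key, envir = env)
x <- values[123456L]
```

If the IDs are not dense, one option would be to keep a sorted vector of IDs alongside and locate positions with match() or findInterval(), which should still give predictable O(log n) lookup.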

Thanks again for all of the help,
Magnus
On 11/4/2013 8:20 PM, Thomas Lumley wrote: