lazy load (and unload?) elements of large list
Dear Saptarshi, Thanks for your response.
On 17 Nov 2012, at 05:42, Saptarshi Guha wrote:
One question: how large is each list element? < 256MB?
fairly small (< 5 MB)
One approach: store your data in HBase (or Hadoop MapFiles), with the key as the list index.
Can HBase cope with arbitrary-format data (like a list) rather than just tables?
Then define an object O of class C. Redefine "[[.C" as a function that queries HBase/HDFS for the list index i (e.g. as in x[[i]]) and retrieves the i'th list element.
Yes, that was the sort of thing I had in mind.
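A minimal sketch of that idea in R, using per-element .rds files on disk as a stand-in for the HBase/HDFS back end (the names "lazylist", "dir" and the file layout are hypothetical; an HBase client call would replace readRDS()):

```r
## Construct a lazy list backed by files elem_000001.rds ... elem_<n>.rds
## in 'dir'; only metadata is held in memory.
lazylist <- function(dir, n) {
  structure(list(dir = dir, n = n), class = "lazylist")
}

length.lazylist <- function(x) x$n

## Redefine "[[" for the class: x[[i]] fetches element i on demand
## from the backing store instead of from memory.
"[[.lazylist" <- function(x, i) {
  readRDS(file.path(x$dir, sprintf("elem_%06d.rds", i)))
}
```

With this, x <- lazylist("/path/to/store", 16000) keeps essentially nothing in memory, and x[[3]] reads only element 3 from disk when it is first needed.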
Cache this, so that the second time x[[i]] is called, it is retrieved from the cache. To prevent the cache from expanding to 2 GB, you can keep only the last K cache entries (some MRU/LRU-type cache retention scheme).
I was hoping someone might have done some work on this type of strategy that could be reused!
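For what it's worth, a rough sketch of such an LRU retention scheme in plain R (names like "make_lru" and "fetch" are hypothetical), keeping at most max_entries elements in memory:

```r
## Wrap a fetch function (e.g. one that queries HBase) in an LRU cache.
make_lru <- function(max_entries, fetch) {
  cache <- new.env(parent = emptyenv())
  order <- character(0)          # keys, least- to most-recently used
  function(i) {
    key <- as.character(i)
    if (exists(key, envir = cache, inherits = FALSE)) {
      ## Cache hit: mark key as most recently used and return it.
      order <<- c(setdiff(order, key), key)
      return(get(key, envir = cache))
    }
    val <- fetch(i)              # cache miss: hit the backing store
    if (length(order) >= max_entries) {
      ## Evict the least recently used entry.
      rm(list = order[1], envir = cache)
      order <<- order[-1]
    }
    assign(key, val, envir = cache)
    order <<- c(order, key)
    val
  }
}
```

Used as get_elem <- make_lru(100, function(i) x[[i]]), the first get_elem(5) goes to the backing store and subsequent calls come from memory until element 5 is evicted.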
Not sure how you intend to use this list; the approach above handles queries for some subset of keys (indices 1...16K). Do you want to run some function F across all keys, or a large subset of them? That is a good case for R and Hadoop.
I have two use cases. One is batch processing on a cluster, where all elements will be processed, typically by a function that takes a pair of elements as input and computes scores for all 16k^2 combinations; this is much faster when parallelised. For the time being simple approaches seem to work here, but Hadoop is something I have been meaning to investigate for the longer term. The second main use case, interactive use on a single-user machine, where I am more concerned about saturating memory, is what I am trying to address currently.

Best wishes,
Greg.