
Message-ID: <F023612C-3AEA-44D1-A8E4-7B0D671B8C86@gmail.com>
Date: 2012-11-19T07:57:35Z
From: Dr Gregory Jefferis
Subject: lazy load (and unload?) elements of large list
In-Reply-To: <CAJDot1pGe-dYxCfSO+kQuzj0OdotGboaZ3i4e+nMVRh=u7psRg@mail.gmail.com>

Dear Saptarshi,

Thanks for your response.

On 17 Nov 2012, at 05:42, Saptarshi Guha wrote:

> One question: how large is each list element? < 256MB?

Fairly small (< 5 MB).

> One approach: store your data in Hbase (or Hadoop MapFiles), with key
> as the list index.

Can Hbase cope with arbitrary format data (like a list) rather than just tables?
> 
> Then define an object O of class C. Redefine "[[".C as a function that
> queries HBase/HDFS for the list index i (e.g. as in x[[i]]) and
> retrieves the i'th list element.

Yes, that was the sort of thing I had in mind.

> Cache this, so that the second time x[[i]] is
> called, it will retrieve it from the cache. To prevent the cache from
> expanding to 2GB, you can keep the last K cache entries (some MRU/LRU-type
> cache retention scheme).

I was hoping someone might have done some work on this type of strategy that could be reused!
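For concreteness, the strategy you describe could be sketched in plain R roughly as below. Everything here is invented for illustration: `lazylist`, the `K` parameter, and the `readRDS` backend (one `.rds` file per element) merely stand in for the real HBase/HDFS lookup; an environment is used so the cache mutates in place across calls.

```r
## Sketch of a list-like object whose "[[" fetches elements on demand
## and keeps only the K most recently used in memory.
lazylist <- function(fetch, K = 100) {
  obj <- new.env(parent = emptyenv())
  obj$fetch <- fetch      # backend lookup: index -> element
  obj$K     <- K          # maximum number of cached elements
  obj$cache <- list()     # named list, insertion order = recency order
  class(obj) <- "lazylist"
  obj
}

"[[.lazylist" <- function(x, i) {
  key <- as.character(i)
  hit <- x$cache[[key]]
  if (is.null(hit)) {
    hit <- x$fetch(i)               # cache miss: hit the backend
    x$cache[[key]] <- hit
    if (length(x$cache) > x$K)
      x$cache[[1]] <- NULL          # evict the least recently used
  } else {
    x$cache[[key]] <- NULL          # cache hit: move to the MRU end
    x$cache[[key]] <- hit
  }
  hit
}

## Toy backend: element i lives in its own file on disk.
dir <- tempfile(); dir.create(dir)
for (i in 1:10) saveRDS(i * 100, file.path(dir, paste0(i, ".rds")))

big <- lazylist(function(i) readRDS(file.path(dir, paste0(i, ".rds"))),
                K = 3)
big[[7]]   # first access reads from disk; later accesses hit the cache
```

With an HBase backend the `fetch` function would instead issue the key lookup, but the caching and `"[["` dispatch would be unchanged.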

> Not sure how you intend to use this list, the application above
> handles the query of some subset of keys (indices 1...16K). 
> Do you want to run some function F across all/large subset of keys?
> This is a good case for R and Hadoop.

I have two use cases. The first is batch processing on a cluster where all elements will be processed, typically by a function that takes a pair of elements as input and computes scores for all 16k^2 combinations. For the time being simple approaches seem to work here, but Hadoop is something I have been meaning to investigate for the longer term. The second main use case, interactive use on a single-user machine, where I am more concerned about saturating memory, is what I am trying to address currently.
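The batch case amounts to something like the toy sketch below; `score()` is a hypothetical stand-in for the real pairwise scoring function, and the three-element `xs` stands in for the 16k-element list.

```r
## All-pairs batch computation: compute score(xs[[i]], xs[[j]])
## for every (i, j) combination. score() is a stand-in.
score <- function(a, b) sum(abs(a - b))

xs <- list(1:3, 4:6, 7:9)        # toy stand-in for the 16k-element list
n  <- length(xs)
scores <- matrix(0, n, n)
for (i in seq_len(n))
  for (j in seq_len(n))
    scores[i, j] <- score(xs[[i]], xs[[j]])
```

On a cluster each worker would take a block of (i, j) pairs, which is the shape of job that maps naturally onto Hadoop/MapReduce.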

Best wishes,

Greg.