Skip to content

[RsR] [R] Median of streaming data

6 messages · Martin Maechler, Martyn Byng, Mohan Radhakrishnan +3 more

#

        
> On 24/09/14 17:31, Mohan Radhakrishnan wrote:
>> Hi,
    >> 
    >> I have streaming data(1 TB) that can't fit in memory. Is
    >> there a way for me to find the median of these streaming
    >> integers assuming I can fit only a small part in memory ?
    >> This is about the statistical approach to find the median
    >> of a large number of values when I can inspect only a
    >> part of them due to memory constraints.

    > You cannot, I'm pretty sure, calculate the median
    > recursively.  However there are "approximate" recursive
    > median algorithms which provide an estimate of location
    > that has the same asymptotic properties as the median.

    > See:

    > * U. Holst, Recursive estimators of location.
    > Commun. Statist. Theory Meth., vol. 16, 1987,
    > pp. 2201--2226.

    > and

    > * Murray A. Cameron and T. Rolf Turner, Recursive location
    > and scale estimators, Commun. Statist. Theory Meth.,
    > vol. 22, 1993, pp. 2503--2515.

This is really interesting to me, thank you, Rolf!

OTOH,

1) has your proposal ever been provided in R?
   I'd be happy to add it to the robustX
   (http://cran.ch.r-project.org/web/packages/robustX) or even
   robustbase (http://cran.ch.r-project.org/web/packages/robustbase) package.

2) Would anybody know of more recent research on the subject?
   (I quickly "googled around" and found research more geared
    for the time series situation which is more involved anyway)

   --> Hence CC'ing the experts' list  R-SIG-robust


Martin Maechler,  ETH Zurich


    > cheers,
    > Rolf Turner

    > -- 
    > Rolf Turner Technical Editor ANZJS
#
Something else that might be of interest ...

Zhang Q and Wang W (2007) A fast algorithm for approximate quantiles in high speed data streams Proceedings of the 19th International Conference on Scientific and Statistical Database Management IEEE Computer Society 29

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Martin Maechler
Sent: 24 September 2014 09:17
To: Rolf Turner
Cc: R-help at r-project.org; R-SIG-robust at r-project.org
Subject: Re: [R] Median of streaming data

        
> On 24/09/14 17:31, Mohan Radhakrishnan wrote:
>> Hi,
    >> 
    >> I have streaming data(1 TB) that can't fit in memory. Is
    >> there a way for me to find the median of these streaming
    >> integers assuming I can fit only a small part in memory ?
    >> This is about the statistical approach to find the median
    >> of a large number of values when I can inspect only a
    >> part of them due to memory constraints.

    > You cannot, I'm pretty sure, calculate the median
    > recursively.  However there are "approximate" recursive
    > median algorithms which provide an estimate of location
    > that has the same asymptotic properties as the median.

    > See:

    > * U. Holst, Recursive estimators of location.
    > Commun. Statist. Theory Meth., vol. 16, 1987,
    > pp. 2201--2226.

    > and

    > * Murray A. Cameron and T. Rolf Turner, Recursive location
    > and scale estimators, Commun. Statist. Theory Meth.,
    > vol. 22, 1993, pp. 2503--2515.

This is really interesting to me, thank you, Rolf!

OTOH,

1) has your proposal ever been provided in R?
   I'd be happy to add it to the robustX
   (http://cran.ch.r-project.org/web/packages/robustX) or even
   robustbase (http://cran.ch.r-project.org/web/packages/robustbase) package.

2) Would anybody know of more recent research on the subject?
   (I quickly "googled around" and found research more geared
    for the time series situation which is more involved anyway)

   --> Hence CC'ing the experts' list  R-SIG-robust


Martin Maechler,  ETH Zurich


    > cheers,
    > Rolf Turner

    > -- 
    > Rolf Turner Technical Editor ANZJS

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

________________________________________________________________________
This e-mail has been scanned for all viruses by Star.\ _...{{dropped:3}}
#
I meant the papers. I hit a paywall. Can we reconstruct the code from the
papers ?

Thanks,
Mohan
On Wed, Sep 24, 2014 at 3:59 PM, Martyn Byng <martyn.byng at nag.co.uk> wrote:

            

  
  
#
Two further interesting references might be
http://www.stat.cmu.edu/~ryantibs/papers/median.pdf
and
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6691602

Best
Roland

Am 24.09.2014 12:29, schrieb Martyn Byng:
#
Martin,

There's also the work of a former PhD student in our Dept:

http://arxiv.org/pdf/1007.1032.pdf

Matias
On 24/09/2014 1:16 AM, Martin Maechler wrote:
#
On 25/09/14 02:55, Mohan Radhakrishnan wrote:
Shouldn't be too tough; the algorithm (in the Cameron and Turner paper) 
is actually pretty simple.

I intend to code it up again; could not find any trace of the old 
(Splus?) code in my chaotic archives.

Should be able to get it done "real soon now".

cheers,

Rolf