Message-ID: <BANLkTikt67oUWKEGQ=M_BY8_3rEG1OVxXQ@mail.gmail.com>
Date: 2011-05-25T12:12:23Z
From: Jonathan Daily
Subject: Processing large datasets
In-Reply-To: <4DDC858F.7060700@naumenko.ca>

When a dataset is too large to fit into R's memory, I pull only the
relevant data with SQL and then analyze it in R. Several packages are
designed for this, such as [1] and [2] below: they let you query a
database with SQL and get the result back as an R data.frame.

[1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
[2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
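
A minimal sketch of that workflow, in case it helps. It uses the DBI
interface with RSQLite and an in-memory database so it runs on its own.
The table name, column names, and sample rows are made up for
illustration (loosely modelled on the quoted data), not your actual
schema. For MySQL the idea is the same; as far as I know you would
connect with dbConnect(RMySQL::MySQL(), ...) and your own credentials
instead.

```r
library(DBI)

# In-memory SQLite database standing in for the real MySQL store (assumption).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical table and columns; time_ms is milliseconds from midnight.
quotes <- data.frame(
  symbol  = c("DELL", "DELL", "DELL", "DELL", "MSFT"),
  side    = c("Bid", "Bid", "Bid", "Ask", "Bid"),
  size    = c(400L, 300L, 135L, 200L, 500L),
  price   = c(15.48, 15.48, 15.50, 15.51, 24.10),
  time_ms = c(35482391L, 35482391L, 35550000L, 35482400L, 35482500L),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "quotes", quotes)

# Filter and aggregate in SQL so that only the small summary reaches R.
# time_ms / 60000 buckets each quote into its minute since midnight.
agg <- dbGetQuery(con, "
  SELECT symbol,
         CAST(time_ms / 60000 AS INTEGER) AS minute,
         COUNT(*)   AS n_quotes,
         AVG(price) AS mean_price,
         SUM(size)  AS total_size
  FROM quotes
  WHERE symbol = 'DELL' AND side = 'Bid'
  GROUP BY symbol, minute
  ORDER BY minute
")
dbDisconnect(con)

# The aggregated data.frame is small, so the statistics are done in R.
overall_mean <- mean(agg$mean_price)
overall_sd   <- sd(agg$mean_price)
```

The point of the split is that the database does the filtering and
grouping on the millions of raw rows, and R only ever sees the
per-minute summary, which is where the hypothesis tests would go.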

On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <roman at bestroman.com> wrote:
> Hi R list,
>
> I'm new to R software, so I'd like to ask about its capabilities.
> What I'm looking to do is to run some statistical tests on quite big
> tables which are aggregated quotes from a market feed.
>
> This is a typical set of data.
> Each day contains millions of records (up to 10 million unfiltered).
>
> 2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y  1  1  0  0
> 2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y  1  0  0  0
> 2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y  1  0  0  0
>
> First I'll need to filter it based on some criteria.
> Since I keep it in a MySQL database, this can be done with a query.
> It is not super efficient; I have checked that already.
>
> Then I need to aggregate the dataset into different time frames (time is
> represented in ms from midnight, like 35482391).
> Again, this can be done with a database query; I'm not sure which will
> be faster.
> The aggregated tables are going to be much smaller, on the order of
> thousands of rows per observation day.
>
> Then calculate basic statistics: mean, standard deviation, sums, etc.
> After stats are calculated, I need to perform some statistical
> hypothesis tests.
>
> So, my question is: which tool is faster for aggregating and filtering
> big datasets: MySQL or R?
>
> Thanks,
> --Roman N.
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
===============================================
Jon Daily
Technician
===============================================
#!/usr/bin/env outside
# It's great, trust me.