What is the best package for large data cleaning (not statistical analysis)?

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090314/57bab3a6/attachment-0002.pl>
Exactly what type of cleaning do you want to do on them?  Can you read
in the data a block at a time (e.g., 1M records), clean them up and
then write them back out?  You would have the choice of putting them
back as a text file or possibly storing them using 'filehash'.  I have
used that technique to segment a year's worth of data that was
probably 3GB of text into monthly objects that were about 70MB
dataframes that I stored using filehash.  These I then read back in to
do processing where I could summarize by month.  So it all depends on
what you want to do.

You could read in the chunks, clean them and then reshape them into
dataframes that you could process later.  You will still probably have
the problem that all the data still won't fit in memory.  Now one
thing I did was that since the dataframes were stored as binary
objects in filehash, it was pretty fast to retrieve them, pick out the
data I needed from each month and create a subset of just the data I
needed that would now fit in memory.

So it all depends ...........
Dear R helpers:

I am a newbie to R and have a question related to cleaning large data frames
in R.

So far, I have been using SAS for data cleaning because my data sets are
relatively large (handling multiple files, each could be as large as 5-10
G).
I am not a fan of SAS at all and am eager to move data cleaning tasks into R
completely.

Seems to me, there are 3 options. Using SQL, ff or filehash. I do not want
to learn sql. so my question is more related to ff and filehash.

In specifics,

(1) for merging two large data frames, ?which one is better, ff vs.
filehash?
(2) for reshaping a large data frame (say from long to wide or the opposite)
which one is better, ff vs. filehash?

If you can provide examples, that will be even better.

Many thanks in advance.

-Sean

? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20090315/3467d26a/attachment-0002.pl>
Hi Sean,

you should think about storing the data externally in a sql database.  
this makes you very flexible and you can do a lot of manipultaion  
directly in the db. with the help of stored procedures for example in  
a postgreSQL db you can use almost any preferred languege to  
manipulate the data before loading it into R. there's also a  
procedural language based on R with which you can do a lot of things  
already inside postgresql databases.

and keep in mind: learning sql isn't more difficult than R.

best,

josuah

Am 15.03.2009 um 13:13 schrieb Sean Zhang:
Dear Jim:

Thanks for your reply.
Looks to me, you were using batching.
I used batching to digest large data in Matlab before.
Still wonder the answers to the two specifics questions without  
resorting to
batching.

Thanks.

-Sean

On Sat, Mar 14, 2009 at 10:13 PM, jim holtman <jholtman at gmail.com>  
wrote:

Exactly what type of cleaning do you want to do on them?  Can you  
read
in the data a block at a time (e.g., 1M records), clean them up and
then write them back out?  You would have the choice of putting them
back as a text file or possibly storing them using 'filehash'.  I  
have
used that technique to segment a year's worth of data that was
probably 3GB of text into monthly objects that were about 70MB
dataframes that I stored using filehash.  These I then read back in  
to
do processing where I could summarize by month.  So it all depends on
what you want to do.

You could read in the chunks, clean them and then reshape them into
dataframes that you could process later.  You will still probably  
have
the problem that all the data still won't fit in memory.  Now one
thing I did was that since the dataframes were stored as binary
objects in filehash, it was pretty fast to retrieve them, pick out  
the
data I needed from each month and create a subset of just the data I
needed that would now fit in memory.

So it all depends ...........

On Sat, Mar 14, 2009 at 8:46 PM, Sean Zhang <seanecon at gmail.com>  
wrote:
Dear R helpers:

I am a newbie to R and have a question related to cleaning large  
data
frames
in R.

So far, I have been using SAS for data cleaning because my data  
sets are
relatively large (handling multiple files, each could be as large  
as 5-10
G).
I am not a fan of SAS at all and am eager to move data cleaning  
tasks
into R
completely.

Seems to me, there are 3 options. Using SQL, ff or filehash. I do  
not
want
to learn sql. so my question is more related to ff and filehash.

In specifics,

(1) for merging two large data frames,  which one is better, ff vs.
filehash?
(2) for reshaping a large data frame (say from long to wide or the
opposite)
which one is better, ff vs. filehash?

If you can provide examples, that will be even better.

Many thanks in advance.

-Sean

      [[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html<http://www.r-project.org/posting-guide.html 

and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?

	[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.