
What is the best package for large data cleaning (not statistical analysis)?

Exactly what type of cleaning do you want to do on the data?  Can you
read it in a block at a time (e.g., 1M records), clean each block, and
then write it back out?  You would have the choice of writing the
cleaned data back out as a text file, or of storing it using the
'filehash' package.  I have used that technique to segment a year's
worth of data, roughly 3GB of text, into monthly dataframes of about
70MB each that I stored using filehash.  I then read these back in for
processing, summarizing by month.  So it all depends on what you want
to do.
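A minimal sketch of the block-at-a-time route in base R.  Everything
here is illustrative: the tiny generated "big.csv" stands in for the
real multi-GB file, the 1M-record block size is the one mentioned
above, and clean_block() is a placeholder cleaning rule (dropping rows
with a missing 'value' field).

```r
## Stand-in for the real multi-GB file: a tiny sample CSV.
writeLines(c("date,value",
             "2008-01-03,10",
             "2008-01-15,NA",
             "2008-02-02,7"),
           "big.csv")

## Placeholder cleaning rule: drop records with a missing 'value'.
clean_block <- function(d) d[!is.na(d$value), ]

con  <- file("big.csv", open = "r")
hdr  <- readLines(con, n = 1)            # keep the header line
cols <- strsplit(hdr, ",")[[1]]
writeLines(hdr, "big_clean.csv")

repeat {
  lines <- readLines(con, n = 1e6)       # one block of ~1M records
  if (length(lines) == 0) break          # end of file
  d <- read.csv(textConnection(lines), header = FALSE, col.names = cols)
  d <- clean_block(d)
  write.table(d, "big_clean.csv", append = TRUE, sep = ",",
              row.names = FALSE, col.names = FALSE)
}
close(con)
```

Because each block is processed and appended independently, memory use
stays bounded by the block size rather than the file size.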

You could read in the chunks, clean them, and then reshape them into
dataframes to process later.  You will probably still have the problem
that all the data won't fit in memory at once.  One thing that helped:
since the dataframes were stored as binary objects in filehash, it was
fast to retrieve them, pick out the data I needed from each month, and
create a subset of just that data, which did fit in memory.
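The fetch-and-subset step above can be sketched with the filehash API
(dbCreate, dbInit, dbInsert, dbList, dbFetch).  The two small monthly
dataframes stand in for the ~70MB stored objects, and the value > 4
filter is just an illustrative selection rule.

```r
library(filehash)  # binary object store for the monthly dataframes

## Stand-in monthly objects; in practice these would be the cleaned
## ~70MB dataframes written out during the chunked pass.
dbCreate("months.db")
db <- dbInit("months.db")
dbInsert(db, "2008-01", data.frame(id = 1:3, value = c(5, 9, 2)))
dbInsert(db, "2008-02", data.frame(id = 4:6, value = c(8, 1, 6)))

## Fetch each month, keep only the rows/columns needed, and rbind the
## small pieces into one subset that fits in memory.
subset_list <- lapply(dbList(db), function(key) {
  m <- dbFetch(db, key)
  m[m$value > 4, c("id", "value")]   # illustrative filter
})
small <- do.call(rbind, subset_list)
```

Only the per-month subsets are ever held in memory at the same time,
which is what makes the final combined dataframe manageable.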

So it all depends ...
On Sat, Mar 14, 2009 at 8:46 PM, Sean Zhang <seanecon at gmail.com> wrote: