Enormous Datasets

3 messages · Thomas W Volscho, Peter Dalgaard, Roger D. Peng

#
Dear List,
I have some projects where I use enormous datasets.  For instance, the 5% PUMS microdata from the Census Bureau.  After deleting cases I may have a dataset with 7 million+ rows and 50+ columns.  Will R handle a datafile of this size?  If so, how?

Thank you in advance,
Tom Volscho

************************************        
Thomas W. Volscho
Graduate Student
Dept. of Sociology U-2068
University of Connecticut
Storrs, CT 06269
Phone: (860) 486-3882
http://vm.uconn.edu/~twv00001
#
Thomas W Volscho <THOMAS.VOLSCHO at huskymail.uconn.edu> writes:
With a big machine... If that is numeric, non-integer data (8 bytes
per double), you are looking at something like

> 7e6 * 50 * 8
[1] 2.8e+09

i.e. roughly 3 GB of data for one copy of the data set. You easily
find yourself with multiple copies, so I suppose a machine with 16GB
RAM would cut it. These days that basically suggests x86_64
architecture running Linux (e.g. multiprocessor Opterons), but there
are also 64 bit Unix "big iron" solutions (Sun, IBM, HP,...).

If you can avoid dealing with the whole dataset at once, smaller
machines might get you there. Notice that 1 column is "only" 56MB, and
you may be able to work with aggregated data from some step onwards.
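One way to avoid loading the whole dataset is to read only the columns you need. A minimal sketch (a hypothetical 4-column file standing in for the 50-column PUMS extract): read.csv() drops any column whose colClasses entry is "NULL", so the unwanted columns never occupy memory.

```r
## Toy stand-in for the real file; column names are invented.
writeLines(c("age,sex,state,wage",
             "34,1,9,52000",
             "61,2,9,18000"), "demo.csv")

cc <- c("integer", "NULL", "NULL", "integer")  # keep age and wage only
d  <- read.csv("demo.csv", colClasses = cc)
str(d)   # 2 obs. of 2 variables: age, wage
```

Declaring column types up front also spares read.csv() the work of guessing them, which matters on multi-million-row files.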
#
It depends on what you mean by 'handle', but probably not.  You'll 
likely have to split the file into multiple files unless you have some 
rather high end hardware.   However, in my limited experience, there's 
almost always a meaningful way to split the data (geographically, or 
by other categories).
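The split-by-category idea can be sketched with split(), shown here on a toy data frame (the variable names are invented for illustration). In practice each piece could be save()d to its own file and analysed separately:

```r
d <- data.frame(state  = rep(c("CT", "NY"), each = 3),
                income = c(40, 52, 38, 61, 45, 70) * 1000)

pieces <- split(d, d$state)    # one data frame per state
means  <- sapply(pieces, function(s) mean(s$income))
means
```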

A few things I've learned recently working with large datasets:

1.  Store files in .rda format using save() -- the load times are much 
faster and loading takes up less memory
2.  If your data are integers, store them as integers!
3.  Don't store character variables in dataframes -- use factors
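A small sketch of the three points above together (the file name is illustrative):

```r
x <- data.frame(age   = as.integer(c(34, 61, 29)),   # 2: integers as integers
                state = factor(c("CT", "NY", "CT"))) # 3: factors, not character
save(x, file = "pums-subset.rda")                    # 1: binary .rda format
rm(x)
load("pums-subset.rda")   # restores the object 'x'
```

Integers take 4 bytes per value instead of the 8 a double needs, so point 2 alone can halve the footprint of an all-integer table.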

-roger