Enormous Datasets
Thomas W Volscho <THOMAS.VOLSCHO at huskymail.uconn.edu> writes:
Dear List, I have some projects where I use enormous datasets. For instance, the 5% PUMS microdata from the Census Bureau. After deleting cases I may have a dataset with 7 million+ rows and 50+ columns. Will R handle a datafile of this size? If so, how?
With a big machine... If that is numeric, non-integer data (i.e. stored as 8-byte doubles), you are looking at something like
> 7e6*50*8
[1] 2.8e+09

i.e. roughly 3 GB of data for one copy of the data set. You easily find yourself with multiple copies, so I suppose a machine with 16 GB of RAM would cut it. These days that basically suggests the x86_64 architecture running Linux (e.g. multiprocessor Opterons), but there are also 64-bit Unix "big iron" solutions (Sun, IBM, HP, ...).

If you can avoid dealing with the whole data set at once, smaller machines might get you there. Notice that one column is "only" 56 MB, and you may be able to work with aggregated data from some step onwards.
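A minimal sketch of that column-at-a-time, chunk-at-a-time idea in base R. The file name pums.csv, the comma-separated layout, the 50-column count, and the choice to keep only columns 3, 7 and 12 are hypothetical stand-ins for the real extract; the point is that colClasses = "NULL" drops unwanted columns at read time, and nrows= lets you accumulate an aggregate chunk by chunk instead of holding all 7 million rows:

  ## Hypothetical layout: 50 comma-separated columns, keep only 3 of them.
  cc <- rep("NULL", 50)          # "NULL" means: skip this column entirely
  cc[c(3, 7, 12)] <- "numeric"   # the three columns we actually need

  con <- file("pums.csv", open = "r")
  invisible(readLines(con, n = 1))   # discard the header line
  totals <- 0                        # running column sums as the aggregate
  repeat {
    chunk <- tryCatch(
      read.table(con, sep = ",", colClasses = cc,
                 nrows = 1e6, header = FALSE),
      error = function(e) NULL)      # read.table errors once the file is exhausted
    if (is.null(chunk) || nrow(chunk) == 0) break
    totals <- totals + colSums(chunk)
  }
  close(con)
  totals

Each chunk is about 1e6 * 3 * 8 bytes, i.e. around 24 MB, which fits comfortably on a modest machine; the chunk size of 1e6 rows is an arbitrary knob.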
-- 
Peter Dalgaard
Dept. of Biostatistics, University of Copenhagen
Blegdamsvej 3, 2200 Copenhagen N, Denmark
Ph:  (+45) 35327918
FAX: (+45) 35327907
p.dalgaard at biostat.ku.dk