
Dealing With Extremely Large Files

6 messages · Charles C. Berry, jim holtman, Gabor Grothendieck +1 more

#
Hi,
I'm sure that a large fixed width file, such as 300 million rows and 1,000
columns, is too large for R to handle on a PC, but are there ways to deal
with it?

For example, is there a way to combine some sampling method with read.fwf so
that you can read in a sample of 100,000 records, for example?

Something like this may make analysis possible.

Once analyzed, is there a way to, say, read in only x rows at a time, save
and score each subset separately, and finally append them back together?

I haven't seen any information on this, if it is possible.  Thank you for
reading, and sorry if the information was easily available and I simply
didn't find it.
#
Try

 	RSiteSearch("biglm")

for some threads that discuss strategy for analyzing big datasets.

HTH,

Chuck
On Fri, 26 Sep 2008, zerfetzen wrote:

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
#
You can always set up a "connection" and then read in the number of
lines you need for the analysis, write out the results, and then read
in the next ones.  I have also used 'filehash' to initially read in
portions of a file and then write the objects into a database.
These are quickly retrieved if I want to make subsequent passes
through the data.
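A minimal base-R sketch of that connection approach, using a small in-memory file for illustration (the field positions, chunk size, and sample data are made up, loosely matching the layout described later in the thread):

```r
# Hypothetical fixed-width lines: state (cols 1-2), zipcode (3-7), score (8-10)
lines <- c("CA90210 85", "NY10001 72", "TX73301 64", "WA98101 91")
tf <- tempfile()
writeLines(lines, tf)

con <- file(tf, open = "r")   # open the connection once; reads resume where they left off
chunk_size <- 2
results <- list()
repeat {
  chunk <- readLines(con, n = chunk_size)
  if (length(chunk) == 0) break          # end of file
  # carve out the fixed-width fields with substring()
  state <- substring(chunk, 1, 2)
  score <- as.numeric(substring(chunk, 8, 10))
  # "score" each subset here, then accumulate (or write out) the results
  results[[length(results) + 1]] <- data.frame(state = state, score = score)
}
close(con)
all_scored <- do.call(rbind, results)    # append the subsets back together
```

Because the connection stays open between calls, each `readLines()` continues from where the previous chunk ended, so only `chunk_size` rows are ever held in memory at once.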

Even 100,000 rows will probably tax your machine: if the fields are
numeric, you will need 800MB to store a single copy of the object, and
you will probably need 3-4X that amount (a total of about 4GB of physical
memory) if you are doing any processing that might make copies.
Hopefully you are running on a 64-bit system with lots of memory.
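For reference, the 800MB figure follows directly from the size of a double in R (8 bytes per value):

```r
# Approximate memory for one numeric object of 100,000 rows x 1,000 columns
rows  <- 1e5
cols  <- 1e3
bytes <- rows * cols * 8   # 8 bytes per double-precision value
mb    <- bytes / 2^20      # bytes -> megabytes
round(mb)                  # roughly 763 MB, i.e. the ~800MB cited above
```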
On Fri, Sep 26, 2008 at 3:55 PM, zerfetzen <zerfetzen at yahoo.com> wrote:

#
Not sure if it applies to your file or not, but if it does then the
sqldf package facilitates reading a large file into an SQLite database.
It's a front end to RSQLite, which is a front end to SQLite, and it
reads the data straight into the database without going through
R, so R does not limit it in any way -- it is only driven from R.
The code to do this is basically just two lines.  You don't
have to install database software (it's included with the RSQLite package)
and you don't have to set up a database at all -- it does that for you
automatically.
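Assuming the sqldf package is installed, the two lines referred to are presumably along these lines (the file name and sample size here are hypothetical, not from the thread):

```r
library(sqldf)
# read.csv.sql loads the file straight into a transparently created SQLite
# database (referring to it as "file" in the SQL) and returns to R only the
# rows that the SQL statement selects
MyData <- read.csv.sql("big.csv",
  sql = "select * from file order by random() limit 10000")
```

Only the 10,000 sampled rows ever cross into R's memory; the full file lives in the temporary SQLite database.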

See example 6e on the home page which creates a database
transparently, reads in the data and extracts random rows from
the database into R:
http://sqldf.googlecode.com
On Fri, Sep 26, 2008 at 3:55 PM, zerfetzen <zerfetzen at yahoo.com> wrote:
#
Thank you Gabor, this is fantastic, easy to use and so powerful.  I was
instantly able to do many things with .csv files that are much too large for
my PC's memory.  This is clearly my new favorite way to read in data, I love
it!

Is it possible to use sqldf with a fixed width format that requires a file
layout?

For example, let's say you have a .dat file called madeup.dat, without a
header row.  The hypothetical file madeup.dat for discussion has 3 variables
(state, zipcode, and score), is 10 characters wide, and has 20 rows (again,
just a made-up file).

Here is my fumbling attempt at code that will read in only state and score,
and randomly select 10 obs:

library(sqldf)

# Source pulls in the development version of sqldf.
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")

# Open a connection to that file.
MyConnection <- file("madeup.dat")

# Read in only state and score variables, and randomly select only 10 rows.
MyData <- sqldf("select state, score from MyConnection
                 order by random(*) limit 10")

# I think everything about this would work, except that it currently has no
# way to know which text columns to read for the state variable (columns 1-2),
# that the columns for zipcode (3-7) should be ignored, and that score
# (columns 8-10) should be included.  If I have overlooked this, I apologize.
# Thank you.
#
There are no built-in facilities for fixed column widths, but it's not hard
to parse out the fields yourself using the SQLite substr function.

I've added example 6f to the sqldf home page which illustrates this.

http://sqldf.googlecode.com
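The pattern in that example is roughly the following: let sqldf read each line of the file as a single text column (V1) and carve out the fields with SQLite's substr(X, start, length). This sketch follows the made-up madeup.dat layout above; exact details may differ from the published example 6f:

```r
library(sqldf)
# Each line of the fixed-width file becomes one text value, V1
con <- file("madeup.dat")
MyData <- sqldf("select substr(V1, 1, 2) state,
                        substr(V1, 8, 3) score
                 from   con
                 order by random() limit 10")
```

As with the .csv case, the parsing happens inside SQLite, so the full file never has to fit in R's memory.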
On Tue, Sep 30, 2008 at 5:18 PM, zerfetzen <zerfetzen at yahoo.com> wrote: