Message-ID: <Pine.LNX.4.43.0903240105350.11646@hymn11.u.washington.edu>
Date: 2009-03-24T08:05:35Z
From: Thomas Lumley
Subject: read in large data file (tsv) with inline filter?
In-Reply-To: <fd913b0d0903231453j214fe992lfc82ef95ac47b566@mail.gmail.com>
On Mon, 23 Mar 2009, David Reiss wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). Turns out I only need a few hundred of those
> lines to be read in. If it were not so large, I could read the entire file
> in and "grep" the lines I need. For such a large file, many calls to
> read.delim() with incrementing "skip" and "nrows" parameters, followed by
> grep() calls, are very slow.
You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection
fileconnection <- file("my.tsv", open="r")
you can read from it incrementally with readLines() or read.delim() without going back to the start each time.
The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.
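A minimal sketch of the incremental read-and-filter loop described above. The file name, the chunk size n, and the pattern "keepme" are illustrative stand-ins for the real data and filter (a small temporary file is written here so the example is self-contained):

```r
## Create a small stand-in TSV so the sketch runs on its own.
tsv <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "a\t1", "keepme\t2", "b\t3", "keepme\t4"), tsv)

## Open the connection once; successive readLines() calls pick up
## where the previous one left off, with no rereading from the start.
con <- file(tsv, open = "r")
matched <- character(0)
repeat {
  chunk <- readLines(con, n = 2)        # small n just for the demo
  if (length(chunk) == 0) break         # end of file reached
  matched <- c(matched, grep("keepme", chunk, value = TRUE))
}
close(con)

## 'matched' now holds only the lines of interest; parse just those:
df <- read.delim(textConnection(matched), header = FALSE, sep = "\t")
```

In practice n would be in the tens or hundreds of thousands, chosen so each chunk fits comfortably in memory.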
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle