Message-ID: <Pine.LNX.4.43.0903240105350.11646@hymn11.u.washington.edu>
Date: 2009-03-24T08:05:35Z
From: Thomas Lumley
Subject: read in large data file (tsv) with inline filter?
In-Reply-To: <fd913b0d0903231453j214fe992lfc82ef95ac47b566@mail.gmail.com>
On Mon, 23 Mar 2009, David Reiss wrote:
> I have a very large tab-delimited file, too big to store in memory via
> readLines() or read.delim(). Turns out I only need a few hundred of those
> lines to be read in. If it were not so large, I could read the entire file
> in and "grep" the lines I need. For such a large file, many calls to
> read.delim() with incrementing "skip" and "nrows" parameters, followed by
> grep() calls, are very slow.
You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection
fileconnection <- file("my.tsv", open="r")
you can read from it incrementally with readLines() or read.delim() without going back to the start each time.
The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.
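A minimal sketch of the incremental read-and-filter loop described above. The file name, the chunk size n, and the pattern "keepme" are illustrative stand-ins for the real data and filter (a small temporary file is written here so the example is self-contained):

```r
## Create a small stand-in TSV so the sketch runs on its own.
tsv <- tempfile(fileext = ".tsv")
writeLines(c("id\tvalue", "a\t1", "keepme\t2", "b\t3", "keepme\t4"), tsv)

## Open the connection once; successive readLines() calls pick up
## where the previous one left off, with no rereading from the start.
con <- file(tsv, open = "r")
matched <- character(0)
repeat {
  chunk <- readLines(con, n = 2)        # small n just for the demo
  if (length(chunk) == 0) break         # end of file reached
  matched <- c(matched, grep("keepme", chunk, value = TRUE))
}
close(con)

## 'matched' now holds only the lines of interest; parse just those:
df <- read.delim(textConnection(matched), header = FALSE, sep = "\t")
```

In practice n would be in the tens or hundreds of thousands, chosen so each chunk fits comfortably in memory.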
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle