
read in large data file (tsv) with inline filter?

6 messages · David Reiss, Dylan Beaudette, Thomas Lumley +2 more

#
On Monday 23 March 2009, David Reiss wrote:
How about pre-filtering before loading the data into R:

grep -E 'your pattern here' your_file_here > your_filtered_file

Alternatively, if you need to search within fields, see 'awk' and 'cut'; if you 
need to delete characters, see 'tr'.

These tools come with any Unix-like OS, and you can probably get them on 
Windows without much effort.


Cheers,
Dylan
#
On Mon, 23 Mar 2009, David Reiss wrote:

You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection
    fileconnection <- file("my.tsv", open="r")
you can read from it incrementally with readLines() or read.delim() without going back to the start each time.

The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.
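
A minimal sketch of this chunked approach (the file name, pattern, and chunk size are illustrative, and the example writes its own small test file first):

```r
# Stand-in for the large tab-delimited file
writeLines(c("id\tvalue", "a\t1", "b\t2", "a\t3"), "example.tsv")

# Open the connection once, read in fixed-size chunks,
# and keep only the lines matching a pattern
con <- file("example.tsv", open = "r")
matched <- character(0)
repeat {
  chunk <- readLines(con, n = 2)            # next block of lines
  if (length(chunk) == 0) break             # end of file
  matched <- c(matched, grep("^a\t", chunk, value = TRUE))
}
close(con)

# Parse the surviving lines as tab-delimited fields
DF <- read.delim(textConnection(matched), header = FALSE, sep = "\t")
```

In a real run you would use a much larger n (say 10000) so that I/O, not R looping, dominates; note the header line survives only if it happens to match the pattern.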

       -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
On Mon, Mar 23, 2009 at 5:53 PM, David Reiss <dreiss at systemsbiology.org> wrote:
Why the restriction?  Using sqldf with SQLite it's only
two R statements.

The first statement defines the name of the file and the
second statement defines what you want to extract from
it and its format.  First create a sample file using the built-in
data frame BOD, and let's specify tab delimiters.  A header,
i.e. col.names = TRUE, is the default for write.table:

# create tab delimited test file with headers
write.table(BOD, file = "myfile.dat", sep = "\t")

# now read in only those records satisfying a condition
library(sqldf)

# here are the two statements
myfile <- file("myfile.dat")
DF <- sqldf("select * from myfile where demand < 10 or demand > 15",
  file.format = list(sep = "\t"))

See examples on the home page http://sqldf.googlecode.com and
see ?sqldf for info on the file.format argument in case your format differs.
#
Hi David,
On 23 March 2009 at 15:09, Dylan Beaudette wrote:
| On Monday 23 March 2009, David Reiss wrote:
| > I have a very large tab-delimited file, too big to store in memory via
| > readLines() or read.delim(). Turns out I only need a few hundred of those
| > lines to be read in. If it were not so large, I could read the entire file
| > in and "grep" the lines I need. For such a large file; many calls to
| > read.delim() with incrementing "skip" and "nrows" parameters, followed by
| > grep() calls is very slow. I am aware of possibilities via SQLite; I would
| > prefer to not use that in this case.
| >
| > My question is...Is there a function for efficiently reading in a file
| > along the lines of read.delim(), which allows me to specify a filter (via
| > grep or something else) that tells the function to only read in certain
| > lines that match?
| >
| > If not, I would *love* to see a "filter" parameter added as an option to
| > read.delim() and/or readLines().
| 
| How about pre-filtering before loading the data into R:
| 
| grep -E 'your pattern here' your_file_here > your_filtered_file
| 
| alternatively if you need to search in fields, see 'awk', and 'cut', or if you 
| need to delete things see 'tr'.
| 
| These tools come with any unix-like OS, and you can probably get them on 
| windows without much effort.

Also note that read.delim() and friends all read from connections, and 'piped
expressions' (in the Unix shell command sense) can provide a source.

That way you can build an ad-hoc filter extension by running readLines() over
a pipe() connection.  Consider this trivial example of grepping out Section
headers from the R FAQ.  Each entry appears twice because the pattern matches
both the Table of Contents and the section headers themselves:

R> readLines( pipe("awk '/^[0-9+] / {print $1, $2, $3}' src/debian/R/R-alpha.20090320/doc/FAQ") )
 [1] "1 Introduction " "2 R Basics"      "3 R and"         "4 R Web"        
 [5] "5 R Add-On"      "6 R and"         "7 R Miscellanea" "8 R Programming"
 [9] "9 R Bugs"        "1 Introduction " "2 R Basics"      "3 R and"        
[13] "4 R Web"         "5 R Add-On"      "6 R and"         "7 R Miscellanea"
[17] "8 R Programming" "9 R Bugs"       
R> 

The regexp is simply 'digits at start of line followed by space' which skips
subsections like 1.1, 1.2, ...
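
The same mechanism answers the original question directly: hand a grep (or awk) pipeline to pipe() and pass the connection to read.delim(), so only matching lines are ever parsed by R. A small sketch with an illustrative file and pattern (requires grep on the PATH):

```r
# Stand-in data file; in practice this is the huge .tsv
writeLines(c("a\t1", "b\t2", "a\t3"), "example.tsv")

# grep filters the file outside R; read.delim parses only the survivors
DF <- read.delim(pipe("grep '^a' example.tsv"),
                 header = FALSE, sep = "\t")
```

This keeps memory use proportional to the matched subset rather than to the whole file.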

Hth, Dirk