
read in large data file (tsv) with inline filter?

6 messages · David Reiss, Dylan Beaudette, Thomas Lumley +2 more

#
On Monday 23 March 2009, David Reiss wrote:
How about pre-filtering before loading the data into R:

grep -E 'your pattern here' your_file_here > your_filtered_file

Alternatively, if you need to search within fields, see 'awk' and 'cut'; if you 
need to delete characters, see 'tr'.

These tools come with any Unix-like OS, and you can probably get them on 
Windows without much effort.


Cheers,
Dylan
#
On Mon, 23 Mar 2009, David Reiss wrote:

You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection
    fileconnection <- file("my.tsv", open="r")
you can read from it incrementally with readLines() or read.delim() without going back to the start each time.

The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.
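
A minimal sketch of this chunked approach (the file name, pattern, and chunk size are illustrative, and the example writes its own small test file first):

```r
# Stand-in for the large tab-delimited file
writeLines(c("id\tvalue", "a\t1", "b\t2", "a\t3"), "example.tsv")

# Open the connection once, read in fixed-size chunks,
# and keep only the lines matching a pattern
con <- file("example.tsv", open = "r")
matched <- character(0)
repeat {
  chunk <- readLines(con, n = 2)            # next block of lines
  if (length(chunk) == 0) break             # end of file
  matched <- c(matched, grep("^a\t", chunk, value = TRUE))
}
close(con)

# Parse the surviving lines as tab-delimited fields
DF <- read.delim(textConnection(matched), header = FALSE, sep = "\t")
```

In a real run you would use a much larger n (say 10000) so that I/O, not R looping, dominates; note the header line survives only if it happens to match the pattern.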

       -thomas

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle
#
On Mon, Mar 23, 2009 at 5:53 PM, David Reiss <dreiss at systemsbiology.org> wrote:
Why the restriction?  Using sqldf with SQLite it's only
two R statements.

The first statement defines the name of the file and the
second statement defines what you want to extract from
it and its format.  First create a sample file using the built-in
data frame BOD, and let's specify tab delimiters.  A header,
i.e. col.names = TRUE, is the default for write.table:

# create tab delimited test file with headers
write.table(BOD, file = "myfile.dat", sep = "\t")

# now read in only those records satisfying a condition
library(sqldf)

# here are the two statements
myfile <- file("myfile.dat")
DF <- sqldf("select * from myfile where demand < 10 or demand > 15",
  file.format = list(sep = "\t"))

See examples on the home page http://sqldf.googlecode.com and
see ?sqldf for info on the file.format argument in case your format differs.
#
Hi David,
On 23 March 2009 at 15:09, Dylan Beaudette wrote:
| On Monday 23 March 2009, David Reiss wrote:
| > I have a very large tab-delimited file, too big to store in memory via
| > readLines() or read.delim(). Turns out I only need a few hundred of those
| > lines to be read in. If it were not so large, I could read the entire file
| > in and "grep" the lines I need. For such a large file; many calls to
| > read.delim() with incrementing "skip" and "nrows" parameters, followed by
| > grep() calls is very slow. I am aware of possibilities via SQLite; I would
| > prefer to not use that in this case.
| >
| > My question is...Is there a function for efficiently reading in a file
| > along the lines of read.delim(), which allows me to specify a filter (via
| > grep or something else) that tells the function to only read in certain
| > lines that match?
| >
| > If not, I would *love* to see a "filter" parameter added as an option to
| > read.delim() and/or readLines().
| 
| How about pre-filtering before loading the data into R:
| 
| grep -E 'your pattern here' your_file_here > your_filtered_file
| 
| alternatively if you need to search in fields, see 'awk', and 'cut', or if you 
| need to delete things see 'tr'.
| 
| These tools come with any unix-like OS, and you can probably get them on 
| windows without much effort.

Also note that read.delim() and friends all read from connections, and 'piped
expressions' (in the Unix shell command sense) can provide a source.

That way you can build an ad-hoc filter extension by running readLines() over
a pipe() connection.  Consider this trivial example of grepping out Section
headers from the R FAQ.  Each entry appears twice because the pattern matches
both the Table of Contents and the section headers themselves:

R> readLines( pipe("awk '/^[0-9+] / {print $1, $2, $3}' src/debian/R/R-alpha.20090320/doc/FAQ") )
 [1] "1 Introduction " "2 R Basics"      "3 R and"         "4 R Web"        
 [5] "5 R Add-On"      "6 R and"         "7 R Miscellanea" "8 R Programming"
 [9] "9 R Bugs"        "1 Introduction " "2 R Basics"      "3 R and"        
[13] "4 R Web"         "5 R Add-On"      "6 R and"         "7 R Miscellanea"
[17] "8 R Programming" "9 R Bugs"       
R> 

The regexp is simply 'digits at start of line followed by space' which skips
subsections like 1.1, 1.2, ...
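
The same mechanism answers the original question directly: hand a grep (or awk) pipeline to pipe() and pass the connection to read.delim(), so only matching lines are ever parsed by R. A small sketch with an illustrative file and pattern (requires grep on the PATH):

```r
# Stand-in data file; in practice this is the huge .tsv
writeLines(c("a\t1", "b\t2", "a\t3"), "example.tsv")

# grep filters the file outside R; read.delim parses only the survivors
DF <- read.delim(pipe("grep '^a' example.tsv"),
                 header = FALSE, sep = "\t")
```

This keeps memory use proportional to the matched subset rather than to the whole file.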

Hth, Dirk