read in large data file (tsv) with inline filter?
6 messages · David Reiss, Dylan Beaudette, Thomas Lumley +2 more
On Monday 23 March 2009, David Reiss wrote:
I have a very large tab-delimited file, too big to store in memory via readLines() or read.delim(). It turns out I only need a few hundred of those lines to be read in. If it were not so large, I could read the entire file in and "grep" the lines I need. For such a large file, many calls to read.delim() with incrementing "skip" and "nrows" parameters, followed by grep() calls, is very slow. I am aware of possibilities via SQLite; I would prefer not to use that in this case.

My question is: is there a function for efficiently reading in a file along the lines of read.delim(), which allows me to specify a filter (via grep or something else) that tells the function to only read in certain lines that match? If not, I would *love* to see a "filter" parameter added as an option to read.delim() and/or readLines().

thanks for any pointers. --David
How about pre-filtering before loading the data into R:

grep -E 'your pattern here' your_file_here > your_filtered_file

Alternatively, if you need to search within fields, see 'awk' and 'cut'; if you need to delete things, see 'tr'. These tools come with any Unix-like OS, and you can probably get them on Windows without much effort. Cheers, Dylan
Dylan Beaudette Soil Resource Laboratory http://casoilresource.lawr.ucdavis.edu/ University of California at Davis 530.754.7341
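Dylan's pre-filter can also be driven from R itself so no manual shell step is needed. A minimal sketch (not from the original thread; assumes a Unix 'grep' on the PATH, and the demo file and the pattern '^ap' are placeholders standing in for the real, very large TSV):

```r
## Demo file standing in for the large TSV
big <- tempfile(fileext = ".tsv")
writeLines(c("apple\t1", "banana\t2", "apricot\t3"), big)

## Pre-filter with grep into a much smaller file, then read only that
filtered <- tempfile(fileext = ".tsv")
system(paste("grep -E '^ap'", shQuote(big), ">", shQuote(filtered)))

df <- read.delim(filtered, header = FALSE, sep = "\t")
nrow(df)  # 2 of the 3 rows survive the filter
```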
On Mon, 23 Mar 2009, David Reiss wrote:
I have a very large tab-delimited file, too big to store in memory via readLines() or read.delim(). It turns out I only need a few hundred of those lines to be read in. If it were not so large, I could read the entire file in and "grep" the lines I need. For such a large file, many calls to read.delim() with incrementing "skip" and "nrows" parameters, followed by grep() calls, is very slow.
You certainly don't want to use repeated reads from the start of the file with skip=, but if you set up a file connection
fileconnection <- file("my.tsv", open="r")
you can read from it incrementally with readLines() or read.delim() without going back to the start each time.
The speed of this approach should be within a reasonable constant factor of anything else, since reading the file once is unavoidable and should be the bottleneck.
-thomas
Thomas Lumley Assoc. Professor, Biostatistics
tlumley at u.washington.edu University of Washington, Seattle
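Thomas's incremental approach might be sketched as follows (not from the original thread; the function name, pattern, and chunk size are illustrative placeholders): open the connection once, read in chunks, and keep only the matching lines.

```r
## One pass over the file: read chunks from a single open connection
## and grep-filter each chunk as it arrives, so memory use stays
## bounded by the chunk size plus the matches.
read_matching <- function(path, pattern, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  hits <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break   # end of file
    hits <- c(hits, grep(pattern, chunk, value = TRUE))
  }
  hits
}

## The few hundred matching lines can then be parsed in one go, e.g.
## read.delim(textConnection(read_matching("my.tsv", "mypattern")))
```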
On Mon, Mar 23, 2009 at 5:53 PM, David Reiss <dreiss at systemsbiology.org> wrote:
I have a very large tab-delimited file, too big to store in memory via readLines() or read.delim(). It turns out I only need a few hundred of those lines to be read in. If it were not so large, I could read the entire file in and "grep" the lines I need. For such a large file, many calls to read.delim() with incrementing "skip" and "nrows" parameters, followed by grep() calls, is very slow. I am aware of possibilities via SQLite; I would prefer not to use that in this case.
Why the restriction? Using sqldf with SQLite it's only
two R statements.
The first statement defines the name of the file, and the
second statement defines what you want to extract from
it and its format. First create a sample file using the built-in
data frame BOD, and let's specify tab delimiters. A header,
i.e. col.names = TRUE, is the default for write.table:
# create tab delimited test file with headers
write.table(BOD, file = "myfile.dat", sep = "\t")
# now read in only those records satisfying a condition
library(sqldf)
# here are the two statements
myfile <- file("myfile.dat")
DF <- sqldf("select * from myfile where demand < 10 or demand > 15",
file.format = list(sep = "\t"))
See examples on the home page http://sqldf.googlecode.com and
see ?sqldf for info on the file.format argument in case your format differs.
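If the SQLite restriction can be relaxed, the original grep-style "only lines that match" request also maps onto SQL's LIKE operator. A hedged variant of the example above (assumes the sqldf package and its dependencies are installed; the '%9%' condition is purely illustrative):

```r
library(sqldf)

## Same demo file as in the example above
write.table(BOD, file = "myfile.dat", sep = "\t")

## LIKE plays the role of grep: keep rows whose demand field
## contains the digit 9
myfile <- file("myfile.dat")
DF <- sqldf("select * from myfile where demand like '%9%'",
            file.format = list(sep = "\t"))
```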
Hi David,
On 23 March 2009 at 15:09, Dylan Beaudette wrote:
| On Monday 23 March 2009, David Reiss wrote:
| > I have a very large tab-delimited file, too big to store in memory via
| > readLines() or read.delim(). Turns out I only need a few hundred of those
| > lines to be read in. If it were not so large, I could read the entire file
| > in and "grep" the lines I need. For such a large file, many calls to
| > read.delim() with incrementing "skip" and "nrows" parameters, followed by
| > grep() calls is very slow. I am aware of possibilities via SQLite; I would
| > prefer to not use that in this case.
| >
| > My question is...Is there a function for efficiently reading in a file
| > along the lines of read.delim(), which allows me to specify a filter (via
| > grep or something else) that tells the function to only read in certain
| > lines that match?
| >
| > If not, I would *love* to see a "filter" parameter added as an option to
| > read.delim() and/or readLines().
|
| How about pre-filtering before loading the data into R:
|
| grep -E 'your pattern here' your_file_here > your_filtered_file
|
| alternatively if you need to search in fields, see 'awk', and 'cut', or if you
| need to delete things see 'tr'.
|
| These tools come with any unix-like OS, and you can probably get them on
| windows without much effort.
Also note that read.delim() and friends all read from connections, and 'piped
expressions' (in the Unix shell command sense) can provide a source.
That way you can build an ad-hoc filter extension by running readLines() over
a pipe() connection. Consider this trivial example of grepping out Section
headers from the R FAQ. Each heading appears twice because it occurs both in
the Table of Contents and as an actual section header:
R> readLines( pipe("awk '/^[0-9+] / {print $1, $2, $3}' src/debian/R/R-alpha.20090320/doc/FAQ") )
[1] "1 Introduction " "2 R Basics" "3 R and" "4 R Web"
[5] "5 R Add-On" "6 R and" "7 R Miscellanea" "8 R Programming"
[9] "9 R Bugs" "1 Introduction " "2 R Basics" "3 R and"
[13] "4 R Web" "5 R Add-On" "6 R and" "7 R Miscellanea"
[17] "8 R Programming" "9 R Bugs"
R>
The regexp is simply 'digits at start of line followed by space' which skips
subsections like 1.1, 1.2, ...
Hth, Dirk
Three out of two people have difficulties with fractions.
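Building on Dirk's pipe() idea, a hedged sketch (not from the original thread; assumes a Unix 'awk' on the PATH, and the demo file stands in for the real data) that feeds the filtered stream straight into read.delim(), so the shell does the row selection and R does the parsing:

```r
## Demo TSV standing in for the large file
tmp <- tempfile(fileext = ".tsv")
writeLines(c("id\tscore", "a1\t10", "b2\t20", "a3\t30"), tmp)

## awk keeps the header line (NR == 1) plus rows whose id starts
## with "a"; pattern-only awk rules print matching lines by default
cmd <- sprintf("awk 'NR == 1 || /^a/' %s", shQuote(tmp))
df  <- read.delim(pipe(cmd))
nrow(df)  # 2 matching rows, parsed with their header intact
```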