Reading large files quickly

8 messages · Gabor Grothendieck, Jakson A. Aquino, jim holtman +1 more

#
I'm finding that readLines() and read.fwf() take nearly two hours to
work through a 3.5 GB file, even when reading in large (100 MB) chunks.
The Unix command wc, by contrast, processes the same file in three
minutes.  Is there a faster way to read files in R?
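For concreteness, the chunked approach described above, sketched on a tiny stand-in file (the file name and chunk size are placeholders):

```r
# A minimal sketch of chunked reading with an open connection, so each
# readLines() call resumes where the previous one stopped.
path <- tempfile()
writeLines(sprintf("record %d", 1:10), path)  # tiny stand-in for the 3.5 GB file

con <- file(path, open = "r")
total <- 0
repeat {
  chunk <- readLines(con, n = 4)      # read a fixed number of lines per chunk
  if (length(chunk) == 0) break
  total <- total + length(chunk)      # ... real processing would go here ...
}
close(con)
total                                 # 10
```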

Thanks!
#
You could try it with sqldf and see if that is any faster.
It uses RSQLite/SQLite to read the data into a database without
going through R, and from there it reads all of it, or a specified
portion, into R.  It requires two lines of code of the form:

f <- file("myfile.dat")
DF <- sqldf("select * from f", dbname = tempfile())

with appropriate modification to specify the format of your file and
possibly to indicate a portion only.  See example 6 on the sqldf
home page: http://sqldf.googlecode.com
and ?sqldf
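A runnable sketch along those lines, assuming the sqldf package is installed and the data is comma separated (the file contents here are placeholders); read.csv.sql() routes the parsing through SQLite rather than R:

```r
library(sqldf)

# Build a tiny stand-in csv file to read.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3, y = c("a", "b", "c")), path, row.names = FALSE)

# Load it via SQLite; the table name "file" is the read.csv.sql default.
DF <- read.csv.sql(path, sql = "select * from file", dbname = tempfile())
```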


On Sat, May 9, 2009 at 12:25 PM, Rob Steele
<freenx.10.robsteele at xoxy.net> wrote:
#
Rob Steele wrote:
I use statist to convert the fixed width data file into a csv file
because read.table() is considerably faster than read.fwf(). For example:

system("statist --na-string NA --xcols collist big.txt big.csv")
bigdf <- read.table(file = "big.csv", header = TRUE, as.is = TRUE)

The file collist is a text file whose lines contain the following
information:

variable begin end

where "variable" is the column name, and "begin" and "end" are integer
numbers indicating where in big.txt the columns begin and end.
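For example, a collist file describing an id field in columns 1-8 and an amount field in columns 9-20 might look like this (the names and positions are illustrative, not from the original post):

```
id 1 8
amount 9 20
```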

Statist can be downloaded from: http://statist.wald.intevation.org/
#
Thanks guys, good suggestions.  To clarify, I'm running on a fast
multi-core server with 16 GB RAM under 64-bit CentOS 5 and R 2.8.1.
Paging shouldn't be an issue since I'm reading in chunks and not trying
to store the whole file in memory at once.  Thanks again.
Rob Steele wrote:
#
At the moment I'm just reading the large file to see how fast it goes.
Eventually, if I can get the read time down, I'll write out a processed
version.  Thanks for suggesting scan(); I'll try it.
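The scan() suggestion might look something like this on a whitespace-delimited file, assuming the column types are known up front (the field names and types are placeholders); declaring the types in `what` avoids the type-guessing overhead of read.table():

```r
# Tiny stand-in for a whitespace-delimited data file.
path <- tempfile()
writeLines(c("1 a", "2 b", "3 c"), path)

# scan() reads straight into the declared types.
dat <- scan(path, what = list(id = integer(0), code = character(0)),
            quiet = TRUE)
dat$id    # 1 2 3
```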

Rob
jim holtman wrote:
#
Rob Steele wrote:
readChar() is fast.  I use strsplit(..., fixed = TRUE) to separate the
input data into lines and then use substr() to separate the lines into
fields.  I do a little light processing and write the result back out
with writeChar().  The whole thing takes thirty minutes where read.fwf()
took nearly two hours just to read the data.
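A minimal sketch of that readChar()/strsplit()/substr() pipeline on a tiny fixed-width stand-in (the field positions and names are illustrative):

```r
# Two fixed-width records: a 4-character id then a 5-character name.
path <- tempfile()
writeLines(c("0001alpha", "0002beta "), path)

# Slurp the whole file as one string, split into lines, then slice fields.
txt   <- readChar(path, file.info(path)$size, useBytes = TRUE)
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
id    <- substr(lines, 1, 4)
name  <- trimws(substr(lines, 5, 9))
id    # "0001" "0002"
```

On a real multi-gigabyte file the readChar() call would be repeated in fixed-size blocks rather than slurping everything at once.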

Thanks for the help!