I'm finding that readLines() and read.fwf() take nearly two hours to work through a 3.5 GB file, even when reading in large (100 MB) chunks. The unix command wc by contrast processes the same file in three minutes. Is there a faster way to read files in R? Thanks!
Reading large files quickly
8 messages · Gabor Grothendieck, Jakson A. Aquino, jim holtman +1 more
You could try it with sqldf and see if that is any faster.
It uses RSQLite/SQLite to read the data into a database without
going through R, and from there it reads all or a specified portion
into R. It requires two lines of code of the form:
f <- file("myfile.dat")
DF <- sqldf("select * from f", dbname = tempfile())
with appropriate modification to specify the format of your file and
possibly to indicate a portion only. See example 6 on the sqldf
home page: http://sqldf.googlecode.com
and ?sqldf
On Sat, May 9, 2009 at 12:25 PM, Rob Steele
<freenx.10.robsteele at xoxy.net> wrote:
I'm finding that readLines() and read.fwf() take nearly two hours to work through a 3.5 GB file, even when reading in large (100 MB) chunks. The unix command wc by contrast processes the same file in three minutes. Is there a faster way to read files in R? Thanks!
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Rob Steele wrote:
I'm finding that readLines() and read.fwf() take nearly two hours to work through a 3.5 GB file, even when reading in large (100 MB) chunks. The unix command wc by contrast processes the same file in three minutes. Is there a faster way to read files in R?
I use statist to convert the fixed width data file into a csv file
because read.table() is considerably faster than read.fwf(). For example:
system("statist --na-string NA --xcols collist big.txt big.csv")
bigdf <- read.table(file = "big.csv", header = TRUE, as.is = TRUE)
The file collist is a text file whose lines contain the following
information:
variable begin end
where "variable" is the column name, and "begin" and "end" are integer
numbers indicating where in big.txt the columns begin and end.
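For instance (the variable names and positions below are invented for illustration; a real collist would use the actual layout of big.txt), a collist describing a file with three fixed-width fields might look like:

```
id     1  8
name   9 28
score 29 34
```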
Statist can be downloaded from: http://statist.wald.intevation.org/
Jakson Aquino
Social Sciences Department
Federal University of Ceará, Brazil
Thanks guys, good suggestions. To clarify, I'm running on a fast multi-core server with 16 GB RAM under 64 bit CentOS 5 and R 2.8.1. Paging shouldn't be an issue since I'm reading in chunks and not trying to store the whole file in memory at once. Thanks again.
Rob Steele wrote:
I'm finding that readLines() and read.fwf() take nearly two hours to work through a 3.5 GB file, even when reading in large (100 MB) chunks. The unix command wc by contrast processes the same file in three minutes. Is there a faster way to read files in R? Thanks!
At the moment I'm just reading the large file to see how fast it goes. Eventually, if I can get the read time down, I'll write out a processed version. Thanks for suggesting scan(); I'll try it. Rob
jim holtman wrote:
Since you are reading it in chunks, I assume that you are writing out each segment as you read it in. How are you writing it out to save it? Is the time you are quoting both the reading and the writing? If so, can you break down the differences in what these operations are taking?

How do you plan to use the data? Is it all numeric? Are you keeping it in a dataframe? Have you considered using 'scan' to read in the data and to specify what the columns are?

If you would like some more help, the answers to these questions will help.

On Sat, May 9, 2009 at 10:09 PM, Rob Steele <freenx.10.robsteele at xoxy.net> wrote:
Thanks guys, good suggestions. To clarify, I'm running on a fast multi-core server with 16 GB RAM under 64 bit CentOS 5 and R 2.8.1. Paging shouldn't be an issue since I'm reading in chunks and not trying to store the whole file in memory at once. Thanks again. Rob Steele wrote:
I'm finding that readLines() and read.fwf() take nearly two hours to work through a 3.5 GB file, even when reading in large (100 MB) chunks. The unix command wc by contrast processes the same file in three minutes. Is there a faster way to read files in R? Thanks!
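A minimal sketch of the chunked scan() approach jim describes. The sample data, the three-numeric-column 'what' template, and the tiny chunk size are all placeholders; a real run would point the connection at the 3.5 GB file and use a much larger nmax.

```r
## Write a small sample file standing in for the real data.
path <- tempfile()
writeLines(c("1 2 3", "4 5 6", "7 8 9", "10 11 12"), path)

con <- file(path, open = "r")
total <- 0
repeat {
  ## With a list 'what', nmax is the maximum number of records per call;
  ## scan() resumes from where it stopped on an open connection.
  chunk <- scan(con, what = list(a = numeric(), b = numeric(), c = numeric()),
                nmax = 2, quiet = TRUE)
  if (length(chunk[[1]]) == 0) break
  total <- total + sum(unlist(chunk))   # stand-in for real per-chunk processing
}
close(con)
total   # 78 for the sample data above
```

Specifying the column types up front in 'what' is what lets scan() skip the type-guessing that slows down read.table() and friends.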
Rob Steele wrote:
I'm finding that readLines() and read.fwf() take nearly two hours to work through a 3.5 GB file, even when reading in large (100 MB) chunks. The unix command wc by contrast processes the same file in three minutes. Is there a faster way to read files in R? Thanks!
readChar() is fast. I use strsplit(..., fixed = TRUE) to separate the input data into lines and then use substr() to separate the lines into fields. I do a little light processing and write the result back out with writeChar(). The whole thing takes thirty minutes where read.fwf() took nearly two hours just to read the data. Thanks for the help!
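A runnable miniature of that readChar()/strsplit()/substr() pipeline. The six-character records and the field positions (1-3 and 4-6) are invented for illustration; a real file would substitute its own layout and a large nchars per read.

```r
## Write a tiny fixed-width sample file.
path <- tempfile()
writeLines(c("abc123", "def456"), path)

## Slurp the file as one string (here all at once; a big file
## would be read in large fixed-size pieces instead).
txt <- readChar(path, nchars = file.info(path)$size, useBytes = TRUE)

## Split into lines, then carve each line into fixed-width fields.
lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
lines <- sub("\r$", "", lines)   # drop a stray CR if line endings are CRLF
field1 <- substr(lines, 1, 3)    # first fixed-width field
field2 <- substr(lines, 4, 6)    # second fixed-width field
```

Because substr() is vectorized, each field of every line in the chunk is extracted in a single call, which is where most of the speedup over read.fwf() comes from.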