Speeding reading of large file

Wed, Nov 28, 2012 12:14 PM

Once you've read the data with readLines()

head(raw) gets you the first 6 lines or
head(nchar(raw)) the lengths of those lines

Once you know which row has the headers (here it is 2), just use

names(dta) <- read.table(text=raw[2], stringsAsFactors=FALSE)

to add column names to the data frame.

-------
David

-----Original Message-----
From: Fisher Dennis [mailto:fisher at plessthan.com]
Sent: Wednesday, November 28, 2012 1:42 PM
To: dcarlson at tamu.edu
Cc: r-help at r-project.org
Subject: Re: [R] Speeding reading of large file

An interesting approach -- I lose the column names (which I need) but I
could get them with something cute such as:
	1.  read the first few lines only with readLines(FILENAME, n=10)
	2.  use your approach to read.table -- this will grab the column
names
	3.  replace the headers in the full version with the correct
column names

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

On Nov 28, 2012, at 11:32 AM, David L Carlson wrote:

Using your first approach, this should be faster

raw <- readLines(con=filename)
dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE)

----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
project.org] On Behalf Of Fisher Dennis
Sent: Wednesday, November 28, 2012 11:43 AM
To: r-help at r-project.org
Subject: [R] Speeding reading of large file

R 2.15.1
OS X and Windows

Colleagues,

I have a file that looks that this:
TABLE NO.  1
PTID        TIME        AMT         FORM        PERIOD      IPRED
CWRES       EVID        CP          PRED        RES         WRES
 2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00

0.0000E+00

0.0000E+00
 2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00

0.0000E+00

0.0000E+00
 2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00

0.0000E+00

0.0000E+00
 2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00

0.0000E+00

0.0000E+00
 2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01

0.0000E+00

0.0000E+00
 2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01

0.0000E+00

0.0000E+00
 2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01

0.0000E+00

0.0000E+00
 2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01

0.0000E+00

0.0000E+00

The file is reasonably large (> 10^6 lines) and the two line header

is

repeated periodically in the file.
I need to read this file in as a data frame.  Note that the number

of

columns, the column headers, and the number of replicates of the
headers are not known in advance.

I have tried two approaches to this:
	First Approach:
		1.  readLines(FILENAME) to read in the file
		2.  use grep to find the repeat headers; strip out the
repeat headers
		3.  write() the object to tempfile, read in that temporary
file using read.table(tempfile, header=TRUE, skip=1) [an alternative

is

to use textConnection but that does not appear to speed things]

	Second Approach:
		1.  TEMP	<- read.table(FILENAME, header=TRUE, skip=1,
fill=TRUE, as.is=TRUE)
		2.  get rid of the errant entries with:
			TEMP[!is.na(as.numeric(TEMP[,1])),]
		3.  reading of the character entries forced all columns to
character mode.  Therefore, I convert each column to numeric:
			for (COL in 1:ncol(TEMP)) TEMP[,COL] <-
as.numeric(TEMP[,COL])
The second approach is ~ 20% faster than the first.  With the second
approach, the conversion to numeric occupies 50% of the elapsed

time.

Is there some approach that would be much faster?  For example,

would a

vectorized approach to conversion to numeric improve throughput?

Or,

is there some means to ensure that all data are read as numeric (I
tried to use colClasses but that triggered an error when the text
string was encountered).

############################
A dput version of the data is:
c("TABLE NO.  1", " PTID        TIME        AMT         FORM
PERIOD      IPRED       CWRES       EVID        CP          PRED
RES         WRES",
"  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00

0.0000E+00

0.0000E+00",
"  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00

0.0000E+00

0.0000E+00",
"  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00

0.0000E+00

0.0000E+00",
"  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00

0.0000E+00

0.0000E+00",
"  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01

0.0000E+00

0.0000E+00",
"  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01

0.0000E+00

0.0000E+00",
"  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01

0.0000E+00

0.0000E+00",
"  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01

0.0000E+00

0.0000E+00"
)

This can be assembled into a large dataset and written to a file

named

FILENAME with the following code:
cat(c("TABLE NO.  1", " PTID        TIME        AMT         FORM
PERIOD      IPRED       CWRES       EVID        CP          PRED
RES         WRES",
"  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00
0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00

0.0000E+00

0.0000E+00",
"  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00
3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00

0.0000E+00

0.0000E+00",
"  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00

0.0000E+00

0.0000E+00",
"  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00
8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00

0.0000E+00

0.0000E+00",
"  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01

0.0000E+00

0.0000E+00",
"  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01

0.0000E+00

0.0000E+00",
"  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01

0.0000E+00

0.0000E+00",
"  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00
1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01

0.0000E+00

0.0000E+00"
)[rep(1:10, 1000)], file="FILENAME", sep="\n")


Dennis


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-
guide.html
and provide commented, minimal, self-contained, reproducible code.

Speeding reading of large file

Thread (8 messages)