R 2.15.1
OS X and Windows
Colleagues,
I have a file that looks like this:
TABLE NO. 1
PTID TIME AMT FORM PERIOD IPRED CWRES EVID CP PRED RES WRES
2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00 0.0000E+00
2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00 0.0000E+00
2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00 0.0000E+00
2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00 0.0000E+00
2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00 0.0000E+00
2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00 0.0000E+00
2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00 0.0000E+00
The file is reasonably large (> 10^6 lines) and the two-line header is repeated periodically in the file.
I need to read this file in as a data frame. Note that the number of columns, the column headers, and the number of replicates of the headers are not known in advance.
I have tried two approaches to this:
First Approach:
1. readLines(FILENAME) to read in the file
2. use grep to find the repeated headers and strip them out
3. write() the object to a temporary file, then read that file with read.table(tempfile, header=TRUE, skip=1) [an alternative is to use textConnection, but that does not appear to speed things up]
Second Approach:
1. TEMP <- read.table(FILENAME, header=TRUE, skip=1, fill=TRUE, as.is=TRUE)
2. get rid of the errant entries with:
TEMP <- TEMP[!is.na(as.numeric(TEMP[,1])), ]
3. reading of the character entries forced all columns to character mode. Therefore, I convert each column to numeric:
for (COL in 1:ncol(TEMP)) TEMP[,COL] <- as.numeric(TEMP[,COL])
The second approach is ~20% faster than the first. With the second approach, the conversion to numeric occupies 50% of the elapsed time.
Is there some approach that would be much faster? For example, would a vectorized approach to the numeric conversion improve throughput? Or is there some means to ensure that all data are read as numeric? (I tried to use colClasses, but that triggered an error when a text string was encountered.)
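For reference, the second approach can be sketched end-to-end on a tiny self-contained example (the sample lines and the tempfile below are illustrative, not the real FILENAME):

```r
# Write a tiny file with the same shape: title line, header line, data,
# then a repeated title/header block.
tmp <- tempfile()
writeLines(c("TABLE NO. 1",
             " PTID TIME AMT",
             " 2.0010E+03 3.9375E-01 5.0000E+03",
             "TABLE NO. 1",
             " PTID TIME AMT",
             " 2.0010E+03 8.9583E-01 5.0000E+03"), tmp)

# Second approach: read everything (mixed content forces character mode),
# drop rows whose first field is not numeric (the repeated headers),
# then convert the columns.
TEMP <- read.table(tmp, header = TRUE, skip = 1, fill = TRUE, as.is = TRUE)
TEMP <- TEMP[!is.na(suppressWarnings(as.numeric(TEMP[, 1]))), ]
TEMP[] <- lapply(TEMP, as.numeric)   # vectorized column conversion
```

suppressWarnings() only silences the NA-coercion warnings from the filtering step; it does not change the result.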
############################
A dput version of the data is:
c("TABLE NO. 1", " PTID TIME AMT FORM PERIOD IPRED CWRES EVID CP PRED RES WRES",
" 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00 0.0000E+00"
)
This can be assembled into a large dataset and written to a file named FILENAME with the following code:
cat(c("TABLE NO. 1", " PTID TIME AMT FORM PERIOD IPRED CWRES EVID CP PRED RES WRES",
" 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00 0.0000E+00"
)[rep(1:10, 1000)], file="FILENAME", sep="\n")
Dennis
Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
Speeding reading of large file
8 messages · Rui Barradas, David L Carlson, Dennis Fisher +2 more
Hello,

Try converting back to numeric with

TEMP[] <- lapply(TEMP, as.numeric)

In a small data frame (1e6 rows, 2 columns) it's about 3 times faster than the for loop.

Hope this helps,
Rui Barradas
(replying to Fisher Dennis, 28-11-2012 17:42)
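A quick way to check the difference on synthetic data (the sizes below are illustrative):

```r
# Build a character data frame, then time the per-column for loop
# against a single lapply() call; both yield the same numeric result.
DF <- data.frame(a = as.character(runif(1e5)),
                 b = as.character(runif(1e5)),
                 stringsAsFactors = FALSE)

TEMP1 <- DF
t_loop <- system.time(
  for (COL in 1:ncol(TEMP1)) TEMP1[, COL] <- as.numeric(TEMP1[, COL])
)

TEMP2 <- DF
t_lapply <- system.time(
  TEMP2[] <- lapply(TEMP2, as.numeric)
)

identical(TEMP1, TEMP2)   # same result either way
```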
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Using your first approach, this should be faster:
raw <- readLines(con=filename)
dta <- read.table(text=raw[!grepl("[A-DF-Z]", raw)], header=FALSE)
(The character class matches any upper-case letter except E, which appears in the exponents, so only the title and header lines are dropped.)
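For example, on a few lines held in memory (the sample vector is abbreviated from the post):

```r
raw <- c("TABLE NO. 1",
         " PTID TIME AMT",
         " 2.0010E+03 3.9375E-01 5.0000E+03",
         " 2.0010E+03 8.9583E-01 5.0000E+03")
# Keep only lines containing no letter other than E, i.e. the data lines
dta <- read.table(text = raw[!grepl("[A-DF-Z]", raw)], header = FALSE)
dta   # two rows, default column names V1..V3
```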
----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
An interesting approach -- I lose the column names (which I need) but I could get them with something cute such as:
1. read the first few lines only with readLines(FILENAME, n=10)
2. use your approach to read.table -- this will grab the column names
3. replace the headers in the full version with the correct column names
Dennis
Once you've read the data with readLines(),
head(raw)
gets you the first 6 lines, or
head(nchar(raw))
the lengths of those lines. Once you know which row has the headers (here it is 2), just use
names(dta) <- read.table(text=raw[2], stringsAsFactors=FALSE)
to add column names to the data frame.
-------
David
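Putting the pieces of the thread together, a sketch of the full pipeline (the small inline file stands in for the real one):

```r
# Generate a toy file with the repeated two-line header, then read it.
tmp <- tempfile()
block <- c("TABLE NO. 1",
           " PTID TIME AMT",
           " 2.0010E+03 3.9375E-01 5.0000E+03",
           " 2.0010E+03 8.9583E-01 5.0000E+03")
cat(block[rep(1:4, 3)], file = tmp, sep = "\n")   # three copies of the block

raw <- readLines(tmp)
# Drop every line containing a letter other than E (titles and headers)
dta <- read.table(text = raw[!grepl("[A-DF-Z]", raw)], header = FALSE)
# The column headers sit on line 2 of the file; unlist() turns the
# one-row data frame returned by read.table() into a character vector.
names(dta) <- as.character(unlist(read.table(text = raw[2],
                                             stringsAsFactors = FALSE)))
```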
-----Original Message----- From: Fisher Dennis [mailto:fisher at plessthan.com] Sent: Wednesday, November 28, 2012 1:42 PM To: dcarlson at tamu.edu Cc: r-help at r-project.org Subject: Re: [R] Speeding reading of large file An interesting approach -- I lose the column names (which I need) but I could get them with something cute such as: 1. read the first few lines only with readLines(FILENAME, n=10) 2. use your approach to read.table -- this will grab the column names 3. replace the headers in the full version with the correct column names Dennis Fisher MD P < (The "P Less Than" Company) Phone: 1-866-PLessThan (1-866-753-7784) Fax: 1-866-PLessThan (1-866-753-7784) www.PLessThan.com On Nov 28, 2012, at 11:32 AM, David L Carlson wrote:
Using your first approach, this should be faster
raw <- readLines(con=filename)
dta <- read.table(text=raw[!grepl("[A:DF:Z]" ,raw)], header=FALSE)
----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Fisher Dennis Sent: Wednesday, November 28, 2012 11:43 AM To: r-help at r-project.org Subject: [R] Speeding reading of large file R 2.15.1 OS X and Windows Colleagues, I have a file that looks that this: TABLE NO. 1 PTID TIME AMT FORM PERIOD IPRED CWRES EVID CP PRED RES WRES 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00
0.0000E+00 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00
0.0000E+00
0.0000E+00 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00
0.0000E+00
0.0000E+00 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00
0.0000E+00
0.0000E+00 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01
0.0000E+00
0.0000E+00 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01
0.0000E+00
0.0000E+00 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01
0.0000E+00
0.0000E+00 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01
0.0000E+00
0.0000E+00 The file is reasonably large (> 10^6 lines) and the two line header
is
repeated periodically in the file. I need to read this file in as a data frame. Note that the number
of
columns, the column headers, and the number of replicates of the headers are not known in advance. I have tried two approaches to this: First Approach: 1. readLines(FILENAME) to read in the file 2. use grep to find the repeat headers; strip out the repeat headers 3. write() the object to tempfile, read in that temporary file using read.table(tempfile, header=TRUE, skip=1) [an alternative
is
to use textConnection but that does not appear to speed things] Second Approach: 1. TEMP <- read.table(FILENAME, header=TRUE, skip=1, fill=TRUE, as.is=TRUE) 2. get rid of the errant entries with: TEMP[!is.na(as.numeric(TEMP[,1])),] 3. reading of the character entries forced all columns to character mode. Therefore, I convert each column to numeric: for (COL in 1:ncol(TEMP)) TEMP[,COL] <- as.numeric(TEMP[,COL]) The second approach is ~ 20% faster than the first. With the second approach, the conversion to numeric occupies 50% of the elapsed
time.
Is there some approach that would be much faster? For example,
would a
vectorized approach to conversion to numeric improve throughput?
Or,
is there some means to ensure that all data are read as numeric (I
tried to use colClasses but that triggered an error when the text
string was encountered).
############################
A dput version of the data is:
c("TABLE NO. 1", " PTID TIME AMT FORM
PERIOD IPRED CWRES EVID CP PRED
RES WRES",
" 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00
0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00
0.0000E+00", " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00
0.0000E+00
0.0000E+00", " 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00
0.0000E+00
0.0000E+00", " 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00
0.0000E+00
0.0000E+00", " 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01
0.0000E+00
0.0000E+00", " 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01
0.0000E+00
0.0000E+00", " 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01
0.0000E+00
0.0000E+00", " 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01
0.0000E+00
0.0000E+00" ) This can be assembled into a large dataset and written to a file
named
FILENAME with the following code:
cat(c("TABLE NO. 1", " PTID TIME AMT FORM
PERIOD IPRED CWRES EVID CP PRED
RES WRES",
" 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00
0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00
0.0000E+00", " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00
0.0000E+00
0.0000E+00", " 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00
0.0000E+00
0.0000E+00", " 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00
0.0000E+00
0.0000E+00", " 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01
0.0000E+00
0.0000E+00", " 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01
0.0000E+00
0.0000E+00", " 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01
0.0000E+00
0.0000E+00", " 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01
0.0000E+00
0.0000E+00" )[rep(1:10, 1000)], file="FILENAME", sep="\n") Dennis Dennis Fisher MD P < (The "P Less Than" Company) Phone: 1-866-PLessThan (1-866-753-7784) Fax: 1-866-PLessThan (1-866-753-7784) www.PLessThan.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting- guide.html and provide commented, minimal, self-contained, reproducible code.
-----Original Message----- From: r-help-bounces at r-project.org [mailto:r-help-bounces at r- project.org] On Behalf Of Fisher Dennis Sent: Wednesday, November 28, 2012 11:42 AM To: dcarlson at tamu.edu Cc: r-help at r-project.org Subject: Re: [R] Speeding reading of large file An interesting approach -- I lose the column names (which I need) but I could get them with something cute such as: 1. read the first few lines only with readLines(FILENAME, n=10) 2. use your approach to read.table -- this will grab the column names 3. replace the headers in the full version with the correct column names Dennis Fisher MD P < (The "P Less Than" Company) Phone: 1-866-PLessThan (1-866-753-7784) Fax: 1-866-PLessThan (1-866-753-7784) www.PLessThan.com On Nov 28, 2012, at 11:32 AM, David L Carlson wrote:
Using your first approach, this should be faster
raw <- readLines(con=filename)
dta <- read.table(text=raw[!grepl("[A-DF-Z]", raw)], header=FALSE)
----------------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77843-4352
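[Editor's note: the two ideas above combine into a short self-contained sketch. It writes its own tiny two-header demo file to a temp file, standing in for Dennis's real FILENAME; the column names are illustrative.]

```r
## Tiny stand-in for the NONMEM-style table, with the two-line header repeated.
f <- tempfile()
cat(c("TABLE NO. 1",
      " PTID TIME AMT FORM",
      " 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00",
      "TABLE NO. 1",
      " PTID TIME AMT FORM",
      " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00"),
    file = f, sep = "\n")

raw <- readLines(f)
## [A-DF-Z] matches any capital letter except the E of scientific notation,
## so it flags both header lines but no data line.
dta <- read.table(text = raw[!grepl("[A-DF-Z]", raw)], header = FALSE)
## Recover the column names from the file's second line.
names(dta) <- scan(text = raw[2], what = "", quiet = TRUE)
```

On the real file the same filter-and-rename steps apply unchanged; only the construction of the demo file is extra.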
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Fisher Dennis
Sent: Wednesday, November 28, 2012 11:43 AM
To: r-help at r-project.org
Subject: [R] Speeding reading of large file
Dennis,
I used your code to create the test file, and then used two different methods to read the file:
# method 1
system.time({
fisher <- read.table('c:/tmp/fisher.txt', header=TRUE,skip=1,fill=TRUE, as.is=TRUE)
fisher <- data.frame(apply(fisher,2,as.numeric))
fisher <- fisher[!is.na(fisher$PTID),]
})
user system elapsed
0.14 0.00 0.14
There were 12 warnings (use warnings() to see them)
# method 2
system.time({
raw <- readLines(con='c:/tmp/fisher.txt')
fisher2 <- read.table(text=raw[!grepl("[A-DF-Z]", raw)], header=FALSE, fill=TRUE)
names <- read.table('c:/tmp/fisher.txt',header=TRUE,skip=1,nrows=1)
colnames(fisher2) <- colnames(names)
})
user system elapsed
1.31 0.00 1.31
Method 1 was substantially faster than method 2. One thing I don't like about method 1 is the warnings (about NAs being created by as.numeric). However, they are essentially harmless.
Hope this is helpful,
Dan
Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
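[Editor's note: the NA-coercion warnings from method 1 can also be silenced deliberately, since the NAs they flag are exactly the interleaved header rows being discarded anyway. A sketch, with an inline demo file standing in for c:/tmp/fisher.txt:]

```r
## Small demo file with the two-line header repeated mid-file.
f <- tempfile()
cat(c("TABLE NO. 1",
      " PTID TIME AMT FORM",
      " 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00",
      "TABLE NO. 1",
      " PTID TIME AMT FORM",
      " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00"),
    file = f, sep = "\n")

fisher <- read.table(f, header = TRUE, skip = 1, fill = TRUE, as.is = TRUE)
## Coerce every column at once; the repeated header rows become NA,
## and the following line drops them.
fisher[] <- suppressWarnings(lapply(fisher, as.numeric))
fisher <- fisher[!is.na(fisher$PTID), ]
```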
How long was the file that you tested? Here is a test with a file that is 110,400 lines long with 4,416 replicated headers that have to be removed. Using 'text=' or textConnection is very slow for these operations; writing to a temporary file can be faster for especially large files. Notice that the temporary-file route is the fastest method for this file. Here are three approaches and their times:

############################
system.time({
+ # approach #1 - read in file and then delete rows with NAs
+ x <- read.table('/temp/text.txt', as.is = TRUE, header = TRUE)
+ # convert to numeric
+ x[] <- lapply(x, as.numeric)
+ x <- x[!is.na(x[,1]), ]
+ })
user system elapsed
0.70 0.00 0.72
Warning messages:
1: In lapply(x, as.numeric) : NAs introduced by coercion
2: In lapply(x, as.numeric) : NAs introduced by coercion
3: In lapply(x, as.numeric) : NAs introduced by coercion
4: In lapply(x, as.numeric) : NAs introduced by coercion
5: In lapply(x, as.numeric) : NAs introduced by coercion
str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: num  1 1 1 1 1 1 1 1 1 1 ...
 $ b: num  2 2 2 2 2 2 2 2 2 2 ...
 $ c: num  3 3 3 3 3 3 3 3 3 3 ...
 $ d: num  4 4 4 4 4 4 4 4 4 4 ...
 $ e: num  5 5 5 5 5 5 5 5 5 5 ...
colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920
system.time({
+ # approach #2 -- read the lines, delete header, rewrite to temp file
+ # and then read in with read.table
+ x <- readLines('/temp/text.txt')
+ firstLine <- x[1L] # save header since deleted by 'grepl'
+ x <- c(firstLine, x[grepl("^[0-9]", x)])  # accept only lines that start with numeric
+ temp <- tempfile()
+ writeLines(x, temp)
+ x <- read.table(temp, as.is = TRUE, header = TRUE)
+ })
user system elapsed
0.55 0.02 0.56
str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: int  1 1 1 1 1 1 1 1 1 1 ...
 $ b: int  2 2 2 2 2 2 2 2 2 2 ...
 $ c: int  3 3 3 3 3 3 3 3 3 3 ...
 $ d: int  4 4 4 4 4 4 4 4 4 4 ...
 $ e: int  5 5 5 5 5 5 5 5 5 5 ...
colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920
system.time({
+ # approach #3 -- read the lines, delete header, then use 'text' on read.table
+ x <- readLines('/temp/text.txt')
+ firstLine <- x[1L]
+ x <- c(firstLine, x[grepl("^[0-9]", x)])
+ x <- read.table(text = x, as.is = TRUE, header = TRUE)
+ })
user system elapsed
29.01 0.01 29.62
str(x)
'data.frame':   105984 obs. of  5 variables:
 $ a: int  1 1 1 1 1 1 1 1 1 1 ...
 $ b: int  2 2 2 2 2 2 2 2 2 2 ...
 $ c: int  3 3 3 3 3 3 3 3 3 3 ...
 $ d: int  4 4 4 4 4 4 4 4 4 4 ...
 $ e: int  5 5 5 5 5 5 5 5 5 5 ...
colSums(x)
     a      b      c      d      e
105984 211968 317952 423936 529920

On Wed, Nov 28, 2012 at 7:01 PM, Nordlund, Dan (DSHS/RDA) <NordlDJ at dshs.wa.gov> wrote:
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
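[Editor's note: Jim's temp-file route also answers Dennis's colClasses question. Once the repeated headers are filtered out, every remaining cell is numeric, so colClasses = "numeric" no longer hits the text rows that triggered Dennis's error, and it skips read.table's type detection. A sketch with an inline demo file:]

```r
## Demo file with the two-line header repeated mid-file.
f <- tempfile()
cat(c("TABLE NO. 1",
      " PTID TIME AMT FORM",
      " 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00",
      "TABLE NO. 1",
      " PTID TIME AMT FORM",
      " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00"),
    file = f, sep = "\n")

x    <- readLines(f)
keep <- c(x[2L], x[!grepl("[A-DF-Z]", x)])  # header once, then data rows only
tmp  <- tempfile()
writeLines(keep, tmp)
## Safe now: nothing but numbers remains below the single header line,
## so the same class can be recycled across every column.
dat <- read.table(tmp, header = TRUE, colClasses = "numeric")
```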
Jim,
My original file used Dennis's script, so it was 10,000 lines. I created a 100,000-line file and the relative results were the same. I ran your code on the file and your second and third approaches did not produce correct results. It may be because the original data example had two header lines interspersed throughout the file and the numbers were in scientific notation (the data lines also begin with spaces, so "^[0-9]" matches none of them).
I modified the grep function to work with the data file, but not really being a proficient R programmer I make no claims about efficiency. But here are my results.
cat(c("TABLE NO. 1", " PTID TIME AMT FORM PERIOD IPRED CWRES EVID CP PRED RES WRES",
" 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00 0.0000E+00 3.3389E+00 0.0000E+00 1.0000E+00 0.0000E+00 3.5321E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 5.8164E+00 0.0000E+00 1.0000E+00 0.0000E+00 5.9300E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 1.9167E+00 5.0000E+03 2.0000E+00 0.0000E+00 8.3633E+00 0.0000E+00 1.0000E+00 0.0000E+00 8.7011E+00 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.0092E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.0324E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 2.9375E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1490E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1688E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 3.4167E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.2940E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.3236E+01 0.0000E+00 0.0000E+00",
" 2.0010E+03 4.4583E+00 5.0000E+03 2.0000E+00 0.0000E+00 1.1267E+01 0.0000E+00 1.0000E+00 0.0000E+00 1.1324E+01 0.0000E+00 0.0000E+00"
)[rep(1:10, 10000)], file="c:/tmp/fisher.txt", sep="\n")
system.time({
+ # approach #1 - read in file and then delete rows with NAs
+ x <- read.table('c:/tmp/fisher.txt', as.is = TRUE, skip=1, fill=TRUE, header = TRUE)
+ # convert to numeric
+ x[] <- lapply(x, as.numeric)
+ x <- x[!is.na(x[,1]), ]
+ })
user system elapsed
1.32 0.04 1.37
There were 12 warnings (use warnings() to see them)
colSums(x)
PTID TIME AMT FORM PERIOD IPRED
160080000.0 178937.8 400000000.0 160000.0 0.0 633076.0
CWRES EVID CP PRED RES WRES
0.0 80000.0 0.0 647352.0 0.0 0.0
system.time({
+ # approach #2 -- read the lines, delete header, rewrite to temp file
+ # and then read in with read.table
+ x <- readLines('c:/tmp/fisher.txt')
+ firstLine <- x[2L] # save header since deleted by 'grepl'
+ x <- c(firstLine, x[!grepl("[A-DF-Z]", x)])  # keep only lines whose sole letter is the E of scientific notation
+ temp <- tempfile()
+ writeLines(x, temp)
+ x <- read.table(temp, as.is = TRUE, header = TRUE)
+ })
user system elapsed
2.51 0.08 2.63
colSums(x)
PTID TIME AMT FORM PERIOD IPRED
160080000.0 178937.8 400000000.0 160000.0 0.0 633076.0
CWRES EVID CP PRED RES WRES
0.0 80000.0 0.0 647352.0 0.0 0.0
system.time({
+ # approach #3 -- read the lines, delete header, then use 'text' on read.table
+ x <- readLines('c:/tmp/fisher.txt')
+ firstLine <- x[2L]
+ x <- c(firstLine, x[!grepl("[A-DF-Z]", x)])
+ x <- read.table(text = x, as.is = TRUE, header = TRUE)
+ })
user system elapsed
125.64 0.03 125.67
colSums(x)
PTID TIME AMT FORM PERIOD IPRED
160080000.0 178937.8 400000000.0 160000.0 0.0 633076.0
CWRES EVID CP PRED RES WRES
0.0 80000.0 0.0 647352.0 0.0 0.0

Dan
Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204
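[Editor's note: the vectorized character-to-numeric conversion Dennis asked about can be done in one shot with scan(), bypassing read.table entirely; the column count falls out of the recovered names. A sketch, again with an inline demo file in place of the real one:]

```r
## Demo file with the two-line header repeated mid-file.
f <- tempfile()
cat(c("TABLE NO. 1",
      " PTID TIME AMT FORM",
      " 2.0010E+03 3.9375E-01 5.0000E+03 2.0000E+00",
      "TABLE NO. 1",
      " PTID TIME AMT FORM",
      " 2.0010E+03 8.9583E-01 5.0000E+03 2.0000E+00"),
    file = f, sep = "\n")

x   <- readLines(f)
nms <- scan(text = x[2L], what = "", quiet = TRUE)
## One vectorized conversion of every remaining field straight to numeric.
vals <- scan(text = x[!grepl("[A-DF-Z]", x)], quiet = TRUE)
dat  <- as.data.frame(matrix(vals, ncol = length(nms), byrow = TRUE))
names(dat) <- nms
```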
-----Original Message-----
From: jim holtman [mailto:jholtman at gmail.com]
Sent: Wednesday, November 28, 2012 6:05 PM
To: Nordlund, Dan (DSHS/RDA)
Cc: Fisher Dennis; r-help at r-project.org
Subject: Re: [R] Speeding reading of large file