
Speeding reading of large file

Jim,

My original file was built with Dennis' script, so it had 10,000 lines.  I then created a 100,000-line file, and the relative results were the same.  When I ran your code on that file, your second and third approaches did not produce correct results.  That may be because the original data example had two header lines interspersed throughout the file and some of the numbers were in scientific notation.
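To illustrate the scientific-notation point on a made-up three-line sample (the sample and both patterns below are illustrative assumptions, not the code from the earlier post): a filter that rejects every line containing a capital letter would also reject the data lines, because of the E in the exponent, whereas a character class that leaves out E keeps them.

```r
# Hypothetical three-line sample mimicking the layout of the data file
sample_lines <- c("TABLE NO.  1",
                  " PTID        TIME",
                  "  2.0010E+03  3.9375E-01")

# Flags any line containing a capital letter; this also flags the data
# line because of the "E" in the exponent
any_letter <- grepl("[A-Z]", sample_lines)

# Flags lines containing a capital letter other than "E"
non_e_letter <- grepl("[A-DF-Z]", sample_lines)

kept_any   <- sample_lines[!any_letter]    # lines surviving the first filter
kept_non_e <- sample_lines[!non_e_letter]  # lines surviving the second filter

print(kept_any)
print(kept_non_e)
```

This relies on the header lines containing at least one capital letter other than E, which is true of the two headers in this file but is not guaranteed in general.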

I modified the grepl() call to work with the data file; not being a proficient R programmer, I make no claims about efficiency.  Here are my results.

cat(c("TABLE NO.  1", " PTID        TIME        AMT         FORM        PERIOD      IPRED       CWRES       EVID        CP          PRED        RES         WRES", 
"  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00", 
"  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00  0.0000E+00  0.0000E+00", 
"  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00  0.0000E+00  0.0000E+00", 
"  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00  0.0000E+00  0.0000E+00", 
"  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01  0.0000E+00  0.0000E+00", 
"  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01  0.0000E+00  0.0000E+00", 
"  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01  0.0000E+00  0.0000E+00", 
"  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01  0.0000E+00  0.0000E+00"
)[rep(1:10, 10000)], file="c:/tmp/fisher.txt", sep="\n")

system.time({
+     # approach #1 - read in file and then delete rows with NAs
+     x <- read.table('c:/tmp/fisher.txt', as.is = TRUE, skip=1, fill=TRUE, header = TRUE)
+     # convert to numeric
+     x[] <- lapply(x, as.numeric)
+     x <- x[!is.na(x[,1]), ]
+ })
   user  system elapsed 
   1.32    0.04    1.37 
There were 12 warnings (use warnings() to see them)
PTID        TIME         AMT        FORM      PERIOD       IPRED 
160080000.0    178937.8 400000000.0    160000.0         0.0    633076.0 
      CWRES        EVID          CP        PRED         RES        WRES 
        0.0     80000.0         0.0    647352.0         0.0         0.0

system.time({
+     # approach #2 -- read the lines, delete header, rewrite to temp file
+     # and then read in with read.table
+     x <- readLines('c:/tmp/fisher.txt')
+     firstLine <- x[2L]  # save header since deleted by 'grepl'
+     x <- c(firstLine, x[!grepl("[A-DF-Z]", x)])  # keep only lines with no capital letter other than E (the exponent marker)
+     temp <- tempfile()
+     writeLines(x, temp)
+     x <- read.table(temp, as.is = TRUE, header = TRUE)
+ })
   user  system elapsed 
   2.51    0.08    2.63
PTID        TIME         AMT        FORM      PERIOD       IPRED 
160080000.0    178937.8 400000000.0    160000.0         0.0    633076.0 
      CWRES        EVID          CP        PRED         RES        WRES 
        0.0     80000.0         0.0    647352.0         0.0         0.0

system.time({
+     # approach #3 -- read the lines, delete header, then use 'text' on read.table
+     x <- readLines('c:/tmp/fisher.txt')
+     firstLine <- x[2L]
+     x <- c(firstLine, x[!grepl("[A-DF-Z]", x)])
+     x <- read.table(text = x, as.is = TRUE, header = TRUE)
+ })
   user  system elapsed 
 125.64    0.03  125.67
PTID        TIME         AMT        FORM      PERIOD       IPRED 
160080000.0    178937.8 400000000.0    160000.0         0.0    633076.0 
      CWRES        EVID          CP        PRED         RES        WRES 
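
Since the 'text' route in approach #3 was so slow here, another possibility might be to skip read.table after the filtering step and split the fields directly. The following is only a sketch on a made-up seven-line, two-column sample (the sample and the names col_names, data_lines and fields are my own, for illustration); I have not timed it on the full file, and it assumes every data line has the same number of whitespace-separated numeric fields.

```r
# Hypothetical sample in the same layout as the file, cut down to two columns
sample_lines <- c("TABLE NO.  1",
                  " PTID        TIME",
                  "  2.0010E+03  3.9375E-01",
                  "  2.0010E+03  8.9583E-01",
                  "TABLE NO.  1",
                  " PTID        TIME",
                  "  2.0010E+03  1.4583E+00")

# Column names come from the second line of the file
col_names <- strsplit(sub("^[[:space:]]+", "", sample_lines[2L]),
                      "[[:space:]]+")[[1L]]

# Keep only lines with no capital letter other than E
data_lines <- sample_lines[!grepl("[A-DF-Z]", sample_lines)]

# Strip leading blanks, split on runs of whitespace, convert in one pass
fields <- strsplit(sub("^[[:space:]]+", "", data_lines), "[[:space:]]+")
x <- matrix(as.numeric(unlist(fields)), ncol = length(col_names),
            byrow = TRUE, dimnames = list(NULL, col_names))
x <- as.data.frame(x)

print(colSums(x))
```

Comparing colSums(x) against the sums from approach #1 would be a quick way to check that the result agrees before trusting it.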
   
Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA 98504-5204