Speeding reading of a large file

5 messages · Dennis Fisher, Juliet Hannah, Rui Barradas

#
Colleagues,  

This past week, I asked the following question:

	I have a file that looks like this:

	TABLE NO.  1
	 PTID        TIME        AMT         FORM        PERIOD      IPRED       CWRES       EVID        CP          PRED        RES         WRES
	  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00 0.0000E+00  0.0000E+00
	  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00  0.0000E+00  3.3389E+00  0.0000E+00  1.0000E+00  0.0000E+00  3.5321E+00 0.0000E+00  0.0000E+00
	  2.0010E+03  1.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  5.8164E+00  0.0000E+00  1.0000E+00  0.0000E+00  5.9300E+00 0.0000E+00  0.0000E+00
	  2.0010E+03  1.9167E+00  5.0000E+03  2.0000E+00  0.0000E+00  8.3633E+00  0.0000E+00  1.0000E+00  0.0000E+00  8.7011E+00 0.0000E+00  0.0000E+00
	  2.0010E+03  2.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.0092E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.0324E+01 0.0000E+00  0.0000E+00
	  2.0010E+03  2.9375E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1490E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1688E+01 0.0000E+00  0.0000E+00
	  2.0010E+03  3.4167E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.2940E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.3236E+01 0.0000E+00  0.0000E+00
	  2.0010E+03  4.4583E+00  5.0000E+03  2.0000E+00  0.0000E+00  1.1267E+01  0.0000E+00  1.0000E+00  0.0000E+00  1.1324E+01 0.0000E+00  0.0000E+00

	The file is reasonably large (> 10^6 lines) and the two-line header is repeated periodically in the file.
	I need to read this file in as a data frame.  Note that the number of columns, the column headers, and the number of replicates of the headers are not known in advance.

I received a number of replies, many of them quite useful.  Of these, one beat out all the others in my benchmarking using files ranging from 10^5 to 10^6 lines.
That version, provided by Jim Holtman, was:
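	# Skip the first "TABLE NO." line and take column names from the next line; fill = TRUE tolerates the short repeated "TABLE NO." lines
	x <- read.table(FILE, as.is = TRUE, skip = 1, fill = TRUE, header = TRUE)
	# Coerce every column to numeric; text from the repeated headers becomes NA (with a coercion warning)
	x[] <- lapply(x, as.numeric)
	# Drop the rows that came from the repeated headers
	x <- x[!is.na(x[, 1]), ]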
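
To see what each of the three steps does, here is a minimal, self-contained sketch on a made-up four-column file. The file contents, the temporary file, and the use of suppressWarnings() are assumptions for illustration and were not part of the original post:

```r
# Build a small synthetic file with a repeated two-line header (made-up values).
FILE <- tempfile(fileext = ".txt")
block <- c(
  "TABLE NO.  1",
  " PTID        TIME        AMT         FORM",
  "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00",
  "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00"
)
writeLines(c(block, block), FILE)

# Step 1: read everything; the repeated header lines arrive as ordinary rows,
# so the affected columns are read as character rather than numeric.
x <- read.table(FILE, as.is = TRUE, skip = 1, fill = TRUE, header = TRUE)

# Step 2: coerce each column; header text turns into NA.
# suppressWarnings() only hides the "NAs introduced by coercion" messages.
x[] <- suppressWarnings(lapply(x, as.numeric))

# Step 3: keep only rows whose first column is a real number.
x <- x[!is.na(x[, 1]), ]

str(x)
unlink(FILE)
```

On this toy input the result should be a data frame with four numeric columns and four rows, the two header pairs having been dropped.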

Other versions involved readLines, followed by edits, followed by cat (or write) to a temp file, then read.table again.  
The overhead of invoking readLines, write/cat, and read.table was substantially larger than that of the read.table / as.numeric / indexing strategy.
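
For comparison, the readLines route might look roughly like the sketch below. This is a reconstruction of the general idea, not any respondent's actual code; the synthetic file, the variable names, and the assumption that every repeated column-name line is identical to the first one are all illustrative:

```r
# Same kind of synthetic input as in the post (values are made up).
FILE <- tempfile(fileext = ".txt")
block <- c(
  "TABLE NO.  1",
  " PTID        TIME        AMT         FORM",
  "  2.0010E+03  3.9375E-01  5.0000E+03  2.0000E+00",
  "  2.0010E+03  8.9583E-01  5.0000E+03  2.0000E+00"
)
writeLines(c(block, block), FILE)

# Read the raw text, then locate the "TABLE NO." lines.
lines    <- readLines(FILE)
is_table <- grepl("^\\s*TABLE NO\\.", lines)

# The column-name line is assumed to follow the first "TABLE NO." line.
header <- lines[which(is_table)[1] + 1]

# Drop all "TABLE NO." lines and every copy of the column-name line.
keep <- !is_table & lines != header

# Write one header plus the data rows to a temp file and parse that.
tmp <- tempfile(fileext = ".txt")
writeLines(c(header, lines[keep]), tmp)
y <- read.table(tmp, header = TRUE)

str(y)
unlink(c(FILE, tmp))
```

This version touches the data three times (read as text, write, read again), which is consistent with the extra overhead described above.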

Thanks for the input from many folks.

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
2 days later
#
All,

Can someone describe what

 x[]             <- lapply(x, as.numeric)

does? I see that it is putting the list elements into a data frame. The
result of lapply is a list, so how does this become
a data frame?

Thanks,

Juliet
On Mon, Dec 3, 2012 at 5:49 PM, Fisher Dennis <fisher at plessthan.com> wrote:
#
Hello,

Because x[] keeps the dimensions, unlike just x.
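
A small illustration of the difference, using a made-up two-column data frame (the names df1 and df2 are hypothetical):

```r
df1 <- data.frame(a = c("1", "2"), b = c("3", "x"), stringsAsFactors = FALSE)
df2 <- df1

# Plain assignment replaces the whole object with lapply's return value, a list.
df1 <- suppressWarnings(lapply(df1, as.numeric))
class(df1)   # "list"

# Empty-bracket assignment replaces the columns inside the existing data frame,
# so the class, names, and row names of df2 are left in place.
df2[] <- suppressWarnings(lapply(df2, as.numeric))
class(df2)   # "data.frame"
```

In both cases the non-numeric entry "x" becomes NA; only the container differs.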

Hope this helps,

Rui Barradas
On 06-12-2012 16:24, Juliet Hannah wrote:
#
Thanks, it does help. Is it possible to elaborate on specifically
why this syntax
preserves dimensions? Is it correct to just say that even though
lapply returns a list, x[] forces x to have the
same dimensions?
On Thu, Dec 6, 2012 at 11:53 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
#
Hello,

Yes, x[] forces x to keep its dimensions. In your original post you 
asked "how does this become a data frame". It doesn't _become_ one, it 
already _is_ one. The same goes for vectors, matrices and arrays. The 
dimensions stay the same.
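
The same behaviour can be checked on a matrix and a named vector. This is a minimal sketch with made-up objects (m, m2, and v are hypothetical names):

```r
m <- matrix(1:6, nrow = 2)
m[] <- 0                 # values are replaced inside the existing 2 x 3 matrix
dim(m)                   # 2 3

m2 <- matrix(1:6, nrow = 2)
m2 <- 0                  # the name m2 now refers to a plain length-1 vector
dim(m2)                  # NULL

v <- c(a = 1, b = 2, c = 3)
v[] <- c(10, 20, 30)     # values change, the names attribute is kept
names(v)                 # "a" "b" "c"
```

In each case the empty-bracket form goes through R's replacement function `[<-`, which modifies the contents of the existing object instead of binding the name to a new one.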

Rui Barradas
On 06-12-2012 17:39, Juliet Hannah wrote: