Skip to content

speeding read.table

6 messages · Jason Edgecombe, Dennis Fisher, Rui Barradas +2 more

#
R 2.15.1
OS X

Colleagues,

I am reading a 1 GB file into R using read.table.  The file consists of 100 tables, each of which is headed by two lines of characters.  
The first of these lines is:
	TABLE NO.  1
The second is a list of column headers.

For example:
TABLE NO.  1
 COL1        COL2        COL3        COL4        COL5        COL6        COL7        COL8        COL9        COL10       COL11       COL12
  1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
  1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08  0.0000E+00  0.0000E+00
  1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00  1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13  0.0000E+00  0.0000E+00

Later something similar appears:
TABLE NO.  1
 COL1        COL2        COL3        COL4        COL5        COL6        COL7        COL8        COL9        COL10       COL11       COL12
  1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00
  1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00  1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08  0.0000E+00  0.0000E+00
  1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00  1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13  0.0000E+00  0.0000E+00

I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines 

read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option)

My word-around has been to: 
	1.  read the table with readLines
	2.  remove the problematic lines
	3.  write the file to disk
	4.  read the file with read.table.
However, this process is slow.

I though about using "comment.char" as a means of avoiding reading the problematic lines.  However, comment.char does not accept ="[A-Z]" 

Are there any clever workarounds for this?

Dennis


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
#
On 10/18/2012 09:57 AM, Fisher Dennis wrote:
Create a connection to a pipe, where pipe reads from the grep command. 
Grep can exlude the problematic lines. Use the pipe object as your 
connection in read.table.
#
Jason

Are you suggesting grep in R or grep in the system?  If the latter, this won't work because I need to implement this same procedure in Windows (sorry about not mentioning this), in which grep does not exist.  If in R, the syntax is not obvious -- could you provide an example?

Dennis

Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
On Oct 18, 2012, at 7:10 AM, Jason Edgecombe wrote:

            
#
Hello,

Try the following, readaing your file into 'x', using readLines.



tc <- textConnection("
TABLE NO.  1
  COL1        COL2        COL3        COL4        COL5 COL6        
COL7        COL8        COL9        COL10 COL11       COL12
   1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00 
1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00 0.0000E+00  
0.0000E+00
   1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00 
1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08 0.0000E+00  
0.0000E+00
   1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00 
1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13 0.0000E+00  
0.0000E+00

TABLE NO.  1
  COL1        COL2        COL3        COL4        COL5 COL6        
COL7        COL8        COL9        COL10 COL11       COL12
   1.0010E+05  0.0000E+00  1.0000E+00  1.0000E+03 -1.0000E+00 
1.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00  0.0000E+00 0.0000E+00  
0.0000E+00
   1.0010E+05  1.0001E+01  1.0000E+00  1.0000E+03 -1.0000E+00 
1.0000E+00  2.2737E-14 -2.2737E-14  0.0000E+00  1.9281E-08 0.0000E+00  
0.0000E+00
   1.0010E+05  2.4000E+01  1.0000E+00  2.0000E+03 -1.0000E+00 
1.0000E+00  5.7541E-15 -5.7541E-15  0.0000E+00  5.1115E-13 0.0000E+00  
0.0000E+00
")

x <- readLines(tc)
close(tc)

#------------------------ starts here
x <- x[ x != "" ]

i1 <- grep("TABLE", x)
i2 <- grep("COL", x)
y <- x[-c(i1, i2)]

tc <- textConnection(y)
dat <- read.table(tc)
close(tc)

cnames <- unlist(strsplit(x[2], " "))
names(dat) <- cnames[cnames != ""]


Hope this helps,

Rui Barradas
Em 18-10-2012 14:57, Fisher Dennis escreveu:
#
HI,

You can also try this.??
Should remove the problematic lines, but it may not match up to Rui's method in terms of speed.

input<-readLines(textConnection("TABLE NO.? 1
?COL1??????? COL2??????? COL3??????? COL4??????? COL5??????? COL6??????? COL7??????? COL8??????? COL9??????? COL10?????? COL11?????? COL12
? 1.0010E+05? 0.0000E+00? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 1.0001E+01? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 2.2737E-14 -2.2737E-14? 0.0000E+00? 1.9281E-08? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 2.4000E+01? 1.0000E+00? 2.0000E+03 -1.0000E+00? 1.0000E+00? 5.7541E-15 -5.7541E-15? 0.0000E+00? 5.1115E-13? 0.0000E+00? 0.0000E+00


TABLE NO.? 1
?COL1??????? COL2??????? COL3??????? COL4??????? COL5??????? COL6??????? COL7??????? COL8??????? COL9??????? COL10?????? COL11?????? COL12
? 1.0010E+05? 0.0000E+00? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 1.0001E+01? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 2.2737E-14 -2.2737E-14? 0.0000E+00? 1.9281E-08? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 2.4000E+01? 1.0000E+00? 2.0000E+03 -1.0000E+00? 1.0000E+00? 5.7541E-15 -5.7541E-15? 0.0000E+00? 5.1115E-13? 0.0000E+00? 0.0000E+00
"))

?dat1<-do.call(rbind,lapply(lapply(input[grep("\\d+",input)][grepl("[\\d+]",input[grep("\\d+",input)])],function(x) unlist(strsplit(x, split=" +"))),function(x) unlist(strsplit(x, split="+ "))))
dat2<-data.frame(dat1)
dat2[]<-sapply(dat2,function(x) as.numeric(as.character(x)))
names(dat2)<-lapply(strsplit(input[grepl("COL{0,1}",input)]," +"),function(x) unlist(strsplit(x,"+ ")))[[1]]
?dat2
#??? COL1?? COL2 COL3 COL4 COL5 COL6?????? COL7??????? COL8 COL9????? COL10
#1 100100? 0.000??? 1 1000?? -1??? 1 0.0000e+00? 0.0000e+00??? 0 0.0000e+00
#2 100100 10.001??? 1 1000?? -1??? 1 2.2737e-14 -2.2737e-14??? 0 1.9281e-08
#3 100100 24.000??? 1 2000?? -1??? 1 5.7541e-15 -5.7541e-15??? 0 5.1115e-13
#4 100100? 0.000??? 1 1000?? -1??? 1 0.0000e+00? 0.0000e+00??? 0 0.0000e+00
#5 100100 10.001??? 1 1000?? -1??? 1 2.2737e-14 -2.2737e-14??? 0 1.9281e-08
#6 100100 24.000??? 1 2000?? -1??? 1 5.7541e-15 -5.7541e-15??? 0 5.1115e-13
#? COL11 COL12
#1???? 0???? 0
#2???? 0???? 0
#3???? 0???? 0
#4???? 0???? 0
#5???? 0???? 0
#6???? 0???? 0


A.K.




----- Original Message -----
From: Fisher Dennis <fisher at plessthan.com>
To: r-help at stat.math.ethz.ch
Cc: 
Sent: Thursday, October 18, 2012 9:57 AM
Subject: [R] speeding read.table

R 2.15.1
OS X

Colleagues,

I am reading a 1 GB file into R using read.table.? The file consists of 100 tables, each of which is headed by two lines of characters.? 
The first of these lines is:
??? TABLE NO.? 1
The second is a list of column headers.

For example:
TABLE NO.? 1
COL1? ? ? ? COL2? ? ? ? COL3? ? ? ? COL4? ? ? ? COL5? ? ? ? COL6? ? ? ? COL7? ? ? ? COL8? ? ? ? COL9? ? ? ? COL10? ? ?  COL11? ? ?  COL12
? 1.0010E+05? 0.0000E+00? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 1.0001E+01? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 2.2737E-14 -2.2737E-14? 0.0000E+00? 1.9281E-08? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 2.4000E+01? 1.0000E+00? 2.0000E+03 -1.0000E+00? 1.0000E+00? 5.7541E-15 -5.7541E-15? 0.0000E+00? 5.1115E-13? 0.0000E+00? 0.0000E+00

Later something similar appears:
TABLE NO.? 1
COL1? ? ? ? COL2? ? ? ? COL3? ? ? ? COL4? ? ? ? COL5? ? ? ? COL6? ? ? ? COL7? ? ? ? COL8? ? ? ? COL9? ? ? ? COL10? ? ?  COL11? ? ?  COL12
? 1.0010E+05? 0.0000E+00? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 1.0001E+01? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 2.2737E-14 -2.2737E-14? 0.0000E+00? 1.9281E-08? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 2.4000E+01? 1.0000E+00? 2.0000E+03 -1.0000E+00? 1.0000E+00? 5.7541E-15 -5.7541E-15? 0.0000E+00? 5.1115E-13? 0.0000E+00? 0.0000E+00

I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines 

read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option)

My word-around has been to: 
??? 1.? read the table with readLines
??? 2.? remove the problematic lines
??? 3.? write the file to disk
??? 4.? read the file with read.table.
However, this process is slow.

I though about using "comment.char" as a means of avoiding reading the problematic lines.? However, comment.char does not accept ="[A-Z]" 

Are there any clever workarounds for this?

Dennis


Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
#
On Thu, Oct 18, 2012 at 10:14 AM, Fisher Dennis <fisher at plessthan.com> wrote:
Windows does have find and findstr which are similar to grep.

   if (.Platform$OS.type == "windows") # use findstr
   else # use grep
help findstr