R 2.15.1 OS X Colleagues, I am reading a 1 GB file into R using read.table. The file consists of 100 tables, each of which is headed by two lines of characters. The first of these lines is: TABLE NO. 1 The second is a list of column headers. For example: TABLE NO. 1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00 Later something similar appears: TABLE NO. 1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00 I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option) My word-around has been to: 1. read the table with readLines 2. remove the problematic lines 3. write the file to disk 4. read the file with read.table. However, this process is slow. I though about using "comment.char" as a means of avoiding reading the problematic lines. However, comment.char does not accept ="[A-Z]" Are there any clever workarounds for this? Dennis Dennis Fisher MD P < (The "P Less Than" Company) Phone: 1-866-PLessThan (1-866-753-7784) Fax: 1-866-PLessThan (1-866-753-7784) www.PLessThan.com
speeding read.table
6 messages · Jason Edgecombe, Dennis Fisher, Rui Barradas +2 more
On 10/18/2012 09:57 AM, Fisher Dennis wrote:
R 2.15.1 OS X Colleagues, I am reading a 1 GB file into R using read.table. The file consists of 100 tables, each of which is headed by two lines of characters. The first of these lines is: TABLE NO. 1 The second is a list of column headers. For example: TABLE NO. 1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00 Later something similar appears: TABLE NO. 1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00 I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option) My word-around has been to: 1. read the table with readLines 2. remove the problematic lines 3. write the file to disk 4. read the file with read.table. However, this process is slow. I though about using "comment.char" as a means of avoiding reading the problematic lines. However, comment.char does not accept ="[A-Z]" Are there any clever workarounds for this?
Create a connection to a pipe, where pipe reads from the grep command. Grep can exlude the problematic lines. Use the pipe object as your connection in read.table.
Jason Are you suggesting grep in R or grep in the system? If the latter, this won't work because I need to implement this same procedure in Windows (sorry about not mentioning this), in which grep does not exist. If in R, the syntax is not obvious -- could you provide an example? Dennis Dennis Fisher MD P < (The "P Less Than" Company) Phone: 1-866-PLessThan (1-866-753-7784) Fax: 1-866-PLessThan (1-866-753-7784) www.PLessThan.com
On Oct 18, 2012, at 7:10 AM, Jason Edgecombe wrote:
On 10/18/2012 09:57 AM, Fisher Dennis wrote:
R 2.15.1 OS X Colleagues, I am reading a 1 GB file into R using read.table. The file consists of 100 tables, each of which is headed by two lines of characters. The first of these lines is: TABLE NO. 1 The second is a list of column headers. For example: TABLE NO. 1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00 Later something similar appears: TABLE NO. 1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00 I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option) My word-around has been to: 1. read the table with readLines 2. remove the problematic lines 3. write the file to disk 4. read the file with read.table. However, this process is slow. I though about using "comment.char" as a means of avoiding reading the problematic lines. However, comment.char does not accept ="[A-Z]" Are there any clever workarounds for this?
Create a connection to a pipe, where pipe reads from the grep command. Grep can exlude the problematic lines. Use the pipe object as your connection in read.table.
Hello,
Try the following, readaing your file into 'x', using readLines.
tc <- textConnection("
TABLE NO. 1
COL1 COL2 COL3 COL4 COL5 COL6
COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00
0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00
0.0000E+00
TABLE NO. 1
COL1 COL2 COL3 COL4 COL5 COL6
COL7 COL8 COL9 COL10 COL11 COL12
1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00
0.0000E+00
1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00
1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00
0.0000E+00
1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00
1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00
0.0000E+00
")
x <- readLines(tc)
close(tc)
#------------------------ starts here
x <- x[ x != "" ]
i1 <- grep("TABLE", x)
i2 <- grep("COL", x)
y <- x[-c(i1, i2)]
tc <- textConnection(y)
dat <- read.table(tc)
close(tc)
cnames <- unlist(strsplit(x[2], " "))
names(dat) <- cnames[cnames != ""]
Hope this helps,
Rui Barradas
Em 18-10-2012 14:57, Fisher Dennis escreveu:
R 2.15.1 OS X Colleagues, I am reading a 1 GB file into R using read.table. The file consists of 100 tables, each of which is headed by two lines of characters. The first of these lines is: TABLE NO. 1 The second is a list of column headers. For example: TABLE NO. 1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00 Later something similar appears: TABLE NO. 1 COL1 COL2 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10 COL11 COL12 1.0010E+05 0.0000E+00 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 0.0000E+00 1.0010E+05 1.0001E+01 1.0000E+00 1.0000E+03 -1.0000E+00 1.0000E+00 2.2737E-14 -2.2737E-14 0.0000E+00 1.9281E-08 0.0000E+00 0.0000E+00 1.0010E+05 2.4000E+01 1.0000E+00 2.0000E+03 -1.0000E+00 1.0000E+00 5.7541E-15 -5.7541E-15 0.0000E+00 5.1115E-13 0.0000E+00 0.0000E+00 I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option) My word-around has been to: 1. read the table with readLines 2. remove the problematic lines 3. write the file to disk 4. read the file with read.table. However, this process is slow. I though about using "comment.char" as a means of avoiding reading the problematic lines. However, comment.char does not accept ="[A-Z]" Are there any clever workarounds for this? Dennis Dennis Fisher MD P < (The "P Less Than" Company) Phone: 1-866-PLessThan (1-866-753-7784) Fax: 1-866-PLessThan (1-866-753-7784) www.PLessThan.com
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
HI,
You can also try this.??
Should remove the problematic lines, but it may not match up to Rui's method in terms of speed.
input<-readLines(textConnection("TABLE NO.? 1
?COL1??????? COL2??????? COL3??????? COL4??????? COL5??????? COL6??????? COL7??????? COL8??????? COL9??????? COL10?????? COL11?????? COL12
? 1.0010E+05? 0.0000E+00? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 1.0001E+01? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 2.2737E-14 -2.2737E-14? 0.0000E+00? 1.9281E-08? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 2.4000E+01? 1.0000E+00? 2.0000E+03 -1.0000E+00? 1.0000E+00? 5.7541E-15 -5.7541E-15? 0.0000E+00? 5.1115E-13? 0.0000E+00? 0.0000E+00
TABLE NO.? 1
?COL1??????? COL2??????? COL3??????? COL4??????? COL5??????? COL6??????? COL7??????? COL8??????? COL9??????? COL10?????? COL11?????? COL12
? 1.0010E+05? 0.0000E+00? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 1.0001E+01? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 2.2737E-14 -2.2737E-14? 0.0000E+00? 1.9281E-08? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 2.4000E+01? 1.0000E+00? 2.0000E+03 -1.0000E+00? 1.0000E+00? 5.7541E-15 -5.7541E-15? 0.0000E+00? 5.1115E-13? 0.0000E+00? 0.0000E+00
"))
?dat1<-do.call(rbind,lapply(lapply(input[grep("\\d+",input)][grepl("[\\d+]",input[grep("\\d+",input)])],function(x) unlist(strsplit(x, split=" +"))),function(x) unlist(strsplit(x, split="+ "))))
dat2<-data.frame(dat1)
dat2[]<-sapply(dat2,function(x) as.numeric(as.character(x)))
names(dat2)<-lapply(strsplit(input[grepl("COL{0,1}",input)]," +"),function(x) unlist(strsplit(x,"+ ")))[[1]]
?dat2
#??? COL1?? COL2 COL3 COL4 COL5 COL6?????? COL7??????? COL8 COL9????? COL10
#1 100100? 0.000??? 1 1000?? -1??? 1 0.0000e+00? 0.0000e+00??? 0 0.0000e+00
#2 100100 10.001??? 1 1000?? -1??? 1 2.2737e-14 -2.2737e-14??? 0 1.9281e-08
#3 100100 24.000??? 1 2000?? -1??? 1 5.7541e-15 -5.7541e-15??? 0 5.1115e-13
#4 100100? 0.000??? 1 1000?? -1??? 1 0.0000e+00? 0.0000e+00??? 0 0.0000e+00
#5 100100 10.001??? 1 1000?? -1??? 1 2.2737e-14 -2.2737e-14??? 0 1.9281e-08
#6 100100 24.000??? 1 2000?? -1??? 1 5.7541e-15 -5.7541e-15??? 0 5.1115e-13
#? COL11 COL12
#1???? 0???? 0
#2???? 0???? 0
#3???? 0???? 0
#4???? 0???? 0
#5???? 0???? 0
#6???? 0???? 0
A.K.
----- Original Message -----
From: Fisher Dennis <fisher at plessthan.com>
To: r-help at stat.math.ethz.ch
Cc:
Sent: Thursday, October 18, 2012 9:57 AM
Subject: [R] speeding read.table
R 2.15.1
OS X
Colleagues,
I am reading a 1 GB file into R using read.table.? The file consists of 100 tables, each of which is headed by two lines of characters.?
The first of these lines is:
??? TABLE NO.? 1
The second is a list of column headers.
For example:
TABLE NO.? 1
COL1? ? ? ? COL2? ? ? ? COL3? ? ? ? COL4? ? ? ? COL5? ? ? ? COL6? ? ? ? COL7? ? ? ? COL8? ? ? ? COL9? ? ? ? COL10? ? ? COL11? ? ? COL12
? 1.0010E+05? 0.0000E+00? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 1.0001E+01? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 2.2737E-14 -2.2737E-14? 0.0000E+00? 1.9281E-08? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 2.4000E+01? 1.0000E+00? 2.0000E+03 -1.0000E+00? 1.0000E+00? 5.7541E-15 -5.7541E-15? 0.0000E+00? 5.1115E-13? 0.0000E+00? 0.0000E+00
Later something similar appears:
TABLE NO.? 1
COL1? ? ? ? COL2? ? ? ? COL3? ? ? ? COL4? ? ? ? COL5? ? ? ? COL6? ? ? ? COL7? ? ? ? COL8? ? ? ? COL9? ? ? ? COL10? ? ? COL11? ? ? COL12
? 1.0010E+05? 0.0000E+00? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 1.0001E+01? 1.0000E+00? 1.0000E+03 -1.0000E+00? 1.0000E+00? 2.2737E-14 -2.2737E-14? 0.0000E+00? 1.9281E-08? 0.0000E+00? 0.0000E+00
? 1.0010E+05? 2.4000E+01? 1.0000E+00? 2.0000E+03 -1.0000E+00? 1.0000E+00? 5.7541E-15 -5.7541E-15? 0.0000E+00? 5.1115E-13? 0.0000E+00? 0.0000E+00
I will use the term "problematic lines" to refer to the repeated occurrences of the two non-data lines
read.table is not successful in reading the table because of these problematic lines (I get around the first "TABLE NO." line using the skip option)
My word-around has been to:
??? 1.? read the table with readLines
??? 2.? remove the problematic lines
??? 3.? write the file to disk
??? 4.? read the file with read.table.
However, this process is slow.
I though about using "comment.char" as a means of avoiding reading the problematic lines.? However, comment.char does not accept ="[A-Z]"
Are there any clever workarounds for this?
Dennis
Dennis Fisher MD
P < (The "P Less Than" Company)
Phone: 1-866-PLessThan (1-866-753-7784)
Fax: 1-866-PLessThan (1-866-753-7784)
www.PLessThan.com
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
On Thu, Oct 18, 2012 at 10:14 AM, Fisher Dennis <fisher at plessthan.com> wrote:
Jason Are you suggesting grep in R or grep in the system? If the latter, this won't work because I need to implement this same procedure in Windows (sorry about not mentioning this), in which grep does not exist. If in R, the syntax is not obvious -- could you provide an example?
Windows does have find and findstr which are similar to grep. if (.Platform$OS.type == "windows") # use findstr else # use grep
From the Windows console (not from the R console):
help findstr
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com