How long does skipping in read.table take
On Sat, Oct 23, 2010 at 10:52 AM, Dimitri Liakhovitski
<dimitri.liakhovitski at gmail.com> wrote:
Just tried it on my work computer (Windows XP, I only have 2 GB RAM):
I've run your code, just indicated the separator "|" in read.table (in
DF line) and added the actual processing (writing out of the result
with a file name) - see below.
I got:
Error in textConnection(x) : cannot allocate memory for text connection
Thanks again for helping!
Dimitri
### New code from Gabor:
k <- 1000000 # no of rows per chunk
first <- TRUE
con <- file('myfile.txt', "r")
count<-1
repeat {
?start<-Sys.time()
?print(start)
?flush.console()
?# skip header
?if (first) hdgs <- readLines(con, 1)
?first <- FALSE
?x <- readLines(con, k)
?if (length(x) == 0) break
?DF <- read.table(textConnection(x), header = FALSE,sep="|")
?# process chunk -- we just print last row here
?end<-Sys.time()
?print(end-start)
?print(names(DV))
?print(tail(DF, 1))
?flush.console()
?filename<-paste("Chunk of 1 Mil number ",count,".txt",sep="")
?write.table(DF,sep="\t",header=FALSE,file=filename)
?count<-count+1
}
close(con)
Try smaller chunks. Presumably R cannot handle chunks that large.
Also, you could use RSQLite or sqldf to set up a database and then
read from it. Again, don't use chunks larger than what R can handle.
Here is a self contained example that you can copy and paste into an R
session. It works on my Windows system but you might need to change
the eol if you are working on a different platform. Reading the file
into the database is the slowest part but once its there the rest
should be reasonably fast. Again be sure not to read such large
chunks at a time that R cannot handle them.
library(sqldf)
## create test file
numStr <- as.character(1:25)
DF <- data.frame(a = 1:25, 101:125)
write.table(DF, file = "myfile.csv", quote = FALSE, sep = ",",
row.names = FALSE)
## define connection with attributes
myfile <- file("myfile.csv")
attr(myfile, "file.format") <- list(header = TRUE, sep = ",", eol = "\r\n")
## create new sqlite database
sqldf("attach 'mydb' as new")
## read file into mytab table of mydb database
sqldf("create table mytab as select * from myfile", dbname = "mydb")
## check that its there
sqldf("select * from sqlite_master", dbname = "mydb")
sqldf("select count(*) from mytab", dbname = "mydb")
# Read in 5 lines after skipping 10 rows.
sqldf("select * from mytab limit 5 offset 10", dbname = "mydb")
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com