Incremental ReadLines
I have two suggestions to speed up your code, if you
must use a loop.
First, don't grow your output dataset at each iteration.
Instead of
cases <- 0
output <- numeric(cases)
while(length(line <- readLines(input, n=1))==1) {
   cases <- cases + 1
   output[cases] <- as.numeric(line)
}
preallocate the output vector to about its eventual
length (slightly bigger is better), replacing the zero-length
output <- numeric(cases)
with the likes of
output <- numeric(500000)
and when you are done with the loop trim down the length
if it is too big
if (cases < length(output)) length(output) <- cases
Growing your dataset in a loop can cause quadratic or worse
growth in time with problem size and the above sort of
code should make the time grow linearly with problem size.
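As a minimal sketch of the preallocate-then-trim pattern, assuming the input holds one number per line: the textConnection below stands in for the real file connection, and the sample values and the capacity of 10 are made up for illustration.

```r
# Stand-in for file(file, "rt"); the sample values are invented for illustration.
input <- textConnection(c("3.5", "7", "42"))

cases <- 0
output <- numeric(10)   # assumed upper bound on the number of lines

while (length(line <- readLines(input, n = 1)) == 1) {
  cases <- cases + 1
  # Fills an existing slot, so the vector is not regrown while cases <= 10.
  output[cases] <- as.numeric(line)
}
close(input)

# Trim the unused tail once, after the loop.
if (cases < length(output)) length(output) <- cases
```

If the guess at the final length turns out too small, the assignment still works, but the vector starts growing again from that point on.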
Second, don't do data.frame subscripting inside your loop.
Instead of
data <- data.frame(Id=numeric(cases))
while(...) {
   data[cases, 1] <- newValue
}
do
Id <- numeric(cases)
while(...) {
   Id[cases] <- newValue
}
data <- data.frame(Id = Id)
This is just the general principle that you don't want to
repeat the same operation over and over in a loop.
dataFrame[i,j] first extracts column j then extracts element
i from that column. Since the column is the same every iteration
you may as well extract the column outside of the loop.
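A small self-contained comparison of the two patterns might look like the following. The for loops, the size n and the values in newValues are assumptions chosen only to make the sketch runnable; they are not taken from the original code.

```r
n <- 1000
newValues <- seq_len(n) * 2   # hypothetical values to store

# Slow pattern: one data.frame subscript assignment per iteration.
slow <- data.frame(Id = numeric(n))
for (i in seq_len(n)) {
  slow[i, 1] <- newValues[i]
}

# Faster pattern: fill a plain vector, then build the data.frame once.
Id <- numeric(n)
for (i in seq_len(n)) {
  Id[i] <- newValues[i]
}
fast <- data.frame(Id = Id)

identical(slow$Id, fast$Id)   # TRUE: same contents, different cost
```

Wrapping each loop in system.time() is one way to see the difference on your own machine.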
Avoiding the loop altogether is the fastest. E.g., the code
you showed does the same thing as
idLines <- grep(value=TRUE, "Id:", readLines(file))
data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
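To see what those two lines do, here is the same idea applied to a few made-up lines in place of readLines(file); the sample text is purely illustrative.

```r
# Made-up lines standing in for readLines(file).
lines <- c("Record start", "Id: 101", "some description",
           "  Id:   202", "more text")

# Keep only the lines containing "Id:" ...
idLines <- grep("Id:", lines, value = TRUE)

# ... then strip everything up to and including "Id:" plus any whitespace.
data <- data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
data$Id   # 101 202
```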
You can also use an external process (perl or grep) to filter
out the lines that are not of interest.
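One way to sketch the external-filter idea is with pipe(), assuming a Unix-like system with grep on the PATH. The temporary file and its contents are invented so that the example is self-contained.

```r
# Write a small made-up file so the sketch can run on its own.
path <- tempfile(fileext = ".txt")
writeLines(c("Record start", "Id: 101", "description", "Id: 202"), path)

# Let grep do the filtering; R only ever sees the matching lines.
con <- pipe(paste("grep 'Id:'", shQuote(path)))
idLines <- readLines(con)
close(con)

Id <- as.numeric(sub("^.*Id:[[:space:]]*", "", idLines))
unlink(path)
```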
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Freds
Sent: Wednesday, April 13, 2011 10:58 AM
To: r-help at r-project.org
Subject: Re: [R] Incremental ReadLines
Hi there,
I am having a similar problem reading in a large text file with around
550,000 observations, each with 10 to 100 lines of description. I am
trying to parse it in R but I am having trouble with the size of the
file. It seems to slow down dramatically at some point. I would be
happy for any suggestions. Here is my code, which works fine when I
run it on a subsample of my dataset.
#Defining datasource
file <- "filename.txt"
#Creating placeholder for data and assigning column names
data <- data.frame(Id=NA)
#Starting by case = 0
case <- 0
#Opening a connection to data
input <- file(file, "rt")
#Going through cases
repeat {
   line <- readLines(input, n=1)
   if (length(line)==0) break
   if (length(grep("Id:",line)) != 0) {
      case <- case + 1 ; data[case,] <-NA
      split_line <- strsplit(line,"Id:")
      data[case,1] <- as.numeric(split_line[[1]][2])
   }
}
#Closing connection
close(input)
#Saving dataframe
write.csv(data,'data.csv')
Kind regards,
Frederik