Hi there,
I am having a similar problem reading in a large text file with around
550,000 observations, each with 10 to 100 lines of description. I am trying
to parse it in R, but I am having trouble with the size of the file: it
seems to slow down dramatically at some point. I would be happy for any
suggestions. Here is my code, which works fine when I run it on a subsample
of my dataset.
#Defining datasource
file <- "filename.txt"
#Creating placeholder for data and assigning column names
data <- data.frame(Id=NA)
#Starting by case = 0
case <- 0
#Opening a connection to data
input <- file(file, "rt")
#Going through cases
repeat {
  line <- readLines(input, n = 1)
  if (length(line) == 0) break
  if (length(grep("Id:", line)) != 0) {
    case <- case + 1
    data[case, ] <- NA
    split_line <- strsplit(line, "Id:")
    data[case, 1] <- as.numeric(split_line[[1]][2])
  }
}
#Closing connection
close(input)
#Saving dataframe
write.csv(data,'data.csv')
Kind regards,
Frederik
--
View this message in context: http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
Sent from the R help mailing list archive at Nabble.com.
Incremental ReadLines
8 messages · Freds, Frederik Lang, Mike Marchywka +1 more
----------------------------------------
Date: Wed, 13 Apr 2011 10:57:58 -0700
From: frederiklang at gmail.com
To: r-help at r-project.org
Subject: Re: [R] Incremental ReadLines

> I am having a similar problem reading in a large text file with around
> 550,000 observations, each with 10 to 100 lines of description. I am
> trying to parse it in R, but I am having trouble with the size of the
> file: it seems to slow down dramatically at some point. I would be happy
> for any suggestions. Here is my code, which works fine when I run it on
> a subsample of my dataset.

This probably occurs when you run out of physical memory, but you can
verify that by looking at the task manager. A "readline()" approach doesn't
fit R very well: you want to hand it blocks of data so that inner loops,
implemented largely in native code, can operate efficiently. What you want
is a data structure that can use the disk more effectively and hide those
details from you and the algorithm. This works best if the algorithm
cooperates with the data structure to avoid lots of disk thrashing. You
could imagine a "read" that does nothing until each item is needed, but
often people want the whole file validated before processing, and lots of
details come up with exception handling as you get fancy here. Note, of
course, that your parse output could be stored in a hash or something
representing a DOM, and this can get arbitrarily large. Since such a
structure is designed for random access, it may cause lots of thrashing if
it is partially on disk. Anything you can do to make access patterns more
regular, for example sorting your data, would help.
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
I have two suggestions to speed up your code, if you
must use a loop.
First, don't grow your output dataset at each iteration.
Instead of
cases <- 0
output <- numeric(cases)
while (length(line <- readLines(input, n = 1)) == 1) {
  cases <- cases + 1
  output[cases] <- as.numeric(line)
}
preallocate the output vector to be about the size of
its eventual length (slightly bigger is better), replacing
output <- numeric(0)
with the likes of
output <- numeric(500000)
and when you are done with the loop trim down the length
if it is too big
if (cases < length(output)) length(output) <- cases
Growing your dataset in a loop can cause quadratic or worse
growth in time with problem size and the above sort of
code should make the time grow linearly with problem size.
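To see the difference concretely, here is a small sketch (not from the
thread; the size n and the loop bodies are made up for illustration)
comparing the two approaches:

```r
# Toy comparison: growing a vector element by element versus
# preallocating it. Both functions compute the same result.
grow <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out[i] <- i   # vector is extended each iteration
  out
}
prealloc <- function(n) {
  out <- numeric(n)                   # full-length allocation up front
  for (i in seq_len(n)) out[i] <- i
  out
}
n <- 100000
system.time(grow(n))      # slows down badly as n increases
system.time(prealloc(n))  # time grows roughly linearly with n
```

Timing the two with system.time on increasing n should show the growing
version falling behind as the repeated reallocation kicks in.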
Second, don't do data.frame subscripting inside your loop.
Instead of

data <- data.frame(Id = numeric(cases))
while(...) {
  data[cases, 1] <- newValue
}

do

Id <- numeric(cases)
while(...) {
  Id[cases] <- newValue
}
data <- data.frame(Id = Id)
This is just the general principle that you don't want to
repeat the same operation over and over in a loop.
dataFrame[i,j] first extracts column j then extracts element
i from that column. Since the column is the same every iteration
you may as well extract the column outside of the loop.
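A sketch of the same point, with a hypothetical size n and made-up values,
that can be timed directly:

```r
# Toy comparison: assigning into a data.frame cell by cell versus
# filling a plain vector and building the data.frame at the end.
n <- 10000
newValues <- as.numeric(seq_len(n))

# data.frame subscripting inside the loop: column extracted every time
data <- data.frame(Id = numeric(n))
system.time(
  for (cases in seq_len(n)) data[cases, 1] <- newValues[cases]
)

# plain vector inside the loop, data.frame built once at the end
Id <- numeric(n)
system.time(
  for (cases in seq_len(n)) Id[cases] <- newValues[cases]
)
data2 <- data.frame(Id = Id)
stopifnot(identical(data$Id, data2$Id))  # same result, much less work
```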
Avoiding the loop altogether is the fastest. E.g., the code
you showed does the same thing as
idLines <- grep(value=TRUE, "Id:", readLines(file))
data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
You can also use an external process (perl or grep) to filter
out the lines that are not of interest.
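For example (a sketch, assuming a Unix-like system with grep on the PATH,
and the filename.txt from the original post), R can read from a pipe()
connection so that the 10 to 100 description lines per observation never
reach R at all:

```r
# Let an external grep discard the uninteresting lines before R sees them.
input <- pipe('grep "Id:" filename.txt', "rt")
idLines <- readLines(input)
close(input)
data <- data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
```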
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of Freds
Sent: Wednesday, April 13, 2011 10:58 AM
To: r-help at r-project.org
Subject: Re: [R] Incremental ReadLines
Hi there,
I am having a similar problem with reading in a large text
file with around
550.000 observations with each 10 to 100 lines of
description. I am trying
to parse it in R but I have troubles with the size of the
file. It seems
like it is slowing down dramatically at some point. I would
be happy for any
suggestions. Here is my code, which works fine when I am
doing a subsample
of my dataset.
#Defining datasource
file <- "filename.txt"
#Creating placeholder for data and assigning column names
data <- data.frame(Id=NA)
#Starting by case = 0
case <- 0
#Opening a connection to data
input <- file(file, "rt")
#Going through cases
repeat {
line <- readLines(input, n=1)
if (length(line)==0) break
if (length(grep("Id:",line)) != 0) {
case <- case + 1 ; data[case,] <-NA
split_line <- strsplit(line,"Id:")
data[case,1] <- as.numeric(split_line[[1]][2])
}
}
#Closing connection
close(input)
#Saving dataframe
write.csv(data,'data.csv')
Kind regards,
Frederik
--
View this message in context:
http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3
447859.html
Sent from the R help mailing list archive at Nabble.com.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110414/fa8cc662/attachment.pl>
________________________________
Date: Thu, 14 Apr 2011 11:57:40 -0400
Subject: Re: [R] Incremental ReadLines
From: frederiklang at gmail.com
To: marchywka at hotmail.com
CC: r-help at r-project.org

Hi Mike,

Thanks for your comment. I must admit that I am very new to R, and although
what you write sounds interesting, I have no idea where to start. Could you
give some functions or examples that show how it can be done?
I'm not sure I have a good R answer; I'm simply pointing out the likely
issue, and maybe the rest belongs on the R developer list or something. If
you can determine that you are running out of physical memory, then you
either need to partition something or make accesses more regular. My
favorite example from personal experience is sorting a data set prior to
piping it into a C++ program, which changed the execution time
substantially by avoiding VM thrashing. R either needs a swapping buffer or
has an equivalent that someone else could mention.
I was under the impression that I had to use a loop, since my blocks of
observations are of varying length.

Thanks again,
Frederik

On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka wrote:
[see below]

From: Frederik Lang [mailto:frederiklang at gmail.com]
Sent: Thursday, April 14, 2011 12:56 PM
To: William Dunlap
Cc: r-help at r-project.org
Subject: Re: [R] Incremental ReadLines

Hi Bill,

Thank you so much for your suggestions. I will try to alter my code.
Regarding the even shorter solution outside the loop: it looks good, but my
problem is that not all observations have the same variables, so three
different observations might look like this:

Id: 1
Var1: false
Var2: 6
Var3: 8
Id: 2
missing
Id: 3
Var1: true 3 4 5
Var2: 7
Var3: 3

Doing it without looping through, I thought my data had to be quite
systematic, which it is not. I might be wrong, though.

Doing the simple preallocation that I describe should speed it up a lot
with very little effort. It is more work to manipulate the columns one at a
time instead of using data.frame subscripting, and it may not be worth it
if you have lots of columns. If you have a lot of this sort of file and
feel that it is worth the programming time to do something fancier, here is
some code that reads lines of the form
cat(lines, sep="\n")
Id: First
Var1: false
Var2: 6
Var3: 8
Id: Second
Id: Last
Var1: true
Var3: 8

and produces a matrix with the Id's along the rows and the Var's along the
columns:

f(lines)
       Var1    Var2 Var3
First  "false" "6"  "8"
Second NA      NA   NA
Last   "true"  NA   "8"
The function f is:
f <- function (lines)
{
  # keep only lines with colons
  lines <- grep(value = TRUE, "^.+:", lines)
  lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
  isIdLine <- grepl("^Id:", lines)
  group <- cumsum(isIdLine)
  rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
  lines <- lines[!isIdLine]
  group <- group[!isIdLine]
  varname <- sub("[[:space:]]*:.*$", "", lines)
  value <- sub(".*:[[:space:]]*", "", lines)
  colnames <- unique(varname)
  col <- match(varname, colnames)
  retval <- array(NA_character_,
                  c(length(rownames), length(colnames)),
                  dimnames = list(rownames, colnames))
  retval[cbind(group, col)] <- value
  retval
}
The main trick is the matrix subscript given to retval on the
penultimate line.
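To illustrate that trick in isolation (a sketch with made-up values, not
part of the original message): a two-column integer matrix used as a
subscript addresses one cell per row, each (row, column) pair picking out a
single position.

```r
# Matrix subscripting: each row of the index matrix is a (row, col) pair.
retval <- array(NA_character_, c(2, 3),
                dimnames = list(c("First", "Second"),
                                c("Var1", "Var2", "Var3")))
retval[cbind(c(1, 1, 2), c(1, 2, 3))] <- c("false", "6", "true")
retval
# "First" gets Var1 = "false" and Var2 = "6"; "Second" gets Var3 = "true";
# every other cell stays NA.
```

This is why f() can fill in observations with differing sets of variables
in one vectorized assignment instead of a loop.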
Thanks again,
Frederik