Incremental ReadLines
----------------------------------------
Date: Wed, 13 Apr 2011 10:57:58 -0700
From: frederiklang at gmail.com
To: r-help at r-project.org
Subject: Re: [R] Incremental ReadLines

Hi there,

I am having a similar problem reading in a large text file with around 550,000 observations, each with 10 to 100 lines of description. I am trying to parse it in R, but I am having trouble with the size of the file. It seems to slow down dramatically at some point. I would be happy for any
This probably happens when you run out of physical memory; you can verify that by watching Task Manager. A "readline()" method would not fit very well with R, where you try to hand over blocks of data so that inner loops, implemented largely in native code, can operate efficiently. What you want is a data structure that uses disk more effectively and hides these details from you and from the algorithm. This works best if the algorithm cooperates with the data structure to avoid lots of disk thrashing.

You could imagine a "read" that does nothing until each item is needed, but people often want the whole file validated before processing, and many details come up with exception handling once you get fancy here. Note also that your parse output could be stored in a hash or something representing a DOM, and this could get arbitrarily large. Since such a structure is designed for random access, it may cause lots of thrashing if it is partially on disk. Anything you can do to make access patterns more regular, for example sorting your data, would help.
suggestions. Here is my code, which works fine when I run it on a subsample
of my dataset.
#Defining datasource
file <- "filename.txt"
#Creating placeholder for data and assigning column names
data <- data.frame(Id=NA)
#Starting by case = 0
case <- 0
#Opening a connection to data
input <- file(file, "rt")
#Going through cases
repeat {
  line <- readLines(input, n=1)
  if (length(line)==0) break
  if (length(grep("Id:",line)) != 0) {
    case <- case + 1
    data[case,] <- NA
    split_line <- strsplit(line,"Id:")
    data[case,1] <- as.numeric(split_line[[1]][2])
  }
}
#Closing connection
close(input)
#Saving dataframe
write.csv(data,'data.csv')
Kind regards,
Frederik
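
The advice above about handing R blocks of data rather than single lines could be applied to the posted loop roughly as follows. This is only a sketch under assumptions, not a tested fix for the original file: the function name read_ids and the chunk size of 10000 lines are made up for illustration, and it assumes each matching line contains "Id:" once, followed by a number. It reads many lines per call, lets grep() and sub() work on the whole block at once, and builds the data frame once at the end instead of growing it row by row.

```r
# Hypothetical helper (not from the original post): collect the numeric
# value following "Id:" while reading the file in blocks of lines.
read_ids <- function(path, chunk_size = 10000L) {
  con <- file(path, "rt")
  on.exit(close(con))            # close the connection even on error

  pieces <- list()               # one numeric vector per block
  k <- 0L
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break

    # Vectorised match over the whole block instead of one line at a time
    hits <- grep("Id:", lines, fixed = TRUE, value = TRUE)
    if (length(hits) > 0L) {
      k <- k + 1L
      # Keep the text after the first "Id:" and convert it to a number
      pieces[[k]] <- as.numeric(sub("^.*?Id:", "", hits, perl = TRUE))
    }
  }

  ids <- unlist(pieces)
  if (is.null(ids)) ids <- numeric(0)   # no matching lines at all
  data.frame(Id = ids)                  # built once, not grown per row
}
```

With that in place, something like write.csv(read_ids("filename.txt"), "data.csv") would stand in for the loop. Only one block of lines plus the extracted ids is held in memory at a time, so the chunk size can be tuned to the memory available.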
--
View this message in context: http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html