Use 'readLines' instead of 'read.table'. We want to read in the text
file and convert it into separate text files, each of which can then
be read in using 'read.table'. My solution assumes that you have used
readLines. Trying to do this with data frames gets messy. Keep it
simple and do it in two phases; this makes it easier to debug and to
see what is going on.
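To illustrate why the read.csv version stops with an error, here is a minimal sketch (the temporary file and its three lines are made up for the example) of how the two functions behave once an open connection has been read to the end. readLines returns a zero-length vector, which is what the 'length(input) == 0' test relies on; read.csv signals an error instead, so that test is never reached.

```r
# Hypothetical three-line file standing in for the real data
tmp <- tempfile(fileext = ".txt")
writeLines(c("APE!KKU!684!", "APE!VAL!!", "GG!KK!KK!"), tmp)

# readLines: once the connection is exhausted it returns character(0)
con <- file(tmp, "r")
first <- readLines(con, n = 10)   # the three lines
rest  <- readLines(con, n = 10)   # nothing left: length(rest) is 0
close(con)

# read.csv: the second call finds nothing to read and raises an error
con <- file(tmp, "r")
chunk <- read.csv(con, header = FALSE, sep = "!", nrows = 10)
msg <- tryCatch({
    read.csv(con, header = FALSE, sep = "!", nrows = 10)
    "no error"
}, error = function(e) conditionMessage(e))
close(con)
print(msg)
```

Note also that read.csv returns a data frame, so length(input) is the number of columns rather than the number of lines, and grepl cannot match the separator line against it.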
On Tue, Oct 18, 2011 at 8:57 AM, johannes rara <johannesraja at gmail.com> wrote:
Thanks Jim,
I tried to adapt this solution to my situation (a .txt file as input):
zz <- file("myfile.txt", "r")
fileNo <- 1  # used for file name
buffer <- NULL
repeat{
    input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
                      row.names=NULL, na.strings="")
    if (length(input) == 0) break  # done
    buffer <- c(buffer, input)
    # find separator
    repeat{
        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
        if (is.na(indx)) break  # not found yet; read more
        writeLines(buffer[1:(indx - 1L)]
            , sprintf("newFile%04d.txt", fileNo)
            )
        buffer <- buffer[-c(1:indx)]  # remove data
        fileNo <- fileNo + 1
    }
}
but it gives me an error
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  no lines available in input
Do you know a reason for this?
-J
2011/10/18 jim holtman <jholtman at gmail.com>:
Let's do it in two parts: first create all the separate files (and if
this is all you are after, we can stop here). You can change the
value of n in readLines to read in as many lines as you want; I set it
to 2 just for testing.
x <- textConnection("APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!")
fileNo <- 1  # used for file name
buffer <- NULL
repeat{
    input <- readLines(x, n = 100)
    if (length(input) == 0) break  # done
    buffer <- c(buffer, input)
    # find separator
    repeat{
        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
        if (is.na(indx)) break  # not found yet; read more
        # seq_len() keeps this correct even if the separator is the first line
        writeLines(buffer[seq_len(indx - 1L)]
            , sprintf("newFile%04d", fileNo)
            )
        buffer <- buffer[-seq_len(indx)]  # remove data and separator
        fileNo <- fileNo + 1
    }
}
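For the second phase, here is a minimal sketch of reading the pieces back one at a time with read.table. The helper name read_piece, the temporary directory, and the options (header = FALSE, fill = TRUE, colClasses = "character") are assumptions for illustration; reading everything as character is just a cautious default so that values such as 17122009 are not converted.

```r
# Hypothetical helper: read one split file; fields are separated by "!"
read_piece <- function(path) {
    read.table(path, sep = "!", header = FALSE, fill = TRUE,
               quote = "", comment.char = "",
               colClasses = "character", na.strings = "")
}

# Stand-in for one file produced by the splitting loop
dir_out <- tempfile("pieces")
dir.create(dir_out)
writeLines(c("APE!KKU!684!", "APE!VAL!!", "APE!TPVA!17122009!"),
           file.path(dir_out, "newFile0001"))

# Process the pieces one at a time so only one is in memory at once
pieces <- list.files(dir_out, pattern = "^newFile[0-9]{4}$",
                     full.names = TRUE)
for (p in pieces) {
    rec <- read_piece(p)
    # ... work on 'rec' here, keeping only what is needed ...
}
```

Because every line ends with "!", read.table sees an extra empty last column; it can be dropped after reading if it is not wanted.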
On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesraja at gmail.com> wrote:
I have a data set like this in one .txt file (cols separated by !):
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
It contains over 14 000 000 records. Because I run out of memory
when trying to handle this data in R, I'm trying to read it
sequentially, write it out to several .csv files (or .RData files),
and then read these into R one by one. One record in this data is
the block of lines between two GG!KK!KK! lines. I tried to implement
Jim Holtman's approach
(http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html) but the
problem is how to avoid cutting a record in the middle. I mean
that if I set nrows = 1000000, I don't know whether one record (between
the marks GG!KK!KK! and GG!KK!KK!) ends up split across two files. How
can I avoid that? My code so far:
zz <- file("myfile.txt", "r")
fileNo <- 1
repeat{
    gotError <- 1 # set to 2 if there is an error
    # catch the error if there is no more data
    tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
                               row.names=NULL, na.strings="", header=FALSE),
             error=function(x) gotError <<- 2)
    if (gotError == 2) break
    # save the intermediate data
    save(input, file=sprintf("file%03d.RData", fileNo))
    fileNo <- fileNo + 1
}
close(zz)