I have a data set like this in one .txt file (cols separated by !):

APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!

It contains over 14 000 000 records. Because I run out of memory when trying to handle this data in R, I'm trying to read it sequentially, write it out in several .csv files (or .RData files), and then read these into R one by one. One record in this data lies between the lines GG!KK!KK! and GG!KK!KK!. I tried to implement Jim Holtman's approach (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html), but the problem is how to avoid cutting one record in the middle. I mean that if I put nrows = 1000000, I don't know whether one record (between the marks GG!KK!KK! and GG!KK!KK!) ends up in two files. How can I avoid that?

My code so far:

zz <- file("myfile.txt", "r")
fileNo <- 1
repeat{
    gotError <- 1 # set to 2 if there is an error
    # catch the error if not more data
    tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
                               row.names=NULL, na.strings="", header=FALSE),
             error=function(x) gotError <<- 2)
    if (gotError == 2) break
    # save the intermediate data
    save(input, file=sprintf("file%03d.RData", fileNo))
    fileNo <- fileNo + 1
}
close(zz)
How to read data sequentially into R (line by line)?
8 messages · johannes rara, jim holtman
Let's do it in two parts: first create all the separate files (if
this is what you are after, we can stop here). You can change the
value of 'n' in readLines to read in as many lines as you want per
pass; the code below uses 100 (I had set it to 2 just for testing).
x <- textConnection("APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!")
fileNo <- 1  # used for file name
buffer <- NULL
repeat{
    input <- readLines(x, n = 100)
    if (length(input) == 0) break  # done
    buffer <- c(buffer, input)
    # find separator
    repeat{
        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
        if (is.na(indx)) break  # not found yet; read more
        # seq_len() selects nothing when the separator is the first line,
        # whereas 1:(indx - 1L) would select the separator itself
        writeLines(buffer[seq_len(indx - 1L)]
            , sprintf("newFile%04d", fileNo)
            )
        buffer <- buffer[-seq_len(indx)]  # remove data and its separator
        fileNo <- fileNo + 1
    }
}
close(x)
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Jim Holtman Data Munger Guru What is the problem that you are trying to solve?
Thanks Jim,
I tried to adapt this solution to my situation (a .txt file as input):
zz <- file("myfile.txt", "r")
fileNo <- 1 # used for file name
buffer <- NULL
repeat{
    input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
                      row.names=NULL, na.strings="")
    if (length(input) == 0) break # done
    buffer <- c(buffer, input)
    # find separator
    repeat{
        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
        if (is.na(indx)) break # not found yet; read more
        writeLines(buffer[1:(indx - 1L)]
            , sprintf("newFile%04d.txt", fileNo)
            )
        buffer <- buffer[-c(1:indx)] # remove data
        fileNo <- fileNo + 1
    }
}
but it gives me an error
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
no lines available in input
Do you know a reason for this?

-J
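One way this error can arise, sketched here under the assumption that the open connection has simply run out of lines: read.csv (through read.table) signals an error on an exhausted connection instead of returning an empty result, so a test such as length(input) == 0 is never reached. Note also that length() of a data frame is its number of columns, not rows. The two-line connection below is a made-up stand-in for the real file.

```r
# Toy stand-in for the open file connection: two lines only.
con <- textConnection(c("APE!KKU!684!", "GG!KK!KK!"))

# First call consumes both lines and returns a data frame.
first <- read.csv(con, sep = "!", header = FALSE, nrows = 2)

# Second call finds nothing left to read; capture the error message
# instead of letting it stop the script.
second <- tryCatch(
    read.csv(con, sep = "!", header = FALSE, nrows = 2),
    error = function(e) conditionMessage(e)
)
close(con)

print(dim(first))
print(second)
```

This is why the original loop wrapped read.csv in tryCatch, and why a loop built on readLines, which returns a zero-length vector at end of input, can use a plain length test.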
Use 'readLines' instead of 'read.table'. We want to read in the text file and convert it into separate text files, each of which can then be read in using 'read.table'. My solution assumes that you have used readLines. Trying to do this with data frames gets messy. Keep it simple and do it in two phases; makes it easier to debug and to see what is going on.
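The second phase might look like the following minimal sketch. It assumes a split file that holds one record from the sample data; the temporary file and the choice of colClasses = "character" are illustrative assumptions, not part of the original code.

```r
# Phase 1 stand-in: one record written to a temporary file (path is illustrative).
record <- c("APE!KKU!684!", "APE!VAL!!", "APE!UASU!!", "APE!PLA!1!",
            "APE!E!10!", "APE!TPVA!17122009!", "APE!STAP!1!")
path <- tempfile(fileext = ".txt")
writeLines(record, path)

# Phase 2: parse the small file. Every line ends in "!", so a fourth,
# empty column appears; reading all columns as character avoids type guessing.
rec <- read.table(path, sep = "!", header = FALSE, colClasses = "character",
                  na.strings = "", quote = "", comment.char = "")
unlink(path)

print(rec)
```

Keeping the parsing step separate from the splitting step means each small file can be inspected on its own if something looks wrong.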
Thanks Jim for your help. I tried this code using readLines and it works, but not in the way I wanted. It seems that this code separates every record in the text file, so I'm getting over 14 000 000 text files. My intention is to get only 15 text files, all except one containing about 1 000 000 rows, so that the record at the breakpoint (near line 1 000 000) is not cut in the "middle"...

-J
I thought that you wanted a separate file for each of the breaks "GG!KK!KK!". If you want to read in some large number of lines and then break them so that each file has about that many lines, you can do the same thing, except scanning from the back for a break.

So if your input file has 14M breaks in it, then the code I sent would create that many files. If you want a minimum number of lines per file, including the breaks, then it can be done. You just have to be clearer on exactly what the requirements are.

From your sample data, it looks like there were 7 text lines per record, so if your input was 14M lines, I would expect that you would have something in the neighborhood of 1.8M files with 7 lines each (7 text lines plus the break line is 8 lines per record, and 14M / 8 = 1.75M). If you had 14M lines in the file and you were generating 14M files, then something is wrong with your code: it is not recognizing the breaks. How many lines did each file have in it?
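Scanning from the back for a break could be sketched as below. This is only one possible reading of the idea: the function name split_at_breaks, the chunk file names, and the out_dir argument are hypothetical, and the break lines are kept in the output files.

```r
# Hypothetical helper: read 'chunk_lines' lines at a time and write each
# chunk only up to the LAST break line it contains, carrying the rest over.
split_at_breaks <- function(con, chunk_lines, out_dir,
                            sep_pattern = "^GG!KK!KK!") {
    buffer <- character(0)
    files <- character(0)
    write_chunk <- function(lines) {
        path <- file.path(out_dir, sprintf("chunk%03d.txt", length(files) + 1L))
        writeLines(lines, path)
        files <<- c(files, path)
    }
    repeat {
        input <- readLines(con, n = chunk_lines)
        if (length(input) == 0L) break        # input exhausted
        buffer <- c(buffer, input)
        breaks <- grep(sep_pattern, buffer)
        if (length(breaks) == 0L) next        # no complete record yet; read more
        last <- breaks[length(breaks)]        # scan from the back
        write_chunk(buffer[seq_len(last)])
        buffer <- buffer[-seq_len(last)]      # keep the incomplete tail
    }
    if (length(buffer) > 0L) write_chunk(buffer)  # trailing lines with no final break
    files
}

# Usage on the three-record sample (24 lines), 20 lines per read.
sample_lines <- rep(c("APE!KKU!684!", "APE!VAL!!", "APE!UASU!!", "APE!PLA!1!",
                      "APE!E!10!", "APE!TPVA!17122009!", "APE!STAP!1!",
                      "GG!KK!KK!"), 3)
con <- textConnection(sample_lines)
out_dir <- tempfile("chunks")
dir.create(out_dir)
files <- split_at_breaks(con, chunk_lines = 20, out_dir = out_dir)
close(con)
chunk_sizes <- vapply(files, function(f) length(readLines(f)), integer(1))
print(unname(chunk_sizes))
```

For a real file, the connection would be opened with file("myfile.txt", "r") so that successive readLines calls continue where the previous one stopped, and chunk_lines would be something like 1000000.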
Thank you Jim for your kind reply. My intention was to split one 14M file into fewer than 15 text files, each of them having ~1M lines. The idea was to make sure that one "sequence"

GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end

does not break into parts between those files, so that e.g. the end of the first file (containing ~1M lines) has

...
GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
--no sequence end here!

and the beginning of the second file

--no sequence start here!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end
...

-J
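Whether a given output file ends on a complete sequence can be checked with a small test like the one below. It is a sketch only: the helper name ends_on_break and the two toy files are made up for illustration.

```r
# Hypothetical helper: TRUE if the file's last line is the break marker.
ends_on_break <- function(path, sep_pattern = "^GG!KK!KK!") {
    lines <- readLines(path)
    length(lines) > 0L && grepl(sep_pattern, lines[length(lines)])
}

# Two toy files: one ends on a complete sequence, one is cut mid-sequence.
whole <- tempfile(fileext = ".txt")
cut_short <- tempfile(fileext = ".txt")
writeLines(c("APE!KKU!684!", "APE!STAP!1!", "GG!KK!KK!"), whole)
writeLines(c("APE!KKU!684!", "APE!VAL!!", "APE!UASU!!"), cut_short)

result <- c(whole = ends_on_break(whole), cut_short = ends_on_break(cut_short))
unlink(c(whole, cut_short))
print(result)
```

Running such a check over every split file is a cheap way to confirm that no sequence was divided between two files.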