Message-ID: <CALNVJ7N9veMRamX4eU-ALFJacK_s0=yXHEXbGK6U5QE7k75bNw@mail.gmail.com>
Date: 2011-10-18T18:06:12Z
From: johannes rara
Subject: How to read data sequentially into R (line by line)?
In-Reply-To: <CAAxdm-6_fuGRpXgz1V0UqF6UT=CV2OymiZqx3auE-6SFNMA0LA@mail.gmail.com>

Thank you, Jim, for your kind reply. My intention was to split one
14M-line file into fewer than 15 text files, each of them having ~1M
lines. The idea was to make sure that one "sequence"

GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end

is not split across those files, so that, e.g., the end
of the first file (containing ~1M lines) has
...
GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
--no sequence end here!

and the beginning of the second file has

--no sequence start here!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end
...
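
A minimal sketch of the splitting described above, not a tested solution for the real file. It reads a block of lines with readLines, looks for the last "GG!KK!KK!" line in the block, writes everything up to and including that line, and carries the remaining incomplete lines over to the next block. The names split_on_records, chunk_lines, sep_pattern and out_fmt are made up for illustration, and the sketch assumes that a "GG!KK!KK!" line closes a record, so a cut directly after it never falls inside a record.

```r
# Hypothetical helper: split `infile` into pieces of roughly `chunk_lines`
# lines each, cutting only directly after a separator line.
split_on_records <- function(infile, chunk_lines = 1000000L,
                             sep_pattern = "^GG!KK!KK!",
                             out_fmt = "part%02d.txt") {
  con <- file(infile, "r")
  on.exit(close(con))
  carry <- character(0)    # incomplete record left over from the previous block
  file_no <- 1L
  written <- character(0)  # names of the files created so far
  repeat {
    input <- readLines(con, n = chunk_lines)
    if (length(input) == 0L) break         # end of input
    buffer <- c(carry, input)
    breaks <- grep(sep_pattern, buffer)    # positions of all separator lines
    if (length(breaks) == 0L) {            # no complete record yet; read more
      carry <- buffer
      next
    }
    last <- breaks[length(breaks)]         # last separator in this block
    out <- sprintf(out_fmt, file_no)
    writeLines(buffer[seq_len(last)], out) # whole records only
    written <- c(written, out)
    carry <- buffer[-seq_len(last)]        # start of the next, unfinished record
    file_no <- file_no + 1L
  }
  if (length(carry) > 0L) {                # trailing lines with no closing separator
    out <- sprintf(out_fmt, file_no)
    writeLines(carry, out)
    written <- c(written, out)
  }
  written
}
```

Called as split_on_records("myfile.txt"), this should give files of about 1M lines each (a little more or less, depending on where the last separator falls in each block). Each file can then be read separately, for example with read.table and sep = "!".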

-J

2011/10/18 jim holtman <jholtman at gmail.com>:
> I thought that you wanted a separate file for each of the breaks
> "GG!KK!KK!". If you want to read in some large number of lines and
> then break them so that each file has about that many lines, you can
> do the same thing, except scanning from the back for a break. So if
> your input file has 14M breaks in it, then the code I sent would create
> that many files. If you want a minimum number of lines per file,
> including the breaks, then it can be done. You just have to be clearer
> on exactly what the requirements are. From your sample data, it looks
> like there were 7 text lines per record, so if your input was 14M
> lines, I would expect that you would have something in the
> neighborhood of 1.8M files with 7 lines each. If you had 14M lines in
> the file and you were generating 14M files, then there is something
> wrong with your code: it is not recognizing the breaks. How many lines
> did each file have in it?
>
> On Tue, Oct 18, 2011 at 9:36 AM, johannes rara <johannesraja at gmail.com> wrote:
>> Thanks Jim for your help. I tried this code using readLines and it
>> works, but not in the way I wanted. It seems that this code is trying
>> to separate all records from a text file so that I'm getting over
>> 14 000 000 text files. My intention is to get only 15 text files, all
>> except one containing 1 000 000 rows, so that the record which is on
>> the breakpoint (near line 1 000 000) is not cut in the "middle"...
>>
>> -J
>>
>> 2011/10/18 jim holtman <jholtman at gmail.com>:
>>> Use 'readLines' instead of 'read.table'. We want to read in the text
>>> file and convert it into separate text files, each of which can then
>>> be read in using 'read.table'. My solution assumes that you have used
>>> readLines. Trying to do this with data frames gets messy. Keep it
>>> simple and do it in two phases; makes it easier to debug and to see
>>> what is going on.
>>>
>>>
>>>
>>> On Tue, Oct 18, 2011 at 8:57 AM, johannes rara <johannesraja at gmail.com> wrote:
>>>> Thanks Jim,
>>>>
>>>> I tried to adapt this solution to my situation (.txt file as an input):
>>>>
>>>> zz <- file("myfile.txt", "r")
>>>>
>>>> fileNo <- 1  # used for file name
>>>> buffer <- NULL
>>>> repeat{
>>>>   input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>> row.names=NULL, na.strings="")
>>>>   if (length(input) == 0) break  # done
>>>>   buffer <- c(buffer, input)
>>>>   # find separator
>>>>   repeat{
>>>>       indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>>       if (is.na(indx)) break  # not found yet; read more
>>>>       writeLines(buffer[1:(indx - 1L)]
>>>>           , sprintf("newFile%04d.txt", fileNo)
>>>>           )
>>>>       buffer <- buffer[-c(1:indx)]  # remove data
>>>>       fileNo <- fileNo + 1
>>>>   }
>>>> }
>>>>
>>>> but it gives me an error
>>>>
>>>> Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
>>>>   no lines available in input
>>>>>
>>>>
>>>> Do you know a reason for this?
>>>>
>>>> -J
>>>>
>>>> 2011/10/18 jim holtman <jholtman at gmail.com>:
>>>>> Let's do it in two parts: first create all the separate files (and
>>>>> if this is what you are after, we can stop here). You can change the
>>>>> value on readLines to read in as many lines as you want; I set it to 2
>>>>> just for testing.
>>>>>
>>>>> x <- textConnection("APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!
>>>>> APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!
>>>>> APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!")
>>>>>
>>>>> fileNo <- 1  # used for file name
>>>>> buffer <- NULL
>>>>> repeat{
>>>>>    input <- readLines(x, n = 100)
>>>>>    if (length(input) == 0) break  # done
>>>>>    buffer <- c(buffer, input)
>>>>>    # find separator
>>>>>    repeat{
>>>>>        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>>>        if (is.na(indx)) break  # not found yet; read more
>>>>>        writeLines(buffer[1:(indx - 1L)]
>>>>>            , sprintf("newFile%04d", fileNo)
>>>>>            )
>>>>>        buffer <- buffer[-c(1:indx)]  # remove data
>>>>>        fileNo <- fileNo + 1
>>>>>    }
>>>>> }
>>>>>
>>>>>
>>>>> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesraja at gmail.com> wrote:
>>>>>> I have a data set like this in one .txt file (cols separated by !):
>>>>>>
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>>
>>>>>> it contains over 14 000 000 records. Now because I'm out of memory
>>>>>> when trying to handle this data in R, I'm trying to read it
>>>>>> sequentially and write it out in several .csv files (or .RData files)
>>>>>> and then read these into R one-by-one. One record in this data is
>>>>>> between lines GG!KK!KK! and GG!KK!KK!. I tried to implement Jim
>>>>>> Holtman's approach
>>>>>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html) but the
>>>>>> problem is how to avoid cutting one record in the middle? I mean
>>>>>> that if I put nrows = 1000000, I don't know whether one record
>>>>>> (between marks GG!KK!KK! and GG!KK!KK!) ends up in two files. How
>>>>>> to avoid that? My code so far:
>>>>>>
>>>>>> zz <- file("myfile.txt", "r")
>>>>>> fileNo <- 1
>>>>>> repeat{
>>>>>>
>>>>>>    gotError <- 1 # set to 2 if there is an error
>>>>>>    # catch the error if no more data
>>>>>>    tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>>> row.names=NULL, na.strings="", header=FALSE),
>>>>>>              error=function(x) gotError <<- 2)
>>>>>>
>>>>>>    if (gotError == 2) break
>>>>>>    # save the intermediate data
>>>>>>    save(input, file=sprintf("file%03d.RData", fileNo))
>>>>>>    fileNo <- fileNo + 1
>>>>>> }
>>>>>> close(zz)
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Holtman
>>>>> Data Munger Guru
>>>>>
>>>>> What is the problem that you are trying to solve?
>>>>>
>>>>