Reading in large file in pieces

4 messages · Sean Davis, Ales Ziberna, Brian Ripley

#
I have a large file (millions of lines) and would like to read it in pieces.
The file is logically separated into little modules, but these modules do
not have a common size, so I have to scan the file to know where they are.
They are independent, so I don't have to read one at the end to interpret
one at the beginning.  Is there a way to read one line at a time and parse
it on the fly and do so quickly, or do I need to read say 100k lines at a
time and then work with those?  Only a small piece of each module will
remain in memory after parsing is completed on each module.

My direct question is:  Is there a fast way to parse one line at a time
looking for breaks between "modules", or am I better off taking large but
manageable chunks from the file and parsing that chunk all at once?

Thanks,
Sean
#
See ?scan
or maybe ?readLines
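
For the "large but manageable chunks" option in the question, a minimal sketch with readLines() on an open connection might look like the following. The helper name read_in_chunks, the callback process_chunk, and the chunk size are illustrative assumptions, not anything from the thread. Because the connection stays open, each call continues where the previous one stopped.

```r
# Hypothetical helper: read a text file in fixed-size chunks of lines.
# `process_chunk` is a user-supplied function; the chunk size is arbitrary.
read_in_chunks <- function(path, process_chunk, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))                    # close the connection even on error
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break       # nothing left to read
    process_chunk(lines)
  }
  invisible(NULL)
}

# Demo on a small temporary file: record how many lines each chunk held.
tmp <- tempfile()
writeLines(as.character(1:10), tmp)
chunk_sizes <- integer(0)
read_in_chunks(tmp,
               function(x) chunk_sizes <<- c(chunk_sizes, length(x)),
               chunk_size = 4L)
unlink(tmp)
```

Note that a chunk boundary can fall in the middle of a module, so a real process_chunk would have to carry the unfinished module over to the next chunk.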


----- Original Message ----- 
From: "Sean Davis" <sdavis2 at mail.nih.gov>
To: "r-help" <r-help at stat.math.ethz.ch>
Sent: Friday, December 23, 2005 12:08 AM
Subject: [R] Reading in large file in pieces



#
On Thu, 22 Dec 2005, Sean Davis wrote:

            
On any reasonable OS (you have not told us yours), it will make no
difference, as the file reads will be buffered.  That assumes, of course,
that you are doing something like opening a connection and calling
readLines(n=1) on it.
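
As a rough illustration of the approach described above (open a connection once, then call readLines() with n = 1 repeatedly), here is a sketch that keeps only a small summary of each module. The thread never says how modules are delimited, so the "//" separator line, the function name summarise_modules, and the choice of summary are all assumptions made for the example.

```r
# Assumption: a line consisting solely of "//" ends each module.
# The real file format is not given in the thread.
summarise_modules <- function(path, delimiter = "//") {
  con <- file(path, open = "r")
  on.exit(close(con))                    # close the connection even on error
  summaries <- list()
  current <- character(0)                # lines of the module being read

  # Keep only a small piece of a finished module: first line and line count.
  keep <- function(mod) list(first = mod[1L], n_lines = length(mod))

  repeat {
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) break        # end of file
    if (identical(line, delimiter)) {
      if (length(current) > 0L) {
        summaries[[length(summaries) + 1L]] <- keep(current)
      }
      current <- character(0)            # discard the parsed module
    } else {
      current <- c(current, line)
    }
  }
  # A final module without a closing delimiter is still recorded.
  if (length(current) > 0L) {
    summaries[[length(summaries) + 1L]] <- keep(current)
  }
  summaries
}

# Demo on a small temporary file containing three modules.
tmp <- tempfile()
writeLines(c("a1", "a2", "//", "b1", "//", "c1", "c2", "c3"), tmp)
res <- summarise_modules(tmp)
unlink(tmp)
```

Only one module's lines are held in memory at a time, which matches the requirement that just a small piece of each module survives parsing. Growing `current` with c() is fine for small modules; very long modules would call for a preallocated buffer.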
#
On 12/23/05 2:41 AM, "Prof Brian Ripley" <ripley at stats.ox.ac.uk> wrote:

            
Thanks.  That is indeed the answer, and you are correct that it is quite
fast on MacOS 10.4.4.  Most importantly, it reduces memory usage for my
program by roughly an order of magnitude.

Sean