
read multiple large files into one dataframe

5 messages · SYKES, Jennifer, Baptiste Auguie, Michael Lawrence +2 more

#
I'd first try plyr and see if it's efficient enough,
alternatively,
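The code that went with these two suggestions is not shown above. As a rough sketch only (not Baptiste's original code), the plyr route and a base R alternative might look something like this. The temporary directory, file names and column layout are made up for illustration, and the plyr call is skipped if the package is not installed:

```r
# Hypothetical setup: two small header-less files in a temporary directory
demo_dir <- tempfile("demo")
dir.create(demo_dir)
for (i in 1:2) {
  write.table(data.frame(x = 1:3, y = i * c(0.5, 1.5, 2.5)),
              file.path(demo_dir, sprintf("f%d.txt", i)),
              row.names = FALSE, col.names = FALSE)
}
files <- list.files(demo_dir, full.names = TRUE)

# Base R: read every file into a list, then stack the pieces with one rbind
combined <- do.call(rbind, lapply(files, read.table, header = FALSE))

# plyr: ldply() applies read.table to each file name and row-binds the results
if (requireNamespace("plyr", quietly = TRUE)) {
  combined_plyr <- plyr::ldply(files, read.table, header = FALSE)
}
```

Both versions keep every file's data in memory before stacking, so with 90 large files memory may still be the limiting factor.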
HTH,

baptiste
On 13 May 2009, at 12:45, SYKES, Jennifer wrote:

            
_____________________________

Baptiste Auguié

School of Physics
University of Exeter
Stocker Road,
Exeter, Devon,
EX4 4QL, UK

Phone: +44 1392 264187

http://newton.ex.ac.uk/research/emag
#
What types of data are in each file? All numbers, or a mix of numbers
and characters? Any missing data or special NA values?

On Wed, May 13, 2009 at 7:45 AM, SYKES, Jennifer
<Jennifer.SYKES at nats.co.uk> wrote:

  
    
#
can you provide reproducible code please?

even a fake example would help.

I would

1) set up a loop to read in each file from a directory
2) inside the loop, chop up or aggregate the data from each file in turn and 
write each new aggregated file out to a directory using write.table(). This 
reduces the memory needed, because you keep only the info you want. Make sure 
each file is a data frame with the same column names.
3) set up a new loop to read in each new small file and rbind them all 
together to make your new "master file".
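A minimal sketch of the three steps above, in case it helps. The directories, the column names (id, value) and the choice of a per-id mean as the aggregation are all invented for the example; substitute whatever reduction you actually need:

```r
# Hypothetical input: three header-less files with an id and a value column
in_dir  <- tempfile("raw");   dir.create(in_dir)
out_dir <- tempfile("small"); dir.create(out_dir)
for (i in 1:3) {
  raw <- data.frame(id = rep(c("a", "b"), each = 5), value = i * (1:10))
  write.table(raw, file.path(in_dir, paste0("file", i, ".txt")),
              row.names = FALSE, col.names = FALSE)
}

col_names <- c("id", "value")  # assumed layout, identical for every file

# Loop 1: read one file at a time, aggregate it, write the small result out
for (f in list.files(in_dir, full.names = TRUE)) {
  dat   <- read.table(f, header = FALSE, col.names = col_names)
  small <- aggregate(value ~ id, data = dat, FUN = mean)
  small$source <- basename(f)  # remember which file each row came from
  write.table(small, file.path(out_dir, basename(f)), row.names = FALSE)
}

# Loop 2: read the small files back in and rbind them into the master file
master <- NULL
for (f in list.files(out_dir, full.names = TRUE)) {
  master <- rbind(master, read.table(f, header = TRUE))
}
```

Only one raw file is held in memory at any time during the first loop, which is the point of splitting the job in two.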

The R gurus may have a more parsimonious solution.

HTH

Simon.


----- Original Message ----- 
From: "SYKES, Jennifer" <Jennifer.SYKES at nats.co.uk>
To: <r-help at r-project.org>
Sent: Wednesday, May 13, 2009 11:45 AM
Subject: [R] read multiple large files into one dataframe
#
A few points to consider:

- If all the data are numeric, then use matrices instead of data frames.

- With either data frames or matrices, there is no way (that I'm aware
of anyway) in R to stack them without making at least one copy in
memory.

- Since none of the files has a header row, I would concatenate them
into one file outside R (e.g., on *nix, cat * > all.txt) and then read
that in.  You can also try it inside R with something like
read.table(pipe()).  You will want to make use of the colClasses
argument in read.table() to specify the column types, though, to ensure
that read.table() only goes through the input once.

- You're probably better off getting the data into a database (even
something like sqlite) and using an R interface to that database.

- 30MB x 90 = 2.7GB.  Unless you're on a 64-bit machine with lots of
RAM, you're not likely to have much fun with the data even when you
manage to get it into R in one piece.
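For the concatenate-and-read point, something along these lines might work as a starting point. It is only a sketch: it assumes a *nix system with cat available, header-less files with two numeric columns, and made-up file names in a temporary directory:

```r
# Hypothetical setup: three header-less, all-numeric files
stack_dir <- tempfile("stack")
dir.create(stack_dir)
for (i in 1:3) {
  write.table(matrix(i + (1:6) / 10, ncol = 2),
              file.path(stack_dir, sprintf("part%02d.txt", i)),
              row.names = FALSE, col.names = FALSE)
}
files <- list.files(stack_dir, pattern = "^part", full.names = TRUE)

# Let cat do the concatenation and read its output through a pipe
# (assumes a *nix shell; shQuote() protects paths containing spaces)
cmd <- paste("cat", paste(shQuote(files), collapse = " "))

# colClasses states the column types up front, so read.table() does not
# have to work them out from the data
stacked <- read.table(pipe(cmd), header = FALSE,
                      colClasses = c("numeric", "numeric"))

# All-numeric data can then be held as a matrix instead of a data frame
stacked_mat <- as.matrix(stacked)
```

read.table() opens the pipe itself and closes it again when it is done, so no explicit close() is needed here.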

Andy

From: SYKES, Jennifer