
"Best" way to merge 300+ .5MB dataframes?

11 messages · Jeff Newmiller, John McKown, Thomas Adams +3 more

#
Good afternoon,

Today I was working on a practice problem. It was simple, and perhaps
even realistic. It looked like this:

- Get a list of all the data files in a directory
- Load each file into a dataframe
- Merge them into a single data frame

Because all of the columns were the same, the simplest solution in my
mind was to `Reduce` the list of dataframes with a call to
`merge`. That worked fine; I got what was expected. That is key,
actually: it is literally a one-liner, and there will never be index
or scoping errors with it.

Now with that in mind, what is the idiomatic way? Do people usually do
something else because it is /faster/ (by some definition)?

Kind regards,

Grant Rettke | ACM, ASA, FSF, IEEE, SIAM
gcr at wisdomandwonder.com | http://www.wisdomandwonder.com/
"Wisdom begins in wonder." --Socrates
((λ (x) (x x)) (λ (x) (x x)))
"Life has become immeasurably better since I have been forced to stop
taking it seriously." --Thompson
#
Just load the data frames into a list and give that list to rbind. It is far more efficient to determine up front how big the final data frame will be and preallocate the result memory than to incrementally allocate larger and larger data frames along the way using Reduce.
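A minimal sketch of that approach on invented example files (using do.call, since base rbind takes the data frames as separate arguments):

```r
# Two small example files standing in for the 300+ real ones
d <- tempfile("csvs"); dir.create(d)
write.csv(data.frame(a = 1:2, b = 3:4), file.path(d, "one.csv"), row.names = FALSE)
write.csv(data.frame(a = 5:6, b = 7:8), file.path(d, "two.csv"), row.names = FALSE)

# Read every file into a list, then bind once: the final frame is
# allocated a single time, instead of the ever-larger intermediate
# copies that Reduce(merge, ...) churns through
files <- list.files(d, pattern = "\\.csv$", full.names = TRUE)
dfs <- lapply(files, read.csv)
combined <- do.call(rbind, dfs)
nrow(combined)  # 4
```

For files this size, data.table::rbindlist and plyr::rbind.fill are common alternatives that also take a list directly.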
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.
On August 10, 2014 11:51:22 AM PDT, Grant Rettke <gcr at wisdomandwonder.com> wrote:
#
On Aug 10, 2014, at 11:51 AM, Grant Rettke wrote:

Something along these lines:

all <- do.call( rbind, 
                lapply( list.files(path=getwd(), pattern="\\.csv$"), 
                        read.csv) )

Possibly:

all <- sapply( list.files(path=getwd(), pattern="\\.csv$"), 
               read.csv, simplify=FALSE )

Untested since no reproducible example was offered. This skips the task of individually assigning names to the input dataframes. There are quite a few variations on this in the Archives. You should learn to search them. Rseek.org or MarkMail are effective for me.

http://www.rseek.org/

http://markmail.org/search/?q=list%3Aorg.r-project.r-help
You might have forced `merge` to work with the correct choice of arguments, but it would have silently eliminated duplicate rows. It seems unlikely to me that it would be efficient for the purpose of just stacking dataframe values:
> merge(data.frame(a=1, b=2), data.frame(a=3, b=4))
[1] a b
<0 rows> (or 0-length row.names)
> rbind(data.frame(a=1, b=2), data.frame(a=3, b=4))
  a b
1 1 2
2 3 4
> merge(data.frame(a=1, b=2), data.frame(a=1, b=2))
  a b
1 1 2
> rbind(data.frame(a=1, b=2), data.frame(a=1, b=2))
  a b
1 1 2
2 1 2

#
Err... sorry... you have to use do.call with base rbind as David illustrates. I am spoiled by rbind.fill from the plyr package. rbind.fill accepts the list directly and also fills in any missing columns with NA, which avoids having to dig through all the files to find any oddballs.
On August 10, 2014 2:22:06 PM PDT, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
#
On Sun, Aug 10, 2014 at 1:51 PM, Grant Rettke <gcr at wisdomandwonder.com> wrote:
OK, I assume this results in a vector of file names in a variable,
like you'd get from list.files();
Why? Do you need them in separate data frames?
The meat of the question. If you don't need the files in separate data
frames, and the files do _NOT_ have headers, then I would just load
them all into a single frame. I use Linux, so my solution may not
work on Windows. Something like:

list_of_files <- list.files(pattern=".*data$")  # list of data files
#
# command to stream the contents of all the files to stdout
# (collapse the file vector into a single shell command)
command <- pipe(paste("cat", paste(list_of_files, collapse=" ")))
all_data <- read.table(command, header=FALSE)

I would guess that Windows has something equivalent to cat; is it
"type"? I have a vague memory of that.

The above will work with header=TRUE, but the headers in the second
and subsequent files are taken as data. And if you have row.names in
the data, such as write.csv() does, then this is really not for you.
Well, at least it would not be as simple. There are ways around it
using a more intelligent "copy" program than "cat". Such as AWK. If
you need an AWK example, I can fake one up. It would strip the headers
from the 2nd and subsequent files and remove the first column
"row.names" values. Not really all that difficult, but "fiddly".

#
On Sun, Aug 10, 2014 at 6:50 PM, John McKown
<john.archie.mckown at gmail.com> wrote:

Yes.
I do not.
Excellent point. All of the files do have the same header. I'm on OS X,
so there must be a nice one-liner to concatenate all of the individual
files, dropping the first line for all but the first. Danke!
#
On Mon, Aug 11, 2014 at 9:43 PM, Thomas Adams <tea3rd at gmail.com> wrote:
Using sed hadn't occurred to me. I guess I'm just "awk-ward" <grin/>.
A slightly different way would be:

for file in *.txt; do
  sed '1d' "${file}"
done > newfilename.txt

that way the original files are not modified.  But it strips out the
header on the 1st file as well. Not a big deal, but the read.table
will need to be changed to accommodate that. Also, it creates an
otherwise unnecessary intermediate file "newfilename.txt". To get the
1st file's header, the script could:

set -- *.txt
head -1 "$1" > newfilename.txt
for file in *.txt; do
   sed '1d' "${file}"
done >> newfilename.txt

I really like having multiple answers to a given problem. Especially
since I have a poorly implemented version of "awk" on one of my
systems. It is the vendor's "awk" and conforms exactly to the POSIX
definition with no additions. So I don't have the FNR built-in
variable. Your implementation would work well on that system. Well, if
there were a version of R for it. It is a branded UNIX system which
was designed to be totally __and only__ POSIX compliant, with few
(maybe no) extensions at all. IOW, it stinks. No, it can't be
replaced. It is the z/OS system from IBM which is EBCDIC based and
runs on the "big iron" mainframe, system z.
#
On Aug 11, 2014, at 8:01 PM, John McKown wrote:

On the Mac, awk is available out of the box (gawk can be installed if you want the GNU extensions). Within R you would use `system()`, possibly with `paste0()` to construct the command string to send.
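A hypothetical sketch of that (the file names and output name are invented):

```r
# Stand-in input files in a scratch directory
d <- tempfile("catdir"); dir.create(d); owd <- setwd(d)
writeLines(c("a,b", "1,2"), "one.csv")
writeLines("3,4", "two.csv")

# Build the shell command as one string, then hand it to the shell
files <- list.files(pattern = "\\.csv$")
cmd <- paste0("cat ", paste(shQuote(files), collapse = " "), " > combined.csv")
system(cmd)
readLines("combined.csv")  # "a,b" "1,2" "3,4"
setwd(owd)
```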
#
On 12/08/2014 07:07, David Winsemius wrote:
For historical reasons this is actually part of R's configuration: see 
the AWK entry in R_HOME/etc/Makeconf. (There is an SED entry too: not 
all seds in current OSes are POSIX-compliant.)

Using system2() rather than system() is recommended for new code.
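With system2() the same concatenation separates the program from its arguments (sketch; names invented):

```r
# Stand-in input files in a scratch directory
d <- tempfile("cat2dir"); dir.create(d); owd <- setwd(d)
writeLines(c("a,b", "1,2"), "one.csv")
writeLines("3,4", "two.csv")

# system2() takes the command and an argument vector separately,
# and can capture stdout straight into a file
files <- list.files(pattern = "\\.csv$")
system2("cat", args = shQuote(files), stdout = "combined.csv")
setwd(owd)
```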
#
Thank you all kindly.
Grant Rettke | ACM, AMA, COG, IEEE
grettke at acm.org | http://www.wisdomandwonder.com/
"Wisdom begins in wonder." --Socrates
((λ (x) (x x)) (λ (x) (x x)))
"Life has become immeasurably better since I have been forced to stop
taking it seriously." --Thompson
On Tue, Aug 12, 2014 at 1:07 AM, David Winsemius <dwinsemius at comcast.net> wrote: