Good afternoon,

Today I was working on a practice problem. It was simple, and perhaps even realistic. It looked like this:

- Get a list of all the data files in a directory
- Load each file into a dataframe
- Merge them into a single data frame

Because all of the columns were the same, the simplest solution in my mind was to `Reduce' the vector of dataframes with a call to `merge'. That worked fine; I got what was expected. That is key, actually. It is literally a one-liner, and there will never be index or scoping errors with it.

Now with that in mind, what is the idiomatic way? Do people usually do something else because it is /faster/ (by some definition)?

Kind regards,

Grant Rettke | ACM, ASA, FSF, IEEE, SIAM
gcr at wisdomandwonder.com | http://www.wisdomandwonder.com/
"Wisdom begins in wonder." --Socrates
((λ (x) (x x)) (λ (x) (x x)))
"Life has become immeasurably better since I have been forced to stop taking it seriously." --Thompson
"Best" way to merge 300+ .5MB dataframes?
11 messages · Jeff Newmiller, John McKown, Thomas Adams +3 more
Just load the data frames into a list and give that list to rbind. It is way more efficient to be able to identify how big the final data frame is going to have to be at the beginning and preallocate the result memory than to incrementally allocate larger and larger data frames along the way using Reduce.
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil at dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
On Aug 10, 2014, at 11:51 AM, Grant Rettke wrote:
Good afternoon, Today I was working on a practice problem. It was simple, and perhaps even realistic. It looked like this:
- Get a list of all the data files in a directory
- Load each file into a dataframe
- Merge them into a single data frame
Something along these lines:
all <- do.call( rbind,
        lapply( list.files(path=getwd(), pattern="\\.csv$"),  # escape the dot and anchor at the end; a bare ".csv" matches any character before "csv", anywhere in the name
                read.csv) )

Possibly:

all <- sapply( list.files(path=getwd(), pattern="\\.csv$"),
               read.csv)
Untested since no reproducible example was offered. This skips the task of individually assigning names to the input dataframes. There are quite a few variations on this in the Archives. You should learn to search them. Rseek.org or MarkMail are effective for me.
http://www.rseek.org/
http://markmail.org/search/?q=list%3Aorg.r-project.r-help
Because all of the columns were the same, the simplest solution in my mind was to `Reduce' the vector of dataframes with a call to `merge'. That worked fine, I got what was expected. That is key actually. It is literally a one-liner, and there will never be index or scoping errors with it.
You might have forced `merge` to work with the correct choice of arguments, but it would have silently eliminated duplicate rows. It seems unlikely to me that it would be efficient for the purpose of just stacking dataframe values.
merge( data.frame(a=1, b=2), data.frame(a=3, b=4) )
[1] a b
<0 rows> (or 0-length row.names)

merge( data.frame(a=1, b=2), data.frame(a=3, b=4), all=TRUE)
  a b
1 1 2
2 3 4

merge( data.frame(a=1, b=2), data.frame(a=1, b=2) )
  a b
1 1 2

rbind( data.frame(a=1, b=2), data.frame(a=1, b=2) )
  a b
1 1 2
2 1 2
Now with that in mind, what is the idiomatic way? Do people usually do something else because it is /faster/ (by some definition)? Kind regards,
David Winsemius Alameda, CA, USA
Err... sorry... you have to use do.call with base rbind as David illustrates. I am spoiled by rbind.fill from the plyr package. rbind.fill accepts the list directly and also fills in any missing columns with NA, which avoids having to dig through all the files to find any oddballs.
On Sun, Aug 10, 2014 at 1:51 PM, Grant Rettke <gcr at wisdomandwonder.com> wrote:
Good afternoon, Today I was working on a practice problem. It was simple, and perhaps even realistic. It looked like this:
- Get a list of all the data files in a directory
OK, I assume this results in a vector of file names in a variable, like you'd get from list.files();
- Load each file into a dataframe
Why? Do you need them in separate data frames?
- Merge them into a single data frame
The meat of the question. If you don't need the files in separate data
frames, and the files do _NOT_ have headers, then I would just load
them all into a single frame. I used Linux and so my solution may not
work on Windows. Something like:
list_of_files <- list.files(pattern=".*data$")  # list of data files
#
# one shell command listing the contents of all files to stdout
# (collapse the vector first: paste('cat', list_of_files) alone would
# produce one command string per file, not a single command)
command <- pipe(paste('cat', paste(list_of_files, collapse=' ')))
read.table(command, header=FALSE)
I would guess that Windows has something equivalent to cat; is it
"type"? I have a vague memory of that.
The above will work with header=TRUE, but the headers in the second
and subsequent files are taken as data. And if you have row.names in
the data, such as write.csv() does, then this is really not for you.
Well, at least it would not be as simple. There are ways around it
using a more intelligent "copy" program than "cat", such as AWK. If
you need an AWK example, I can fake one up. It would strip the headers
from the 2nd and subsequent files and remove the first-column
"row.names" values. Not really all that difficult, but "fiddly".
Because all of the columns were the same, the simplest solution in my mind was to `Reduce' the vector of dataframes with a call to `merge'. That worked fine, I got what was expected. That is key actually. It is literally a one-liner, and there will never be index or scoping errors with it. Now with that in mind, what is the idiomatic way? Do people usually do something else because it is /faster/ (by some definition)? Kind regards,
There is nothing more pleasant than traveling and meeting new people! Genghis Khan Maranatha! <>< John McKown
On Sun, Aug 10, 2014 at 6:50 PM, John McKown
<john.archie.mckown at gmail.com> wrote:
OK, I assume this results in a vector of file names in a variable, like you'd get from list.files();
Yes.
Why? Do you need them in separate data frames?
I do not.
The meat of the question. If you don't need the files in separate data frames, and the files do _NOT_ have headers, then I would just load them all into a single frame. I used Linux and so my solution may not work on Windows. Something like:
Excellent point. All of the files do have the same header. I'm on OSX, so there must be a nice one-liner to concatenate all of the individual files, dropping the first line for all but the first. Danke!
On Mon, Aug 11, 2014 at 9:43 PM, Thomas Adams <tea3rd at gmail.com> wrote:
Grant,
Assuming all your filenames are something like file1.txt,
file2.txt, file3.txt... And using the Mac OSX terminal app (after you cd to
the directory where your files are located)...
This will strip off the 1st lines, that is, your header lines:
for file in *.txt; do
  sed -i '' '1d' "${file}"  # note the space before "${file}"; BSD sed on OS X also needs the empty '' backup suffix after -i
done
Then, do this:
cat *.txt > newfilename.txt
Doing both should only take a few seconds, depending on your file sizes.
Cheers!
Tom
Using sed hadn't occurred to me. I guess I'm just "awk-ward" <grin/>.
A slightly different way would be:
for file in *.txt; do
  sed '1d' "${file}"
done > newfilename.txt
that way the original files are not modified. But it strips out the
header on the 1st file as well. Not a big deal, but the read.table
will need to be changed to accommodate that. Also, it creates an
otherwise unnecessary intermediate file "newfilename.txt". To get the
1st file's header, the script could:
set -- *.txt                     # positional parameters = the data files
head -1 "$1" > newfilename.txt   # header from the first file (a bare "head -1" would read stdin)
# caution: newfilename.txt itself matches *.txt, so a non-.txt output name is safer
for file in *.txt; do
  sed '1d' "${file}"
done >> newfilename.txt
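A hedged variant of the same head-plus-loop idea, swapping sed '1d' for the POSIX idiom tail -n +2 ("start at line 2") and using an output name that does not match the *.txt glob, so the loop cannot re-read its own output. The file names are invented for the demo:

```shell
# two tiny demo files with identical header lines
printf 'h\n1\n' > a.txt
printf 'h\n2\n' > b.txt
set -- *.txt                    # positional parameters = the input files
head -n 1 "$1" > merged.out     # header from the first file only
for f in "$@"; do
  tail -n +2 "$f"               # every line after the header
done >> merged.out
cat merged.out
```

merged.out can then be read with read.table(header=TRUE) as usual.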
I really like having multiple answers to a given problem. Especially
since I have a poorly implemented version of "awk" on one of my
systems. It is the vendor's "awk" and conforms exactly to the POSIX
definition with no additions. So I don't have the FNR built-in
variable. Your implementation would work well on that system. Well, if
there were a version of R for it. It is a branded UNIX system which
was designed to be totally __and only__ POSIX compliant, with few
(maybe no) extensions at all. IOW, it stinks. No, it can't be
replaced. It is the z/OS system from IBM which is EBCDIC based and
runs on the "big iron" mainframe, system z.
There is nothing more pleasant than traveling and meeting new people! Genghis Khan Maranatha! <>< John McKown
On the Mac the awk equivalent is gawk. Within R you would use `system()`, possibly with paste0() to construct the command string to send.
David Winsemius Alameda, CA, USA
On 12/08/2014 07:07, David Winsemius wrote:
On the Mac the awk equivalent is gawk. Within R you would use `system()` possibly using paste0() to construct a string to send.
For historical reasons this is actually part of R's configuration: see the AWK entry in R_HOME/etc/Makeconf. (There is an SED entry too: not all sed's in current OSes are POSIX-compliant.) Using system2() rather than system() is recommended for new code.
Brian D. Ripley, ripley at stats.ox.ac.uk Emeritus Professor of Applied Statistics, University of Oxford 1 South Parks Road, Oxford OX1 3TG, UK
Thank you all kindly.

Grant Rettke | ACM, AMA, COG, IEEE
grettke at acm.org | http://www.wisdomandwonder.com/
"Wisdom begins in wonder." --Socrates
((λ (x) (x x)) (λ (x) (x x)))
"Life has become immeasurably better since I have been forced to stop taking it seriously." --Thompson