loading multiple CSV files into a single data frame
6 messages · victor jimenez, Gabor Grothendieck, oliver +2 more
On Thu, May 3, 2012 at 2:07 PM, victor jimenez <betabandido at gmail.com> wrote:
Sometimes I have hundreds of CSV files scattered in a directory tree,
resulting from experiments' executions. For instance, giving an example
from my field, I may want to collect the performance of a processor for
several design parameters such as "cache size" (possible values: 2, 4, 8
and 16) and "cache associativity" (possible values: direct-mapped, 4-way,
fully-associative). The results of all these experiments will be stored in
a directory tree like:
results
 |-- direct-mapped
 |       |-- 2 -- data.csv
 |       |-- 4 -- data.csv
 |       |-- 8 -- data.csv
 |       |-- 16 -- data.csv
 |-- 4-way
 |       |-- 2 -- data.csv
 |       |-- 4 -- data.csv
...
 |-- fully-associative
 |       |-- 2 -- data.csv
 |       |-- 4 -- data.csv
...
I am developing a package that would allow me to gather all those CSV files
into a single data frame. Currently, I just need to execute the following
statement:
dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")
and this command returns a data frame containing the columns ASSOC, SIZE
and all the remaining columns inside the CSV files (in my case the
processor performance), effectively loading all the CSV files into a single
data frame. So, I would get something like:
ASSOC,          SIZE, PERF
direct-mapped,     2,  1.4
direct-mapped,     4,  1.6
direct-mapped,     8,  1.7
direct-mapped,    16,  1.7
4-way,             2,  1.4
4-way,             4,  1.5
...
I would like to ask whether there is any similar functionality already
implemented in R. If so, there is no need to reinvent the wheel :)
If it is not implemented and the R community believes that this feature
would be useful, I would be glad to contribute my code.
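For illustration only, here is a minimal sketch of how a path template such as "results/@ASSOC@/@SIZE@/data.csv" could be turned into a glob plus extra columns. It is not the package code mentioned above, which is not shown in this thread. The name gather_sketch is hypothetical, and the sketch assumes that each placeholder occupies one whole path component and that paths are separated by "/".

```r
# Hypothetical sketch; NOT the poster's actual gather() implementation.
# Assumes every @NAME@ placeholder is a complete path component.
gather_sketch <- function(pattern) {
  parts  <- strsplit(pattern, "/", fixed = TRUE)[[1]]
  is_key <- grepl("^@.+@$", parts)
  keys   <- gsub("@", "", parts[is_key], fixed = TRUE)

  # Replace each placeholder with "*" to obtain an ordinary glob
  files <- Sys.glob(paste(ifelse(is_key, "*", parts), collapse = "/"))

  pieces <- lapply(files, function(f) {
    # Path components at the placeholder positions become column values
    values <- strsplit(f, "/", fixed = TRUE)[[1]][is_key]
    df <- read.csv(f)
    df[keys] <- as.list(values)            # one constant column per placeholder
    df[c(keys, setdiff(names(df), keys))]  # placeholder columns first
  })
  do.call(rbind, pieces)
}
```

With this sketch the placeholder columns come back as character strings, so a numeric parameter such as SIZE would still need converting (for example with as.integer) before being used in computations.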
If your csv files all have the same columns and represent time series
then read.zoo in the zoo package can read multiple csv files in at
once using a single read.zoo command producing a single zoo object.
library(zoo)
?read.zoo
vignette("zoo-read")
Also see the other zoo vignettes and help files.
Statistics & Software Consulting GKX Group, GKX Associates Inc. tel: 1-877-GKX-GROUP email: ggrothendieck at gmail.com
On Thu, May 03, 2012 at 11:40:42PM +0200, victor jimenez wrote:
First of all, thank you for the answers. I did not know about zoo. However, it seems that none of the approaches does exactly what I want (please correct me if I am wrong). Probably my original question was not clear. The CSV files only contain the performance values. The other two columns (ASSOC and SIZE) are obtained from the existing values in the directory tree. So, in my opinion, none of the proposed solutions would work unless every single "data.csv" file contained all three columns (ASSOC, SIZE and PERF).
[...]
Maybe things would be clearer if you provided an example
with the tree and some example data, e.g. as a *.zip file.
As I understand your question, the values of some of your variables
are stored in the CSV files, and the values of the other variables
are given by the directory structure.
So you need to convert the structure of your directory
into values for your data frame.
You need a data frame that contains all the values that are of
interest to you.
Some of them are loaded from the CSV files and others are just picked
from the directory structure.
You just have to fill in the data from the CSV files into the data frame,
and the values/variables that are implicitly given via the directory structure,
you just set when importing.
Maybe just read in the CSV files and add the missing values.
So if the variable for the caching mechanism is
encoded as part of the path to the file, e.g. "direct-mapped",
then just set the cache value to "direct-mapped".
Ciao,
Oliver
P.S.: In my understanding this would rather be a topic for r-users than for r-devel,
because r-devel seems to be more focused on internals and
package stuff, while your problem is rather a user problem
(any R user needs some kind of "programming" to get things done).
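One possible reading of Oliver's suggestion is sketched below: start from a data frame of all parameter combinations of interest, then fill in the measured values from whichever CSV files exist. The function name read_grid and its arguments are hypothetical, and the directory layout root/ASSOC/SIZE/data.csv is taken from the original question.

```r
# Hypothetical sketch of "build the grid first, then fill in from the CSVs".
# Assumes the layout <root>/<ASSOC>/<SIZE>/data.csv from the original post.
read_grid <- function(root, assoc_values, size_values) {
  # All combinations of the design parameters
  grid <- expand.grid(ASSOC = assoc_values, SIZE = size_values,
                      stringsAsFactors = FALSE)

  pieces <- Map(function(assoc, size) {
    path <- file.path(root, assoc, size, "data.csv")
    if (!file.exists(path)) return(NULL)  # no experiment for this combination
    # The parameter values come from the grid, the rest from the file
    cbind(ASSOC = assoc, SIZE = size, read.csv(path))
  }, grid$ASSOC, grid$SIZE)

  do.call(rbind, unname(pieces))
}
```

A call could look like read_grid("results", c("direct-mapped", "4-way", "fully-associative"), c(2, 4, 8, 16)). Unlike a glob, this keeps SIZE numeric and silently skips combinations for which no file is found.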
Victor,
I understand you as follows:
The first two columns of the desired combined data frame are the last two
directory levels of the pathname to the csv file.
The columns in all the data.csv files are the same; namely, there is only
one column, and it is named PERF.
If so, the following should work (on unix):
do.call(rbind, lapply(Sys.glob('results/*/*/data.csv'), function(path) {
  within(read.csv(path), {  # SIZE and ASSOC are taken from the path
    SIZE  <- basename(dirname(path))
    ASSOC <- basename(dirname(dirname(path)))
  })
}))
On 5/3/12 4:40 PM, "victor jimenez" <betabandido at gmail.com> wrote:
First of all, thank you for the answers. I did not know about zoo. However, it seems that none of the approaches does exactly what I want (please correct me if I am wrong). Probably my original question was not clear.

The CSV files only contain the performance values. The other two columns (ASSOC and SIZE) are obtained from the existing values in the directory tree. So, in my opinion, none of the proposed solutions would work unless every single "data.csv" file contained all three columns (ASSOC, SIZE and PERF).

In my case, my experimentation framework basically outputs a CSV with some values read from the processor's performance counters (PMCs). For each cache size and associativity I conduct an experiment, creating a CSV file and placing that file into its own directory. I could modify the experimentation framework so that it also outputs the cache size and associativity, but that may not be ideal in some circumstances, and I also have a significant amount of old results that I want to keep using without manually fixing the CSV files.

Has anyone else faced such a situation? Any good solutions?

Thank you,
Victor

On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
[...]
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On May 3, 2012, at 5:40 PM, victor jimenez wrote:
First of all, thank you for the answers. I did not know about zoo. However, it seems that none of the approaches does exactly what I want (please correct me if I am wrong). Probably my original question was not clear.

The CSV files only contain the performance values. The other two columns (ASSOC and SIZE) are obtained from the existing values in the directory tree. So, in my opinion, none of the proposed solutions would work unless every single "data.csv" file contained all three columns (ASSOC, SIZE and PERF).

In my case, my experimentation framework basically outputs a CSV with some values read from the processor's performance counters (PMCs). For each cache size and associativity I conduct an experiment, creating a CSV file and placing that file into its own directory. I could modify the experimentation framework so that it also outputs the cache size and associativity, but that may not be ideal in some circumstances, and I also have a significant amount of old results that I want to keep using without manually fixing the CSV files.
You don't need to touch the CSV files, simply add values at load time - this is all easily doable in one line ;)
do.call("rbind",lapply(Sys.glob("*/*/data.csv"),function(d) cbind(read.csv(d),as.data.frame(t(strsplit(d,"/")[[1]])))))
  A B V1 V2       V3
1 1 2  1  a data.csv
2 3 4  1  a data.csv
3 1 2  1  b data.csv
4 3 4  1  b data.csv
5 1 2  2  a data.csv
6 3 4  2  a data.csv
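The one-liner above labels the path components V1, V2 and V3 and keeps the file name as a column. A small variation, sketched here with a hypothetical helper read_with_path, names the directory levels and drops the file name. It assumes the last two directories of each path are the values wanted, and that the column names ASSOC and SIZE from the original question apply.

```r
# Hypothetical helper; a variation on the one-liner above, not a replacement for it.
read_with_path <- function(d, keys = c("ASSOC", "SIZE")) {
  parts <- strsplit(d, "/", fixed = TRUE)[[1]]
  # Drop the trailing file name, keep the last length(keys) directory names
  dirs <- tail(parts[-length(parts)], length(keys))
  # A one-row data frame of path values is recycled over the rows of the CSV
  cbind(read.csv(d), as.data.frame(as.list(setNames(dirs, keys))))
}

dframe <- do.call("rbind", lapply(Sys.glob("*/*/data.csv"), read_with_path))
```

The path values are appended after the columns read from the file, as in the output shown above.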