loading multiple CSV files into a single data frame

Sometimes I have hundreds of CSV files scattered across a directory tree,
produced by experiment runs. To give an example from my field, I may want
to collect the performance of a processor for several design parameters
such as "cache size" (possible values: 2, 4, 8 and 16) and "cache
associativity" (possible values: direct-mapped, 4-way, fully-associative).
The results of all these experiments are stored in a directory tree like:

results
 |-- direct-mapped
 |       |-- 2 -- data.csv
 |       |-- 4 -- data.csv
 |       |-- 8 -- data.csv
 |       |-- 16 -- data.csv
 |-- 4-way
 |       |-- 2 -- data.csv
 |       |-- 4 -- data.csv
...
 |-- fully-associative
 |       |-- 2 -- data.csv
 |       |-- 4 -- data.csv
...

I am developing a package that would allow me to gather all those CSV files
into a single data frame. Currently, I just need to execute the following
statement:

dframe <- gather("results/@ASSOC@/@SIZE@/data.csv")

and this command returns a data frame containing the columns ASSOC, SIZE
and all the remaining columns inside the CSV files (in my case the
processor performance), effectively loading all the CSV files into a single
data frame. So, I would get something like:

ASSOC,          SIZE,  PERF
direct-mapped,     2,   1.4
direct-mapped,     4,   1.6
direct-mapped,     8,   1.7
direct-mapped,    16,   1.7
4-way,             2,   1.4
4-way,             4,   1.5
...

I would like to ask whether there is any similar functionality already
implemented in R. If so, there is no need to reinvent the wheel :)
If it is not implemented and the R community believes that this feature
would be useful, I would be glad to contribute my code.
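
For readers who want to see what such a pattern expansion might involve, here is a minimal sketch in base R. It is not the code of the package described above: the name gather_sketch is hypothetical, and it assumes that every @NAME@ placeholder stands for one whole directory level. The demo writes a throwaway tree under a temporary directory, and the PERF value in it is a placeholder, not a measurement.

```r
# Hypothetical helper, not the poster's package. Assumes each @NAME@
# placeholder occupies one whole path component.
gather_sketch <- function(pattern) {
  parts  <- strsplit(pattern, "/", fixed = TRUE)[[1]]
  is_var <- grepl("^@.+@$", parts)
  vars   <- gsub("@", "", parts[is_var], fixed = TRUE)

  # Every placeholder becomes a "*" wildcard for Sys.glob()
  glob <- paste(ifelse(is_var, "*", parts), collapse = "/")

  pieces <- lapply(Sys.glob(glob), function(path) {
    path_parts <- strsplit(path, "/", fixed = TRUE)[[1]]
    # Align from the end of the path, where the placeholders sit
    offset <- length(path_parts) - length(parts)
    meta <- setNames(as.list(path_parts[which(is_var) + offset]), vars)
    # The one-row metadata is recycled to the number of rows in the CSV
    data.frame(meta, read.csv(path), stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

# Demo on a throwaway tree shaped like the one in the question
root <- tempfile("results")
for (assoc in c("direct-mapped", "4-way")) {
  for (size in c("2", "4")) {
    dir.create(file.path(root, assoc, size), recursive = TRUE)
    write.csv(data.frame(PERF = 1.5),  # placeholder value
              file.path(root, assoc, size, "data.csv"), row.names = FALSE)
  }
}
dframe <- gather_sketch(file.path(root, "@ASSOC@/@SIZE@/data.csv"))
print(dframe)
```

Note that the placeholder columns come back as character strings in this sketch; converting SIZE to a number would be a separate step.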

If your csv files all have the same columns and represent time series
then read.zoo in the zoo package can read multiple csv files in at
once using a single read.zoo command producing a single zoo object.

library(zoo)
?read.zoo
vignette("zoo-read")

Also see the other zoo vignettes and help files.
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
First of all, thank you for the answers. I did not know about zoo. However,
it seems that none of the approaches does exactly what I want (please correct
me if I am wrong).

My original question was probably not clear. The CSV files only
contain the performance values. The other two columns (ASSOC and SIZE) are
obtained from the existing values in the directory tree. So, in my opinion,
none of the proposed solutions would work unless every single "data.csv"
file contained all three columns (ASSOC, SIZE and PERF).
[...]

Maybe things would be clearer if you provided an example
of the tree and some example data, e.g. as a *.zip file.

As I understand your question, some of your variables' values
are stored in the CSV files, and the values of other variables
are given by the directory structure.

So you need to convert the structure of your directory
into values for your data frame.

You need a data frame that contains all possible values that are of
interest to you.
Some of them are loaded from the CSV files and others are just picked
from the directory structure.

You just have to fill in the data from the CSV files into the data frame,
and the values/variables that are implicitly given via the directory structure
you just set when importing.

Maybe just read in the CSV files and add the missing values.

So if the variable for the caching mechanism is
encoded as part of the path to the file, e.g. "direct-mapped",
then just set the cache value to "direct-mapped".

Ciao,
   Oliver

P.S.: In my understanding this belongs on r-users rather than r-devel,
      because I think r-devel is more focused on internals and
      package stuff, while your problem is rather a user problem
      (any R user needs some kind of "programming" to get things done).
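
The suggestion above (start from all parameter values of interest, read each CSV, and set the path-derived values at import time) could be sketched roughly as follows. The function name load_results and its arguments are hypothetical, the layout <root>/<ASSOC>/<SIZE>/data.csv is taken from the original question, and the PERF value written in the demo is a placeholder.

```r
# Hypothetical helper. Assumed layout: <root>/<ASSOC>/<SIZE>/data.csv
load_results <- function(root, assoc_values, size_values) {
  # One row per parameter combination of interest
  grid <- expand.grid(ASSOC = assoc_values, SIZE = size_values,
                      stringsAsFactors = FALSE)
  pieces <- lapply(seq_len(nrow(grid)), function(i) {
    path <- file.path(root, grid$ASSOC[i],
                      as.character(grid$SIZE[i]), "data.csv")
    if (!file.exists(path)) return(NULL)  # combination was not run; skip it
    data <- read.csv(path)
    # Values implied by the directory names are set at import time
    data$ASSOC <- rep(grid$ASSOC[i], nrow(data))
    data$SIZE  <- rep(grid$SIZE[i], nrow(data))
    data
  })
  do.call(rbind, Filter(Negate(is.null), pieces))
}

# Demo: three of the four combinations exist on disk
root <- tempfile("results")
existing <- list(c("direct-mapped", "2"), c("direct-mapped", "4"),
                 c("4-way", "2"))
for (combo in existing) {
  dir.create(file.path(root, combo[1], combo[2]), recursive = TRUE)
  write.csv(data.frame(PERF = 1.5),  # placeholder value
            file.path(root, combo[1], combo[2], "data.csv"),
            row.names = FALSE)
}
results <- load_results(root, c("direct-mapped", "4-way"), c(2, 4))
print(results)
```

Compared with globbing, this variant only looks for the combinations that were asked for and keeps SIZE numeric, at the cost of having to list the parameter values up front.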
Victor,

I understand you as follows

	The first two columns of the desired combined dataframe are the last two
levels of the pathname to the csv file.

	The columns in all the data.csv files are the same, namely, there is only
one column, and it is named PERF.

If so, the following should work (on unix)

do.call(rbind, lapply(Sys.glob('results/*/*/data.csv'), function(path) {
  within(read.csv(path), {
    SIZE  <- basename(dirname(path))
    ASSOC <- basename(dirname(dirname(path)))
  })
}))
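
One detail the call above leaves open: basename() returns character strings, so SIZE and ASSOC come back as text, and sorting by SIZE would put "16" before "2". A possible follow-up step is sketched below. It assumes the sizes are whole numbers and that the three associativity labels from the original question are the only ones. The data frame is built by hand from the example values in the question so that the snippet runs without any files.

```r
# Rows shaped as the call above would return them: PERF from the CSV,
# path-derived columns as character (values from the question's example).
dframe <- data.frame(
  PERF  = c(1.4, 1.5, 1.7, 1.4, 1.6, 1.7),
  ASSOC = c("4-way", "4-way", "direct-mapped", "direct-mapped",
            "direct-mapped", "direct-mapped"),
  SIZE  = c("2", "4", "16", "2", "4", "8"),
  stringsAsFactors = FALSE
)

# Turn SIZE into integers so that it sorts and plots numerically
dframe$SIZE <- as.integer(dframe$SIZE)

# Fix the order of the associativity labels (assumed complete list)
dframe$ASSOC <- factor(dframe$ASSOC,
                       levels = c("direct-mapped", "4-way",
                                  "fully-associative"))

# Sort by associativity, then by size, and renumber the rows
dframe <- dframe[order(dframe$ASSOC, dframe$SIZE), ]
rownames(dframe) <- NULL
print(dframe)
```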

In my case, my experimentation framework basically outputs a CSV with some
values read from the processor's performance counters (PMCs). For each
cache size and associativity I conduct an experiment, creating a CSV file
and placing that file into its own directory. I could modify the
experimentation framework so that it also outputs the cache size and
associativity, but that may not be ideal in some circumstances. I also
have a significant amount of old results that I want to keep using
without manually fixing the CSV files.

Has anyone else faced such a situation? Any good solutions?

Thank you,
Victor

On Thu, May 3, 2012 at 8:54 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:

On Thu, May 3, 2012 at 2:07 PM, victor jimenez <betabandido at gmail.com>
wrote:
[...]

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

You don't need to touch the CSV files, simply add values at load time - this is all easily doable in one line ;)

do.call("rbind",lapply(Sys.glob("*/*/data.csv"),function(d) cbind(read.csv(d),as.data.frame(t(strsplit(d,"/")[[1]])))))

  A B V1 V2       V3
1 1 2  1  a data.csv
2 3 4  1  a data.csv
3 1 2  1  b data.csv
4 3 4  1  b data.csv
5 1 2  2  a data.csv
6 3 4  2  a data.csv