
how to read a group of files into one dataset?

5 messages · Jie TANG, Eik Vettorazzi, Mohammed Ouassou +2 more

#
Hi Jie,
you have to combine the individual data.frames, and depending on the
structure of your inputs and the form you want the resulting data.frame
to take (neither of which you specified), either ?merge or ?rbind should help.
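
To illustrate the difference between the two, here is a minimal sketch with made-up data frames (a, b and extra are hypothetical and not taken from the original question): rbind() stacks rows when the columns match, while merge() joins columns on a shared key.

```r
# Two small data frames with identical columns (hypothetical data)
a <- data.frame(id = 1:2, value = c(10, 20))
b <- data.frame(id = 3:4, value = c(30, 40))

# rbind() stacks rows: use it when every input has the same variables
stacked <- rbind(a, b)

# A data frame that shares only the key column "id" with a
extra <- data.frame(id = 1:2, label = c("x", "y"))

# merge() joins columns on the shared key: use it when the inputs hold
# different variables for the same observations
joined <- merge(a, extra, by = "id")
```

Which of the two applies depends on whether the files are further observations of the same variables (rbind) or further variables for the same observations (merge).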

cheers


On 25.08.2011 10:17, Jie TANG wrote:

  
    
#
inputDataPath  <- "/home/.../bla/"   # Directory containing the data files
szPattern      <- "\\.dat$"          # Regular expression for the file extension

# Get the names of all matching files in the specified directory
file2process <- list.files(inputDataPath, pattern = szPattern)

# Get the number of files to be processed
iFileCnt     <- length(file2process)
dbMatrix     <- list()               # Empty list that will hold the full file paths
for (i in seq_len(iFileCnt))
    {
       # inputDataPath ends with "/", so plain concatenation gives the full path
       dataFile     <- sprintf("%s%s", inputDataPath, file2process[i])
       dbMatrix[i]  <- dataFile
    }
# Read every file; ldb (your local database) gets one data frame per file
ldb <- lapply(dbMatrix, read.table, header = TRUE)

The local database ldb is a list of data frames, one per data file.

 # Get the i-th data frame from the list (local database) as a matrix
 Mat <- as.matrix(ldb[[i]])
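
A more compact variant of the same idea, as a sketch only: the temporary directory, the file names part1.dat to part3.dat and their contents are invented here so that the example runs anywhere. It relies on the full.names argument of list.files() to avoid pasting paths by hand.

```r
# Hypothetical setup: write three small .dat files into a temporary directory
demoDir <- file.path(tempdir(), "dat_demo")
dir.create(demoDir, showWarnings = FALSE)
for (k in 1:3) {
  write.table(data.frame(x = 1:4, y = (1:4) * k),
              file = file.path(demoDir, sprintf("part%d.dat", k)),
              row.names = FALSE)
}

# full.names = TRUE returns complete paths; the pattern is a regular
# expression, hence the escaped dot and the end-of-string anchor
paths <- list.files(demoDir, pattern = "\\.dat$", full.names = TRUE)

# One data frame per file, named after the file it came from
ldb <- lapply(paths, read.table, header = TRUE)
names(ldb) <- basename(paths)

# Extract the contents of one file as a matrix
Mat <- as.matrix(ldb[["part2.dat"]])
```

Naming the list elements after the files makes it possible to look up a data set by file name instead of by position.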



I hope this helps!
On Thu, 2011-08-25 at 11:43 +0200, Eik Vettorazzi wrote:
#
Hi:

Similar in vein to the other respondents, you could try something like this:
On Thu, Aug 25, 2011 at 1:17 AM, Jie TANG <totangjie at gmail.com> wrote:
# Your file names (assuming they are in your startup directory -
# see list.files() for a more general approach, as mentioned previously)
The following assumes that each file named in flnm yields a data frame
with the same set of variables and the same number of columns.

# Method 1:  base R code

  newdata <- lapply(flnm, read.table, skip = 2)
  bigdf <- do.call(rbind, newdata)

# Method 2: Use the plyr package

library('plyr')
bdf <- ldply(mlply(files, read.csv, header = TRUE), rbind)

bigdf and bdf should have the same number of rows; bdf will have one
more column than bigdf because the first column of bdf is an indicator
of the initial data frame it came from, with a numerical rather than a
character index.

The inner call, mlply, is analogous to the lapply() function from
method 1, and the outer call, ldply, has a similar effect to
do.call().
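
For readers without plyr, the extra indicator column can be imitated in base R. This is only a sketch under assumptions: dfs stands in for the list of data frames read from the files, its values are partly invented, and the column name X1 is chosen to match the description above.

```r
# Hypothetical stand-ins for the data frames read from each file
dfs <- list(data.frame(id = 1:2, count = c(47, 36)),
            data.frame(id = 1:2, count = c(53, 12)))

# Prepend the position of each data frame in the list as column X1,
# then stack the pieces with do.call(rbind, ...)
tagged <- Map(function(df, i) cbind(X1 = i, df), dfs, seq_along(dfs))
bdf_base <- do.call(rbind, tagged)
```

Here Map() plays the role of the inner call and do.call(rbind, ...) that of the outer one; the result has one more column than a plain rbind of dfs.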

Here's an example. I have ten files named file_01.csv - file_10.csv in
my startup directory; each has 20 rows and 2 columns, with the same
column names in each.
[1] "file_01.csv" "file_02.csv" "file_03.csv" "file_04.csv" "file_05.csv"
 [6] "file_06.csv" "file_07.csv" "file_08.csv" "file_09.csv" "file_10.csv"

### Method 1:
[1] 200   2
# Show this is right by returning the numbers of rows and cols
# in each list component of filelist
[1] 20 20 20 20 20 20 20 20 20 20
[1] 2 2 2 2 2 2 2 2 2 2

# Method 2:
library('plyr')
[1] 200   3
X1 id count
1  1  1    47
2  1  2    36
3  1  3    53
id count
1  1    47
2  2    36
3  3    53
1  2  3  4  5  6  7  8  9 10
20 20 20 20 20 20 20 20 20 20
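
The commands that produced the output above are not shown in this message. A sketch that should reproduce the Method 1 checks, assuming ten generated files in a temporary directory (the directory and the random counts are invented; the names files, filelist and bigdf follow the usage elsewhere in this thread):

```r
# Hypothetical setup: ten CSV files with 20 rows and 2 columns each
exDir <- file.path(tempdir(), "csv_demo")
dir.create(exDir, showWarnings = FALSE)
set.seed(1)
for (k in 1:10) {
  write.csv(data.frame(id = 1:20, count = rpois(20, 40)),
            file = file.path(exDir, sprintf("file_%02d.csv", k)),
            row.names = FALSE)
}

# Collect the full paths of file_01.csv to file_10.csv
files <- list.files(exDir, pattern = "^file_[0-9]{2}\\.csv$", full.names = TRUE)

# Method 1: read every file, then stack the resulting data frames
filelist <- lapply(files, read.csv, header = TRUE)
bigdf <- do.call(rbind, filelist)

dim(bigdf)               # rows and columns of the combined data frame
sapply(filelist, nrow)   # number of rows contributed by each file
sapply(filelist, ncol)   # number of columns in each file
```

With ten files of 20 rows each, the combined data frame has 200 rows and 2 columns, in line with the dimensions reported above.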

HTH,
Dennis
#
Or just

bdf <- ldply(files, read.csv, header = TRUE)

Hadley