Skip to content

stacking imported data

4 messages · Sundar Dorai-Raj, Spencer Graves, Gabor Grothendieck

#
Hi all,
   I have a question that I don't have a good answer for (note the word 
"good"; I have an answer, but I consider it not "good"). Take the 
following data in a single tab-delimited text file:

<text>

A
Labels	Value	SE	2.5%	97.5%
R90	0.231787	1.148044	0.035074	1.531779
R0	0.500861	0.604406	0.185336	1.353552

B
Labels	Value	SE	2.5%	97.5%
(Intercept)	1.367514	0.036431	1.287975	1.451964
</text>

(Note: the <text> tags are not present and are added here only to show 
blank lines.)

I would like to read the data into a single data.frame which looks like

Labels	Value	SE	2.5%	97.5%
A.R90	0.231787	1.148044	0.035074	1.531779
A.R0	0.500861	0.604406	0.185336	1.353552
B.(Intercept)	1.367514	0.036431	1.287975	1.451964

A few rules:

1. the number of rows in "A" and "B" will vary from 1 to ???. Here "A" 
has 1 row (excluding header) and B has 2 rows (excluding header).
2. the number of columns in "A" and "B" will always be the same.
4. the headers for "A" and "B" will always be the same.
3. there is always an empty line at the beginning of the file and in 
between "A" and "B".

My solution involves scan and indexing though it is error prone and not 
flexible if more or less than 5 columns are present in the data. While 
the number of columns is always the same from "A" to "B" it may change 
that "A" and "B" have more or fewer columns.

I hope this makes sense.

Thanks,

--sundar
#
Hi, Sundar: 

      I got something that looks like it might be what you want copying 
your 8 lines to clipboard and using the following: 

DF <- read.table("clipboard",  colClasses=character(0), fill=TRUE)
breaks <- which(DF$V2=="")
nrows <- diff(c(breaks, dim(DF)[1]+1))

files <- as.character(DF[breaks,1])

DF2 <- cbind(rep(files, nrows), DF)

DF. <- DF2[-c(breaks, breaks+1),]
DFnames <- as.matrix(DF[breaks[1]+1, ])
names(DF.) <- c("Files", DFnames)
#################
Result: 
DF

  Files      Labels    Value       SE     2.5%    97.5%
3     A         R90 0.231787 1.148044 0.035074 1.531779
4     A          R0 0.500861 0.604406 0.185336 1.353552
7     B (Intercept) 1.367514 0.036431 1.287975 1.451964

      This uses the "fill" argument in data.frame in R that Andy Liaw 
mentioned earlier today.  (Thus, this solution won't work in S-Plus 6.2, 
where data.frame does not have this argument.) 

      Is this satisfactory? 
      Spencer Graves
Sundar Dorai-Raj wrote:

            

  
    
#
Sundar Dorai-Raj <sundar.dorai-raj <at> pdf.com> writes:

: 
: Hi all,
:    I have a question that I don't have a good answer for (note the word 
: "good"; I have an answer, but I consider it not "good"). Take the 
: following data in a single tab-delimited text file:
: 
: <text>
: 
: A
: Labels	Value	SE	2.5%	97.5%
: R90	0.231787	1.148044	0.035074	1.531779
: R0	0.500861	0.604406	0.185336	1.353552
: 
: B
: Labels	Value	SE	2.5%	97.5%
: (Intercept)	1.367514	0.036431	1.287975	1.451964
: </text>
: 
: (Note: the <text> tags are not present and are added here only to show 
: blank lines.)
: 
: I would like to read the data into a single data.frame which looks like
: 
: Labels	Value	SE	2.5%	97.5%
: A.R90	0.231787	1.148044	0.035074	1.531779
: A.R0	0.500861	0.604406	0.185336	1.353552
: B.(Intercept)	1.367514	0.036431	1.287975	1.451964
: 
: A few rules:
: 
: 1. the number of rows in "A" and "B" will vary from 1 to ???. Here "A" 
: has 1 row (excluding header) and B has 2 rows (excluding header).
: 2. the number of columns in "A" and "B" will always be the same.
: 4. the headers for "A" and "B" will always be the same.
: 3. there is always an empty line at the beginning of the file and in 
: between "A" and "B".
: 

Read the lines into vector z, one line per element.

Define a grouping variable, g, which is 1 for the lines
starting at the first blank line and 2 for the lines
starting at the 2nd.  Define a function f which accepts such
a group of lines and creates the appropriate data frame from
them.  tapply the lines, grouped by g, and bind the rows of
the data frame produced from each group together into one
large data frame.  

z <- readLines("file.dat")

g <- cumsum(nchar(z) == 0)
f <- function(x) {
	x[-(1:3)] <- paste(trim(x[2]), x[-(1:3)], sep = ".")
	read.table(textConnection(x[-(1:2)]), header = TRUE)
}
do.call("rbind", tapply(z, cumsum(nchar(z) == 0), f))


Note: if the blank lines or the A and B lines contain
whitespace trim this off first.  That is, insert these
two lines after the readLines statement:

trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)
z <- trim(z)
#
Gabor Grothendieck <ggrothendieck <at> myway.com> writes:

: 
: Sundar Dorai-Raj <sundar.dorai-raj <at> pdf.com> writes:
: 
: : 
: : Hi all,
: :    I have a question that I don't have a good answer for (note the word 
: : "good"; I have an answer, but I consider it not "good"). Take the 
: : following data in a single tab-delimited text file:
: : 
: : <text>
: : 
: : A
: : Labels	Value	SE	2.5%	97.5%
: : R90	0.231787	1.148044	0.035074	1.531779
: : R0	0.500861	0.604406	0.185336	1.353552
: : 
: : B
: : Labels	Value	SE	2.5%	97.5%
: : (Intercept)	1.367514	0.036431	1.287975	1.451964
: : </text>
: : 
: : (Note: the <text> tags are not present and are added here only to show 
: : blank lines.)
: : 
: : I would like to read the data into a single data.frame which looks like
: : 
: : Labels	Value	SE	2.5%	97.5%
: : A.R90	0.231787	1.148044	0.035074	1.531779
: : A.R0	0.500861	0.604406	0.185336	1.353552
: : B.(Intercept)	1.367514	0.036431	1.287975
	1.451964
: : 
: : A few rules:
: : 
: : 1. the number of rows in "A" and "B" will vary from 1 to ???. Here "A" 
: : has 1 row (excluding header) and B has 2 rows (excluding header).
: : 2. the number of columns in "A" and "B" will always be the same.
: : 4. the headers for "A" and "B" will always be the same.
: : 3. there is always an empty line at the beginning of the file and in 
: : between "A" and "B".
: : 
: 
: Read the lines into vector z, one line per element.
: 
: Define a grouping variable, g, which is 1 for the lines
: starting at the first blank line and 2 for the lines
: starting at the 2nd.  Define a function f which accepts such
: a group of lines and creates the appropriate data frame from
: them.  tapply the lines, grouped by g, and bind the rows of
: the data frame produced from each group together into one
: large data frame.  
: 
: z <- readLines("file.dat")
: 
: g <- cumsum(nchar(z) == 0)
: f <- function(x) {
: 	x[-(1:3)] <- paste(trim(x[2]), x[-(1:3)], sep = ".")
: 	read.table(textConnection(x[-(1:2)]), header = TRUE)
: }
: do.call("rbind", tapply(z, cumsum(nchar(z) == 0), f))

A correction:

 z <- readLines("file.dat")
 
 g <- cumsum(nchar(z) == 0)
 f <- function(x) {
 	x[-(1:3)] <- paste(x[2], x[-(1:3)], sep = ".")
 	read.table(textConnection(x[-(1:2)]), header = TRUE)
 }
 do.call("rbind", tapply(z, cumsum(nchar(z) == 0), f))


: 
: Note: if the blank lines or the A and B lines contain
: whitespace trim this off first.  That is, insert these
: two lines after the readLines statement:
: 
: trim <- function(x) gsub("^[[:space:]]+|[[:space:]]+$", "", x)
: z <- trim(z)
: 
: ______________________________________________
: R-help <at> stat.math.ethz.ch mailing list
: https://stat.ethz.ch/mailman/listinfo/r-help
: PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
: 
: