Parsing
This should do what you want. It uses loops; you can work on replacing those with 'lapply' and the like, but whether that is worthwhile depends on whether rewriting the code would take you longer than processing a set of data (you never did say how large the data is). It also "grows" a data.frame, but you have not indicated how efficient it has to be, so treat this as a model.
x <- readLines(textConnection("x x_string
y y_string
id1 id1_string
id2 id2_string
z z_string
w w_string
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
x x_string1
y y_string1
z z_string1
w w_string1
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
x x_string2
y y_string2
id1 id1_string1
id2 id2_string1
z z_string2
w w_string2
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//"))
# I assume that each group is delimited by "//"
# initialize data.frame with desired values
# (z is not listed here, so it is not captured; add z=NA to keep it)
.keys <- data.frame(x=NA, y=NA, id1=NA, id2=NA, w=NA)
.out <- .keys  # for the first pass
.save <- NULL
for (i in seq_along(x)){
    if (x[i] == "//"){  # output the current data
        .save <- rbind(.save, .out)
        .out <- .keys  # setup for the next pass
    } else {
        # first token is the field name, second is its value
        .split <- strsplit(x[i], "\\s+")
        if (.split[[1]][1] %in% names(.out)){
            .out[[.split[[1]][1]]] <- .split[[1]][2]
        }
    }
}
.save
          x         y         id1         id2         w
1  x_string  y_string  id1_string  id2_string  w_string
2 x_string1 y_string1        <NA>        <NA> w_string1
3 x_string2 y_string2 id1_string1 id2_string1 w_string2
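
For comparison, here is a minimal loop-free sketch of the same idea. It is not the code from the reply above. It assumes, as above, that every record ends with a "//" line and that the first whitespace-separated token on a line is the field name. The names `input`, `fields`, `rec` and `result` are illustrative only. It preallocates a character matrix full of NA and fills it in one step with matrix indexing, so nothing is "grown" and absent fields simply stay NA.

```r
# Sample input in the same layout as the question (records end with "//").
input <- c(
  "x x_string", "y y_string", "id1 id1_string", "id2 id2_string",
  "z z_string", "w w_string", "stuff stuff stuff", "//",
  "x x_string1", "y y_string1", "z z_string1", "w w_string1",
  "stuff stuff stuff", "//",
  "x x_string2", "y y_string2", "id1 id1_string1", "id2 id2_string1",
  "z z_string2", "w w_string2", "stuff stuff stuff", "//"
)

# Fields to keep, in output column order (assumed from the question).
fields <- c("x", "y", "id1", "id2", "z", "w")

# Record number of each line: 1 + number of "//" lines seen before it.
is_sep <- input == "//"
rec <- (cumsum(is_sep) - is_sep + 1)[!is_sep]

# Split the non-separator lines into field name and value.
parts <- strsplit(input[!is_sep], "\\s+")
key   <- vapply(parts, `[`, character(1), 1)
value <- vapply(parts, `[`, character(1), 2)

# Preallocate an all-NA matrix, then fill the (record, field) cells.
hit <- key %in% fields
result <- matrix(NA_character_, nrow = max(rec), ncol = length(fields),
                 dimnames = list(NULL, fields))
result[cbind(rec[hit], match(key[hit], fields))] <- value[hit]

result
```

For a real file, replace the sample vector with `input <- readLines("file.txt")`. Individual columns such as `result[, "id1"]` then all have the same length, with NA where a record has no such field.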
On Wed, Jul 9, 2008 at 5:33 AM, Paolo Sonego <paolo.sonego at gmail.com> wrote:
Dear R users,
I have a big text file formatted like this:
x x_string
y y_string
id1 id1_string
id2 id2_string
z z_string
w w_string
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
x x_string1
y y_string1
z z_string1
w w_string1
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
x x_string2
y y_string2
id1 id1_string1
id2 id2_string1
z z_string2
w w_string2
stuff stuff stuff
stuff stuff stuff
stuff stuff stuff
//
...
...
I'd like to parse this file and retrieve the x, y, id1, id2, z, w fields and
save them into a matrix object:
x          y          id1          id2          z          w
x_string   y_string   id1_string   id2_string   z_string   w_string
x_string1  y_string1  NA           NA           z_string1  w_string1
x_string2  y_string2  id1_string1  id2_string1  z_string2  w_string2
...
...
id1, id2 fields are not always present within a section (the interval
between x and the last stuff) and
I'd like to insert a NA when they are absent (see above) so that
length(x)==length(y)==length(id1)==... .
Without the id1, id2 fields the task is easily solvable importing the text
file with readLines and retrieving the single fields with grep:
input = readLines("file.txt")
x = grep("^x\\s", input, value = TRUE)
id1 = grep("^id1\\s", input, value = TRUE)
...
I'd like to accomplish this task entirely in R (no SQL, no perl script),
possibly without using loops.
Any suggestions are quite welcome!
Regards,
Paolo
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?