Problem with parsing a dataset - help earnestly sought

We can use strapply in the gsubfn package. It extracts
fields matching regular expressions.

strapply extracts the text matched by the parenthesized part
of the regular expression (or by the entire regular expression
if nothing is parenthesized), applies the function to it and
returns the result.  See http://gsubfn.googlecode.com
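
To make the "parenthesized part versus entire match" distinction concrete,
here is a small base-R sketch of the same idea. It uses regexec and
regmatches rather than strapply, so it is only an analogue and not how
gsubfn works internally. The input strings are made up for illustration.

```r
# Hypothetical input; the real records may look different.
x <- c("ID 7 DX AGE: 45 LUNG", "ID 8 BRAIN")

# regexec() reports the whole match first, then each parenthesized group.
m <- regmatches(x, regexec("DX AGE: (\\w+)", x))

# Keep the group when there is a match, NA otherwise.
age <- sapply(m, function(v) if (length(v) >= 2) v[2] else NA_character_,
              USE.NAMES = FALSE)
```

Here m[[1]] holds both the whole match and the group, while m[[2]] is
empty because the second string has no DX AGE: field.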

This follows the rules stated below and works on your
example, but the general rules may only become apparent
with more data, in which case you may need to make
appropriate adjustments.

Note that in the regular expressions:

\w refers to a word character and must be written \\w when within quotes.
+ means one or more occurrences in a row
$ means end of string
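
A quick way to check these three pieces is to try them with grepl in base R.
This is only an illustrative sketch; the test strings are invented.

```r
# The doubled backslash is only R string escaping; the regex itself sees \w.
n <- nchar("\\w")   # the string holds 2 characters: a backslash and a w

# + : is there a run of one or more word characters anywhere?
has_word_run <- grepl("\\w+", c("abc", "!!!"))

# $ : do 2 or more word characters sit at the very end of the string?
ends_in_two <- grepl("\\w\\w+$", c("LUNG", "LUNG."))
```

The trailing period in "LUNG." stops the last pattern from matching, since
$ requires the word characters to run right up to the end of the string.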

library(gsubfn)

# replace NULL (no match) with NA; name must match the calls in extract()
null2NA <- function(x) if (is.null(x)) NA else x

extract <- function(x) {

	# age is "word" that comes after DX AGE:
	age <- strapply(x, "DX AGE: (\\w+)", c)
	age <- sapply(age, null2NA)

	# tissue is 2 or more word characters at end
	tissue <- strapply(x, "\\w\\w+$", c)
	tissue <- sapply(tissue, null2NA)

	data.frame(age, tissue)
}

extract(data[,2])
extract(data[,3])
