Problem with parsing a dataset - help earnestly sought
We can use strapply in the gsubfn package. It extracts fields matching regular expressions. strapply extracts the parenthesized part of the regular expression (or the entire regular expression if nothing parenthesized), applies the function to it and returns the result. See http://gsubfn.googlecode.com This works with the rules stated below and works on your example but the general rules may only be apparent with more data in which case you may need to make appropriate adjustments. Note that the regular expressions: \w refers to a word character and must be written \\w when within in quotes. + means one or more occurrences in a row $ means end of string library(gsubfn) NULL2NA <- function(x) if (is.null(x)) NA else x extract <- function(x) { # age is "word" that comes after DX AGE: age <- strapply(x, "DX AGE: (\\w+)", c) age <- sapply(age, null2NA) # tissue is 2 or more word characters at end tissue <- strapply(x, "\\w\\w+$", c) tissue <- sapply(tissue, null2NA) data.frame(age, tissue) } extract(data[,2]) extract(data[,3])
On Fri, Apr 23, 2010 at 12:24 AM, Min-Han Tan <minhan.science at gmail.com> wrote:
Dear fellow R-help members,
I hope to seek your advice on how to parse/manage a dataset with hundreds of
columns. Two examples of these columns, 'cancer.problems', and
'neuro.problems' are depicted below. Essentially, I need to parse this into
a useful dataset, and unfortunately, I am not familiar with perl or any such
language.
data <- data.frame(id=c(1:10))
data$cancer.problems <- c("Y; DX AGE: 28; COLON", "", "Y; DX AGE: 27;", "Y;
LIVER","","Y","Y; DX AGE: 24;","Y","Y;DX AGE: 44;","Y;DX AGE: 39; TESTIS")
data$neuro.problems <- c("Y: DX AGE: 80-89;","Y","","Y; DX AGE: 74;
STROKE","Y; DEMENTIA","Y","","Y; DX AGE: 33; CHOREA", "Y", "Y; WEAKNESS")
As can be seen, the semi-colon delimiter follows its own set of rules, which
are internally consistent - with all 3 elements of data, it should be
"Status; Age; Tissue Type". However, if there is only tissue type, it is"
Status; Tissue Type", without the trailing semi-colon. However, if there is
Age available, it is "Status; Age;".
The main challenge for me is how to parse/convert this dataset into a useful
and consistent data.frame, or list, where I can capture Status, Age and
Tissue Type as separate fields. Due to the varying application of the
delimiter, I cannot use strsplit consistently. I have tried a convoluted
method by identifying "AGE" as the character string identifying 3 element
fields per below, but faced problems with unlist, given the empty fields.
age.present <- grepl("AGE",data[,2])
data.3column <- strsplit(data[age.present,2],";")
data.2column <- strsplit(data[!age.present,2],";")
data$cancer.status[age.present] <- unlist(data.3column)
[(1:sum(age.present)*3)-2]
...
Your advice is earnestly sought.
Thanks.
Min-Han
? ? ? ?[[alternative HTML version deleted]]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.