Complex text parsing task
Hi Paul,

I do not think that Nick's comment was really meant to be directed at you. He is probably just tired of getting so many emails from R-help.

Nick, to stop getting emails if you no longer want them, follow the link at the bottom of every email you have received from R-help; you can unsubscribe yourself from there. If you like R-help but just do not like the quantity of emails, you could consider switching your subscription to a daily digest so you just get one email. Alternatively, you could create a special folder in your email for R-help messages, and create a filter that automatically sends all messages from R-help to that folder, so you still have them all but they do not clutter up your inbox.

Cheers,

Josh
On Mon, May 21, 2012 at 8:53 AM, Paul Miller <pjmiller_57 at yahoo.com> wrote:
Hi Nick,
Can you elaborate (hopefully in a constructive way) on what it is that you find objectionable about my post?
Thanks,
Paul
--- On Mon, 5/21/12, Nick Gayeski <nick at wildfishconservancy.org> wrote:
From: Nick Gayeski <nick at wildfishconservancy.org>
Subject: RE: [R] Complex text parsing task
To: "'Paul Miller'" <pjmiller_57 at yahoo.com>, r-help at r-project.org
Received: Monday, May 21, 2012, 10:36 AM
Please stop sending these emails!
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Paul Miller
Sent: Monday, May 21, 2012 8:32 AM
To: r-help at r-project.org
Subject: [R] Complex text parsing task
Hello Everyone,
I have what I think is a complex text parsing task. I've
provided some
sample data below. There's a relatively simple version of
the coding that
needs to be done and a more complex version. If someone
could help me out
with either version, I'd greatly appreciate it.
Here are my sample data.
haveData <-
structure(list(profile_key = structure(c(1L, 1L, 2L, 2L, 2L,
3L, 3L, 4L, 4L,
5L, 5L, 5L, 6L, 6L, 7L, 7L), .Label = c("001-001 ",
"001-002 ", "001-003 ", "001-004 ", "001-005 ", "001-006 ",
"001-007 "
), class = "factor"), encounter_date = structure(c(9L, 10L,
11L, 12L, 13L,
5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 4L, 7L, 7L), .Label = c("
2009-03-01 ", "
2009-03-22 ", " 2009-04-01 ", " 2010-03-01 ", " 2010-10-15
", " 2010-11-15
", " 2011-03-01 ", " 2011-03-14 ", " 2011-10-10 ", "
2011-10-24 ", "
2012-09-15 ", " 2012-10-05 ", " 2012-10-17 "
), class = "factor"), raw = structure(c(9L, 12L, 16L, 13L,
10L, 7L, 6L, 3L,
2L, 4L, 14L, 15L, 1L, 5L, 8L, 11L), .Label = c(" ... If
patient KRAS result
is wild type, they will start Erbitux. ... (Several lines of
material) ...
Ordered KRAS mutation test 11/11/2011. Results are still not
available. ...
", " ... KRAS (mutated). Therefore did not prescribe
Erbitux. ... ", " ...
KRAS (mutated). Will not prescribe Erbitux due to mutation.
... ", " ...
KRAS (Wild). ...", " ... KRAS results are in. Patient has
the mutation. ...
", " ... KRAS results still pending. Note that patient was
negative for
Lynch mutation. ...", " ... KRAS test results pending. Note
that patient was
negative for Lynch mutation. ...", " ... Ordered KRAS
mutation testing on
02/15/2011. Results came back negative. ... (Several lines
of material) ...
Patient KRAS mutation test is negative. Will start Erbitux.
...", " ...
Ordered KRAS testing on 10/10/2010. Results not yet
available. If patient
has a mutaton, will start Erbitux. ...", " ... Ordered KRAS
testing. Waiting
for results. ...", " ... Patient is KRAS negative. Started
Erbitux on
03/01/2011. ...", " ... Received KRAS results on 10/20/2010.
Test results
indicate tumor is wild type. Ua Protein positve. ER/PR
positive. HER2/neu
positve. ...", " ... Still need to order KRAS mutation
testing. ... ", " ...
Tumor is negative for KRAS mutation. ...", " ... Tumor is
wild type. Patient
is eligible to receive Eribtux. ...", " ... Will conduct
KRAS mutation
testing prior to initiation of therapy with Erbitux. ..."
), class = "factor")), .Names = c("profile_key",
"encounter_date", "raw"),
row.names = c(NA, -16L), class = "data.frame")
The following code displays the results of so-called
"simple" coding.
#### Simple coding ####
KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005",
"001-006", "001-007")
KRAStested <- c(2,3,2,2,2,3,3)
KRASwild <- c(1,0,2,0,3,1,3)
KRASmutant <- c(4,2,2,3,1,2,2)
simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
simpleData
Here, KRAStested is calculated by summing all references to
"KRAS" for each
patient. Wild is calculated by summing all references to
"wild type",
"wild", and "negative" that come within 20 words of the
closest reference to
KRAS. Mutant is calculated by summing all references to
"mutant", "mutated",
and "positive" that occur within 20 words of the closest
reference to KRAS.
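The simple rule described above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the helpers `tokenize` and `count_near_kras` are hypothetical names, a "word" is taken to be any run of letters, digits, or "/", and matching is on whole tokens, so "wild" covers both "wild" and "wild type" without double counting. The counts it produces depend on the keyword list and the tokenization, so they need not match the hand-coded table above.

```r
# Hypothetical helper (not from the original post): split a note into
# lowercase word tokens, keeping "/" so dates and "ER/PR" stay whole.
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Count keyword tokens lying within `window` words of the nearest "kras".
count_near_kras <- function(text, keywords, window = 20) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  key_pos <- which(words %in% keywords)
  if (length(kras_pos) == 0 || length(key_pos) == 0) return(0L)
  # Distance from each keyword to its closest KRAS mention.
  nearest <- vapply(key_pos, function(k) min(abs(kras_pos - k)), numeric(1))
  sum(nearest <= window)
}

# Toy notes copied from the sample data; column names follow haveData.
notes <- data.frame(
  profile_key = c("001-004", "001-004"),
  raw = c("KRAS (mutated). Will not prescribe Erbitux due to mutation.",
          "KRAS (mutated). Therefore did not prescribe Erbitux."),
  stringsAsFactors = FALSE
)

notes$tested <- vapply(notes$raw, function(x) sum(tokenize(x) == "kras"),
                       numeric(1), USE.NAMES = FALSE)
notes$wild <- vapply(notes$raw, count_near_kras, numeric(1),
                     keywords = c("wild", "negative"), USE.NAMES = FALSE)
notes$mutant <- vapply(notes$raw, count_near_kras, numeric(1),
                       keywords = c("mutant", "mutated", "positive"),
                       USE.NAMES = FALSE)

# Sum the per-note counts within each patient.
simple <- aggregate(cbind(tested, wild, mutant) ~ profile_key,
                    data = notes, FUN = sum)
simple
```

Because matching is on whole tokens, "mutation" is not counted as "mutated"; add it to `keywords` if it should be.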
The second kind of coding is what I'm referring to as
"complex coding". The
following code displays the results of this type of coding.
#### Complex coding ####
KRAStested <- c(2,1,0,2,2,2,3)
KRASwild <- c(1,0,0,0,3,0,3)
KRASmutant <- c(0,0,0,3,0,1,0)
complexData <- data.frame(KRASpatient, KRAStested,
KRASwild, KRASmutant)
complexData
The results of "complex coding" differ substantially from
those obtained
under "simple coding" and I think illustrate the potential
problems with
that approach. With "complex coding", the goal would be to
identify and sum
only true references to KRAS testing and true references to
the result of
that testing (either wild type/negative or
mutant/positive).
True references to KRAS testing would be identified using a
set of
qualifiers that eliminate the false references. So, for
example, one of the
patients in my (made up) sample data has the phrase "Will
conduct KRAS
mutation testing prior to initiation of therapy with
Erbitux" in their
medical record. In this case, "Will" is a qualifier that
indicates this is
not a true reference to KRAS testing. For this exercise,
other qualifiers
related to KRAS testing would include "need", "order" (but
not the past
tense "ordered"), "wait", "waiting", "await", and
"awaiting".
To be a qualifier, these terms would need to occur within 12
words of the
closest true reference to KRAS.
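One possible reading of the testing-qualifier rule, sketched in base R: a KRAS mention is treated as a true reference only if none of the listed qualifiers falls within 12 words of it. The function names are hypothetical, and matching on whole tokens is an assumption that keeps "order" from matching "ordered". Note that a plain word window ignores sentence boundaries, so a qualifier in a neighbouring sentence can still suppress a mention.

```r
# Qualifiers taken from the description above.
test_qualifiers <- c("will", "need", "order", "wait", "waiting",
                     "await", "awaiting")

# Hypothetical tokenizer: lowercase tokens of letters, digits, and "/".
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Return token positions of KRAS mentions with no qualifier within `window` words.
true_kras_mentions <- function(text, qualifiers = test_qualifiers, window = 12) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  qual_pos <- which(words %in% qualifiers)
  if (length(kras_pos) == 0) return(integer(0))
  is_true <- vapply(kras_pos,
                    function(k) !any(abs(qual_pos - k) <= window),
                    logical(1))
  kras_pos[is_true]
}

# "Will" qualifies this mention, so nothing is returned.
true_kras_mentions("Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux.")

# "Ordered" is not the qualifier "order", so the mention is kept.
true_kras_mentions("Ordered KRAS testing on 10/10/2010. Results not yet available.")
```

The complex KRAStested count for a note would then be `length(true_kras_mentions(note))`.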
True references to the results of testing would also be
identified using a
set of qualifiers that eliminate false references. Here the
list of
qualifiers would include "if", "lynch", "kras mutation
test", "kras mutation
testing" and "for kras mutation". Qualifiers would need to
come within 12
words of a true reference to KRAS testing.
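Several of the result qualifiers are multi-word phrases, so single-token matching is not enough. A minimal sketch, assuming the same whole-token tokenization as before and using hypothetical helper names, locates each phrase in the token sequence and flags every KRAS mention that has a qualifier within 12 words. How a flagged mention should then change the wild and mutant tallies is not fully specified in the description, so the sketch only reports the flag.

```r
# Qualifiers taken from the description above.
result_qualifiers <- c("if", "lynch", "kras mutation test",
                       "kras mutation testing", "for kras mutation")

# Hypothetical tokenizer: lowercase tokens of letters, digits, and "/".
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Start positions of a (possibly multi-word) phrase in a token vector.
phrase_positions <- function(words, phrase) {
  p <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  n <- length(words)
  m <- length(p)
  if (n < m) return(integer(0))
  starts <- seq_len(n - m + 1)
  hit <- vapply(starts, function(s) all(words[s:(s + m - 1)] == p), logical(1))
  starts[hit]
}

# One logical per KRAS mention: TRUE if a result qualifier is within `window` words.
qualified_kras <- function(text, qualifiers = result_qualifiers, window = 12) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  qual_pos <- unlist(lapply(qualifiers, function(q) phrase_positions(words, q)))
  vapply(kras_pos, function(k) any(abs(qual_pos - k) <= window), logical(1))
}

qualified_kras("If patient KRAS result is wild type, they will start Erbitux.")
qualified_kras("KRAS (Wild).")
```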
There's an additional wrinkle for identifying true
references to the results
of testing. One also needs to take into account the presence
of what I'm
calling "nullifiers". For purposes of this exercise,
nullifiers include "Ua
Protein", "ER/PR", and "HER2/neu". If "positive" or
"negative" come closer to
one of these words than to a true reference to KRAS, then
they should not be
used to identify the results of KRAS testing.
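The nullifier rule compares two distances for each "positive" or "negative": the distance to the nearest KRAS mention and the distance to the nearest nullifier. A rough base R sketch, with hypothetical helper names and the simplifying assumption that every KRAS token counts as a true reference, keeps a term only when KRAS is at least as close as any nullifier. The example note uses the spelling "positive" throughout so that the tokens match.

```r
# Hypothetical tokenizer: lowercase tokens of letters, digits, and "/".
tokenize <- function(text) {
  words <- strsplit(tolower(text), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Return the "positive"/"negative" tokens that are not closer to a nullifier
# than to a KRAS mention.
keep_result_terms <- function(text, terms = c("positive", "negative")) {
  words <- tokenize(text)
  kras_pos <- which(words == "kras")
  # "Ua Protein" spans two tokens: take "ua" when the next token is "protein".
  ua_pos <- which(words == "ua" & c(words[-1], "") == "protein")
  null_pos <- c(which(words %in% c("er/pr", "her2/neu")), ua_pos)
  term_pos <- which(words %in% terms)
  nearest <- function(pos, targets) {
    if (length(targets) == 0) Inf else min(abs(targets - pos))
  }
  keep <- vapply(term_pos, function(p) {
    d_kras <- nearest(p, kras_pos)
    # Require a KRAS mention, and drop terms that sit closer to a nullifier.
    is.finite(d_kras) && d_kras <= nearest(p, null_pos)
  }, logical(1))
  words[term_pos[keep]]
}

note <- paste("Received KRAS results on 10/20/2010.",
              "Test results indicate tumor is wild type.",
              "Ua Protein positive. ER/PR positive. HER2/neu positive.")
keep_result_terms(note)
keep_result_terms("Patient is KRAS negative. Started Erbitux on 03/01/2011.")
```

Ties are resolved in favour of KRAS here; that choice is an assumption, since the description only covers the case where the nullifier is strictly closer.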
Help with either type of coding would be greatly
appreciated.
Thanks,
Paul
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/