Complex text parsing task
Hi Josh,
Thanks for pointing this out. It hadn't occurred to me that someone might post something like this to indicate they would like to receive fewer or no messages.
Paul
--- On Mon, 5/21/12, Joshua Wiley <jwiley.psych at gmail.com> wrote:
From: Joshua Wiley <jwiley.psych at gmail.com>
Subject: Re: [R] Complex text parsing task
To: "Paul Miller" <pjmiller_57 at yahoo.com>
Cc: "Nick Gayeski" <nick at wildfishconservancy.org>, r-help at r-project.org
Received: Monday, May 21, 2012, 11:01 AM
Hi Paul,
I do not think that Nick's comment was really meant to be directed at you. He is probably just tired of getting so many emails from R-help.
Nick, to stop getting emails if you no longer want them, try following the link at the bottom of every single email you have received from R-help. You can unsubscribe yourself from there if you want. If you like R-help but just do not like the quantity of emails, you could consider switching your subscription to a daily digest so you just get one email. Alternatively, you could create a special folder in your email for R-help messages, and create a filter that automatically sends all messages from R-help to that special folder so you still have them all but they do not clutter up your inbox.
Cheers,
Josh
On Mon, May 21, 2012 at 8:53 AM, Paul Miller <pjmiller_57 at yahoo.com> wrote:
Hi Nick,
Can you elaborate (hopefully in a constructive way) on what it is that you find objectionable about my post?
Thanks,
Paul
--- On Mon, 5/21/12, Nick Gayeski <nick at wildfishconservancy.org> wrote:
From: Nick Gayeski <nick at wildfishconservancy.org>
Subject: RE: [R] Complex text parsing task
To: "'Paul Miller'" <pjmiller_57 at yahoo.com>, r-help at r-project.org
Received: Monday, May 21, 2012, 10:36 AM
Please stop sending these emails!
-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Paul Miller
Sent: Monday, May 21, 2012 8:32 AM
To: r-help at r-project.org
Subject: [R] Complex text parsing task
Hello Everyone,
I have what I think is a complex text parsing task. I've provided some sample data below. There's a relatively simple version of the coding that needs to be done and a more complex version. If someone could help me out with either version, I'd greatly appreciate it.
Here are my sample data.
haveData <- structure(list(profile_key = structure(c(1L, 1L,
2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L), .Label = c("001-001 ",
"001-002 ", "001-003 ", "001-004 ", "001-005 ", "001-006 ", "001-007 "
), class = "factor"), encounter_date = structure(c(9L, 10L, 11L, 12L, 13L,
5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 4L, 7L, 7L), .Label = c(" 2009-03-01 ",
" 2009-03-22 ", " 2009-04-01 ", " 2010-03-01 ", " 2010-10-15 ",
" 2010-11-15 ", " 2011-03-01 ", " 2011-03-14 ", " 2011-10-10 ",
" 2011-10-24 ", " 2012-09-15 ", " 2012-10-05 ", " 2012-10-17 "
), class = "factor"), raw = structure(c(9L, 12L, 16L, 13L, 10L, 7L, 6L, 3L,
2L, 4L, 14L, 15L, 1L, 5L, 8L, 11L), .Label = c(
" ... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ... ",
" ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ",
" ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ",
" ... KRAS (Wild). ...",
" ... KRAS results are in. Patient has the mutation. ... ",
" ... KRAS results still pending. Note that patient was negative for Lynch mutation. ...",
" ... KRAS test results pending. Note that patient was negative for Lynch mutation. ...",
" ... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...",
" ... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...",
" ... Ordered KRAS testing. Waiting for results. ...",
" ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...",
" ... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...",
" ... Still need to order KRAS mutation testing. ... ",
" ... Tumor is negative for KRAS mutation. ...",
" ... Tumor is wild type. Patient is eligible to receive Eribtux. ...",
" ... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."
), class = "factor")), .Names = c("profile_key", "encounter_date", "raw"),
row.names = c(NA, -16L), class = "data.frame")
The following code displays the results of so-called "simple" coding.
#### Simple coding ####
KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005", "001-006", "001-007")
KRAStested <- c(2,3,2,2,2,3,3)
KRASwild <- c(1,0,2,0,3,1,3)
KRASmutant <- c(4,2,2,3,1,2,2)
simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
simpleData
Here, KRAStested is calculated by summing all references to "KRAS" for each patient. Wild is calculated by summing all references to "wild type", "wild", and "negative" that come within 20 words of the closest reference to KRAS. Mutant is calculated by summing all references to "mutant", "mutated", and "positive" that occur within 20 words of the closest reference to KRAS.
The second kind of coding is what I'm referring to as "complex coding". The following code displays the results of this type of coding.
#### Complex coding ####
KRAStested <- c(2,1,0,2,2,2,3)
KRASwild <- c(1,0,0,0,3,0,3)
KRASmutant <- c(0,0,0,3,0,1,0)
complexData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
complexData
The results of "complex coding" differ substantially from those obtained under "simple coding" and I think illustrate the potential problems with that approach. With "complex coding", the goal would be to identify and sum only true references to KRAS testing and true references to the result of that testing (either wild type/negative or mutant/positive).
True references to KRAS testing would be identified using a set of qualifiers that eliminate the false references. So, for example, one of the patients in my (made up) sample data has the phrase "Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux" in their medical record. In this case, "Will" is a qualifier that indicates this is not a true reference to KRAS testing. For this exercise, other qualifiers related to KRAS testing would include "need", "order" (but not the past tense "ordered"), "wait", "waiting", "await", and "awaiting". To be a qualifier, these terms would need to occur within 12 words of the closest true reference to KRAS.
True references to the results of testing would also be identified using a set of qualifiers that eliminate false references. Here the list of qualifiers would include "if", "lynch", "kras mutation test", "kras mutation testing" and "for kras mutation". Qualifiers would need to come within 12 words of a true reference to KRAS testing.
There's an additional wrinkle for identifying true references to the results of testing. One also needs to take into account the presence of what I'm calling "nullifiers". For purposes of this exercise, nullifiers include "Ua Protein", "ER/PR", and "HER2/neu". If "positive" or "negative" come closer to one of these words than to a true reference to KRAS, then they should not be used to identify the results of KRAS testing.
Help with either type of coding would be greatly appreciated.
Thanks,
Paul
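A minimal sketch of the "simple" coding described above, in base R. It is an illustration under stated assumptions, not a tested solution to the full task: the helper names tokenize and count_near_kras, the tokenizing rule, and the three-row notes data frame (notes copied from the sample data with the "..." padding removed) are all hypothetical. Each note is split into lower-case words, "KRAS" tokens are counted, and a result term is counted only when it lies within 20 words of the closest KRAS token.

```r
# Hypothetical three-note subset of the sample data ("..." padding removed).
notes <- data.frame(
  profile_key = c("001-001", "001-001", "001-004"),
  raw = c(
    "Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux.",
    "Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve.",
    "KRAS (mutated). Will not prescribe Erbitux due to mutation."
  ),
  stringsAsFactors = FALSE
)

# Assumed tokenizing rule: lower-case the note, then split on anything
# that is not a letter, digit, or "/" (so "ER/PR" stays one token).
tokenize <- function(x) {
  words <- strsplit(tolower(x), "[^a-z0-9/]+")[[1]]
  words[nzchar(words)]
}

# Count tokens in `terms` that lie within `window` words of the
# closest "kras" token in the same note.
count_near_kras <- function(x, terms, window = 20) {
  words <- tokenize(x)
  kras_pos <- which(words == "kras")
  term_pos <- which(words %in% terms)
  if (length(kras_pos) == 0 || length(term_pos) == 0) return(0L)
  # Word distance from each term to its nearest KRAS mention.
  nearest <- vapply(term_pos, function(p) min(abs(p - kras_pos)), numeric(1))
  sum(nearest <= window)
}

# Per-note counts. "wild type" is covered by the single token "wild".
notes$tested <- vapply(notes$raw, function(x) sum(tokenize(x) == "kras"),
                       integer(1), USE.NAMES = FALSE)
notes$wild <- vapply(notes$raw, count_near_kras, numeric(1),
                     terms = c("wild", "negative"), USE.NAMES = FALSE)
notes$mutant <- vapply(notes$raw, count_near_kras, numeric(1),
                       terms = c("mutant", "mutated", "positive"),
                       USE.NAMES = FALSE)

# Sum the per-note counts within each patient.
simple <- aggregate(cbind(tested, wild, mutant) ~ profile_key,
                    data = notes, FUN = sum)
simple
```

Because the matching is exact, misspelled terms in the notes (for example "positve" or "mutaton") are not counted, so the totals need not agree with the hand-coded simpleData table. The qualifier and nullifier rules of "complex coding" are not handled in this sketch.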
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/