Removing words and initials with tm
Hi Jim
The name's come up on my radar, but that's about it. I'll look into it.
Thanks for the reference.
All the best
S
On 10/04/15 23:36, Jim Lemon wrote:
Hi Sun,
No, I was thinking of something like hunspell, which seems to fit into
the sort of work that you are doing.
Jim
On Fri, Apr 10, 2015 at 11:42 PM, Sun Shine <phaedrusv at gmail.com> wrote:
Thanks Jeff.
I'll add that to the ever-growing list my current studies are
generating daily. :-)
Cheers
S
On 10/04/15 14:32, Jeff Newmiller wrote:
"I suspect that it might have something to do with regular
expressions, but to be honest, I'm (currently) pretty crap
with those."
I cannot think of a better incentive to take action on this
hole in your education and buckle down to learn regular
expressions. There are many books and tutorials available.
---------------------------------------------------------------------------
Jeff Newmiller
DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedrusv at gmail.com> wrote:
Hi list
Using the tm package, part of the pre-processing work is to remove
words, etc. from the corpus.
I wish to remove people's names and also their initials, which are
peppered throughout the corpus. But some people's initials are the same
as parts of common words - e.g. 'am' = 'became' => 'bec e' or
'ec' = 'because' => 'b ause' or 'ar' = 'arrival' => 'rival' (which has a
completely different meaning).
Is there any way of doing this without leaving a trail of nonsense
half-terms behind? I suspect that it might have something to do with
regular expressions, but to be honest, I'm (currently) pretty crap with
those.
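One way to read the regular-expression idea raised here is to anchor each initial to word boundaries, so that 'am' is removed only when it stands alone and never from inside 'became'. The following is a minimal sketch in base R, not a confirmed answer from this thread; the sample sentence and the `initials` vector are made up for illustration.

```r
# Hypothetical sample text and initials, invented for illustration only.
text <- "am became the chair because ec noted the arrival of ar"
initials <- c("am", "ec", "ar")

# Naive removal: the pattern matches anywhere, including inside words,
# which is what leaves half-terms such as 'rival' behind.
naive <- gsub(paste(initials, collapse = "|"), "", text)

# Word-boundary removal: \\b requires the match to start and end at the
# edge of a word, so 'became', 'because' and 'arrival' are left intact.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")
bounded <- gsub(pattern, "", text, perl = TRUE)

# Tidy up the spaces left where the initials used to be.
bounded <- gsub("\\s+", " ", trimws(bounded))

print(naive)
print(bounded)
```

With a tm corpus, the same `gsub()` call could presumably be wrapped in `content_transformer()` and applied with `tm_map()`; that step is left out here so the sketch stays runnable without the package.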
Would it make a difference if I removed initials and names *prior* to
converting all text to lower case, so I remove 'AM' and because 'became'
is lower case, it should remain unaffected?
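The ordering question can be tried out directly: `gsub()` is case-sensitive unless `ignore.case = TRUE` is set, so removing upper-case initials before calling `tolower()` should leave lower-case words alone. This is a small sketch under the assumption that initials always appear in capitals in the corpus; the sentence and initials are invented for illustration.

```r
# Hypothetical sample text; assumes initials appear in upper case.
text <- "AM became chair. EC left because AR noted the arrival."
initials <- c("AM", "EC", "AR")

# Case-sensitive match (the default), still anchored to word boundaries
# so that capitalised words containing these letters are not clipped.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", text, perl = TRUE)

# Only now collapse leftover whitespace and convert to lower case.
cleaned <- tolower(gsub("\\s+", " ", trimws(cleaned)))

print(cleaned)
```

One caveat with this ordering: any genuine upper-case token that happens to equal an initial (for example 'AM' in a time of day) would be removed as well, so it may be worth checking the matches before deleting them.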
Any recommendations on how best to proceed with this?
Thanks as always.
Sun
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.