Removing words and initials with tm
Thanks Jeff. I'll add that to the ever-growing list my current studies are generating daily. :-)

Cheers
S
On 10/04/15 14:32, Jeff Newmiller wrote:
"I suspect that it might have something to do with regular expressions, but to be honest, I'm (currently) pretty crap with those."
I cannot think of a better incentive to take action on this hole in your education and buckle down to learn regular expressions. There are many books and tutorials available.
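As a rough illustration of what regular expressions offer for this particular problem, here is a minimal sketch in base R. The initials and the sample sentence are made up for the example and are not taken from the poster's corpus. It relies on the `\b` word-boundary anchor, so an initial is removed only when it stands alone as a whole word.

```r
# Hypothetical initials to strip; not taken from the original corpus.
initials <- c("am", "ec", "ar")

# Made-up sample text containing both stand-alone initials and
# ordinary words that happen to contain the same letters.
text <- "ar became unwell before the arrival because am and ec were late"

# \\b matches a word boundary, so each initial is matched only as a
# whole word, never as part of a longer word such as 'became'.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

cleaned <- gsub(pattern, "", text, perl = TRUE)

# Collapse the runs of spaces left behind and trim both ends.
cleaned <- trimws(gsub("\\s+", " ", cleaned))

print(cleaned)
```

In this sketch 'became', 'because' and 'arrival' survive intact, while the stand-alone 'ar', 'am' and 'ec' are dropped. Whether the same pattern suits a real corpus depends on how the initials actually appear there (with full stops, in upper case, and so on).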
---------------------------------------------------------------------------
Jeff Newmiller
DCN: <jdnewmil at dcn.davis.ca.us>
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
---------------------------------------------------------------------------
Sent from my phone. Please excuse my brevity.
On April 10, 2015 3:19:51 AM PDT, Sun Shine <phaedrusv at gmail.com> wrote:
Hi list

Using the tm package, part of the pre-processing work is to remove words, etc. from the corpus. I wish to remove people's names and also their initials, which are peppered throughout the corpus. The problem is that some people's initials are the same as parts of common words, so removing them mangles those words, e.g.:

removing 'am' turns 'became' into 'bec e'
removing 'ec' turns 'because' into 'b ause'
removing 'ar' turns 'arrival' into 'rival' (which has a completely different meaning)

Is there any way of doing this without leaving a trail of nonsense half-terms behind? I suspect that it might have something to do with regular expressions, but to be honest, I'm (currently) pretty crap with those.

Would it make a difference if I removed initials and names *prior* to converting all text to lower case? That way I would remove 'AM' and, because 'became' is lower case, it should remain unaffected.

Any recommendations on how best to proceed with this?

Thanks as always.

Sun
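The idea of removing initials before lower-casing can be sketched in base R as follows. This is only an illustration under stated assumptions: the upper-case initials and the sample text are invented, and the corpus is assumed to write initials consistently in capitals. Because `gsub()` is case-sensitive by default, 'AM' cannot match the 'am' inside 'became'. The word-boundary anchors are kept as an extra safeguard against upper-case words that merely contain the initials.

```r
# Hypothetical upper-case initials and sample text, for illustration only.
initials <- c("AM", "EC", "AR")
raw <- "AM became ill. EC left because of the arrival of AR."

# Whole-word pattern; matching is case-sensitive unless ignore.case = TRUE.
pattern <- paste0("\\b(", paste(initials, collapse = "|"), ")\\b")

# Step 1: remove the initials while the text still has its original case.
no_initials <- gsub(pattern, "", raw, perl = TRUE)

# Step 2: only now convert to lower case.
lowered <- tolower(no_initials)

# Step 3: tidy up the spaces left behind by the removals.
lowered <- trimws(gsub("\\s+", " ", lowered))

print(lowered)
```

One caveat with this ordering: any genuine upper-case word that coincides with an initial (for example 'AM' used as a time marker) would be removed as well, so the list of initials is worth checking against the corpus first.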
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.