Discovering patterns in textual strings

2 messages · Jeff Reichman, Bert Gunter

R Help Forum

 

Is there an R library (or some other way) to extract unique character strings,
or repeating patterns, from textual strings? Say, for example, I have the
following records:

 

Abc_1234_kjhksh_276
Abc
Abc_1234_lakdofyo_324
Bce_876_skdhk_*&^%*&
Bce
Bce_454

 

And I would like to see the following results:

Abc
Abc_1234
Bce

Jeff Reichman
The answer is, of course, using regular expressions and/or libraries
therefor. However, I do not think you have defined your problem
sufficiently. Some questions I have:

1. Do possible patterns to be matched always appear at the beginning
of your strings?

2. Do they always appear as whole tokens between a specified separator
("_" in your example), between one of several specified separators, or
otherwise?

3. Do spaces or other nonprinting characters occur in your strings?

e.g. would

abc_something
this.is_a long stringwithabcinthemiddle

be considered matching?
There are undoubtedly other possibilities that I've missed.
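
To make the difference behind question 1 concrete, here is a small sketch using base R's grepl() on the two strings above. The pattern "abc" and the variable names are only illustrative assumptions; which behaviour is wanted depends on the answers to the questions.

```r
x <- c("abc_something", "this.is_a long stringwithabcinthemiddle")

# Anchored: "abc" must be the first token, followed by "_" or the end of the string.
at_start <- grepl("^abc(_|$)", x)

# Unanchored: the literal text "abc" may appear anywhere in the string.
anywhere <- grepl("abc", x, fixed = TRUE)

data.frame(x, at_start, anywhere)
```

Under the anchored reading only the first string matches; under the unanchored reading both do.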

You may also find it useful to check this "task view" out for possibilities:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
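
As one possible starting point, and only a sketch: if the patterns of interest are assumed to be "_"-delimited prefixes at the beginning of each record that occur in at least two records, base R is enough. The helper names leading_prefixes() and shared_prefixes(), and the min_count threshold, are hypothetical and not taken from any package.

```r
records <- c("Abc_1234_kjhksh_276", "Abc", "Abc_1234_lakdofyo_324",
             "Bce_876_skdhk_*&^%*&", "Bce", "Bce_454")

# Return every sep-delimited leading prefix of a single string,
# e.g. "a_b_c" gives "a", "a_b", "a_b_c".
leading_prefixes <- function(x, sep = "_") {
  tokens <- strsplit(x, sep, fixed = TRUE)[[1]]
  vapply(seq_along(tokens),
         function(i) paste(tokens[seq_len(i)], collapse = sep),
         character(1))
}

# Count how many records contribute each prefix and keep the prefixes
# that occur in at least min_count records (assumption: 2 means "repeating").
shared_prefixes <- function(x, sep = "_", min_count = 2) {
  all_prefixes <- unlist(lapply(x, leading_prefixes, sep = sep))
  counts <- table(all_prefixes)
  names(counts)[counts >= min_count]
}

result <- shared_prefixes(records)
print(result)
```

On the six sample records this keeps Abc, Abc_1234 and Bce. If the separator varies or patterns can occur mid-string, the assumptions above no longer hold and a different approach would be needed.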

Cheers,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)
On Fri, May 4, 2018 at 3:25 PM, Jeff Reichman <reichmanj at sbcglobal.net> wrote: