Bert
Thank you for the link. I figured there might be something.
Regarding your questions:
This is from a large data set of 53 billion records. The column in question is
AdNames (real-time bidding data).
#1. Generally yes, but not always.
#2. Separators could be underscores (_) or dots (.), as in 1.2.3_ABC ......
#3. Yes. So "Abc 123" could be a matching string.
This would not be considered a match ...
abc_something
this.is_a long stringwithabcinthemiddle
The sequence(s) are always at the beginning (or so it appears). Out
of the 54 billion records, I am able to pull (SparkR sql) 948,679 unique
strings. It is from these unique strings that I want (if possible) to
identify the "key" strings.
1. Abc_1232.niok7j9hd
2. Abc
3. Abc.2#348hfk2.njilo
4. Abc.2
5. Abc.7
6. BAdfr_kajdhf98#kjsdh
7. BAdrf_gofer
948679 ....
So I may have a thousand individual strings, all of which have Abc (or
BAdrf) as a common string. I am looking to pull "Abc", "BAdrf", etc., so
that I can go back and restructure the data to show that any record with
Abc_1232.niok7j9hd is part of the Abc "group" or family.
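One way this grouping might be sketched in base R, assuming the key is
everything before the first underscore or dot (an assumption based on the
samples above, not a confirmed rule). The names ad_names, key, and groups
are only illustrative:

```r
# Sample strings copied from the list above; the real input would be
# the 948,679 unique AdNames.
ad_names <- c("Abc_1232.niok7j9hd", "Abc", "Abc.2#348hfk2.njilo",
              "Abc.2", "Abc.7", "BAdfr_kajdhf98#kjsdh", "BAdrf_gofer")

# Assumption: the "key" is the text before the first "_" or ".".
# Strings without a separator are returned unchanged.
key <- sub("[_.].*$", "", ad_names)

# Number of unique strings per key
table(key)

# Members of each group, indexed by key
groups <- split(ad_names, key)
groups[["Abc"]]
```

On these seven samples the keys come out as Abc (five strings), plus BAdfr
and BAdrf (one string each). Keys that differ only by a transposition or by
case still land in separate groups, so some cleanup of the keys may be
needed afterwards.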
Does that help?
Jeff
-----Original Message-----
From: Bert Gunter <bgunter.4567 at gmail.com>
Sent: Friday, May 4, 2018 5:41 PM
To: reichmanj at sbcglobal.net
Cc: R-help <R-help at r-project.org>
Subject: Re: [R] Discovering patterns in textual strings
The answer is, of course, using regular expressions and/or libraries
therefor. However, I do not think you have defined your problem
sufficiently. Some questions I have:
1. Do possible patterns to be matched always appear at the beginning of
your strings?
2. Always together between specified separators ("_" in your example); or
one of several specified separators; or otherwise?
3. Do spaces or other nonprinting characters occur in your strings?
e.g. would
abc_something
this.is_a long stringwithabcinthemiddle
be considered matching?
There are undoubtedly other possibilities that I've missed.
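To illustrate why the answers matter, here is a small base R sketch of how
the regular expression changes with them. The candidate pattern "Abc" and
the variable names are only assumptions for the example:

```r
x <- c("Abc_1234_kjhksh_276", "Abc", "abc_something",
       "this.is_a long stringwithabcinthemiddle")

# Pattern anchored at the start of the string and followed by a
# separator ("_" or ".") or by the end of the string; case-sensitive.
anchored <- grepl("^Abc([_.]|$)", x)

# Pattern allowed anywhere in the string, ignoring case.
anywhere <- grepl("abc", x, ignore.case = TRUE)

data.frame(x, anchored, anywhere)
```

The anchored version accepts only the first two strings, while the
unanchored, case-insensitive version accepts all four.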
You may also find it useful to check this "task view" out for
possibilities:
https://cran.r-project.org/web/views/NaturalLanguageProcessing.html
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Fri, May 4, 2018 at 3:25 PM, Jeff Reichman <reichmanj at sbcglobal.net>
wrote:
R Help Forum
Is there an R library (or a way) with which I can extract unique character
strings, or repeating patterns, in textual strings? Say, for example, I
have the following records:
Abc_1234_kjhksh_276
Abc
Abc_1234_lakdofyo_324
Bce_876_skdhk_*&^%*&
Bce
Bce_454
And I would like to see the following results:
Abc
Abc_1234
Bce
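A minimal base R sketch of one possible reading of this request: treat "_"
as the only separator, build every leading prefix of each record, and keep
the prefixes shared by at least two records. The threshold of two and the
helper name leading_prefixes are assumptions, not part of the question:

```r
records <- c("Abc_1234_kjhksh_276", "Abc", "Abc_1234_lakdofyo_324",
             "Bce_876_skdhk_*&^%*&", "Bce", "Bce_454")

# All separator-delimited leading prefixes of one string, e.g.
# "Abc_1234_x" gives "Abc", "Abc_1234", "Abc_1234_x".
leading_prefixes <- function(s, sep = "_") {
  parts <- strsplit(s, sep, fixed = TRUE)[[1]]
  vapply(seq_along(parts),
         function(i) paste(parts[seq_len(i)], collapse = sep),
         character(1))
}

# Count in how many distinct records each prefix occurs.
all_prefixes <- unlist(lapply(unique(records), leading_prefixes))
counts <- table(all_prefixes)

# Assumption: a prefix is a "pattern" if at least two records share it.
shared <- names(counts)[as.vector(counts) >= 2]
shared
```

For the six records above this keeps Abc, Abc_1234, and Bce, which agrees
with the results listed. A different threshold or separator set would of
course give a different answer.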
Jeff Reichman