Discovering patterns in textual strings

"Does that help?"

No. I am not your private consultant. You need to reply to the list, which
I have cc'ed here, not just me.

I am still somewhat confused by your specifications, but others may not be.
Part of my confusion stems from your failure to provide a reproducible
example (see e.g. the posting guide linked below).  For example, I cannot
tell from your text whether the Abc and Bce strings contain one or more
spaces at the end. I shall assume they may but need not.

Anyway, here is a reproducible example and solution that assumes that the
substrings/patterns of interest to you occur at the beginning of the
strings and may or may not be followed by one of ".", "_", or " " (space)
and then possibly further text, which should be ignored. Assuming that you
are familiar with regular expressions, maybe this will help to get you
started even if I have misunderstood your specifications. If you aren't
familiar with regexes, the stringr package may provide a gentler interface
than R's raw regex functionality. Or maybe someone else can suggest a
better approach (which is another reason why you should reply to the list,
not just me).

z <- c("abc",
       "abc_def",
       "abc.def",
       "abc def",
       "abcd_ef",
       "abcd",
       "e","f")

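## drop a separator run (".", "_" or space) and the text after it;
## strings without a separator are returned unchanged
pats <- unique(sub("^(.+)[. _]+.*", "\\1", z))
## gives:
## [1] "abc"  "abcd" "e"    "f"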
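
A hedged aside on the regex above: the `.+` in front of the separator class is greedy, so a string with more than one separator keeps everything up to the last separator run. If only the text before the first separator is wanted, a negated character class is one option. The sketch below uses made-up strings; `x`, `greedy` and `first_piece` are illustrative names, not part of the original example.

```r
## Illustrative strings, not taken from the original data
x <- c("abc_def_ghi", "abc  ", "abc")

## Greedy version from the example: (.+) grabs as much as it can
greedy <- sub("^(.+)[. _]+.*", "\\1", x)

## Alternative: capture only non-separator characters, so the capture
## stops at the first ".", "_" or space
first_piece <- sub("^([^. _]+)[. _].*", "\\1", x)

greedy[1]      # "abc_def"
first_piece    # "abc" "abc" "abc"
```

For the example vector `z`, both versions return the same four patterns.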


This gives you the four separate patterns that you could then use to group
your records, perhaps as a list of indices into z:
[[1]]
[1] 1 2 3 4

[[2]]
[1] 5 6

[[3]]
[1] 7

[[4]]
[1] 8

That is, indices 1-4 in z are the first group; 5 and 6 are the second; etc.
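
The command that produced the list above does not appear in this message. One way to reproduce it, offered as a sketch and not as the original code, is to search `z` for each pattern followed by a separator or the end of the string. The names `groups` and `keys` are illustrative, and the `grep()` version assumes the patterns contain no regex metacharacters.

```r
z <- c("abc", "abc_def", "abc.def", "abc def", "abcd_ef", "abcd", "e", "f")
pats <- unique(sub("^(.+)[. _]+.*", "\\1", z))

## For each pattern, find the indices of the strings that start with it
## and then either end or continue with ".", "_" or a space
groups <- lapply(pats, function(p) grep(paste0("^", p, "([. _]|$)"), z))
groups

## Alternative without a second regex: split the indices by the
## extracted prefix of each string
keys <- sub("^(.+)[. _]+.*", "\\1", z)
split(seq_along(z), keys)
```

For this `z`, `split()` returns the same groups as a named list, which avoids having to escape patterns that contain characters such as "." or "+".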



Cheers,
Bert

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip)

On Fri, May 4, 2018 at 9:00 PM, Jeff Reichman <reichmanj at sbcglobal.net>
wrote: