Skip to content
Prev 164702 / 398506 Next

gregexpr - match overlap mishandled (PR#13391)

Greg Snow wrote:
(we have just had an offline communication, it should be fine to keep it
that way)
that's right, there are the dfa and the nfa approaches (and more), and
the latter is the regex-driven approach of perl and many other
implementations.  they work differently, and also differ in what you can
do with them.

i was talking about updating the string pointer to the position where
the next match in a global (iterative) matching process can start. 
while the nfa approach moves, in general, back and forth through the
string, and the dfa approach steps through the string linearly keeping a
list of possible matches, the end result is the same:  if there is a
match, the next match will not start earlier than after the end of the
current match.  what a dfa and an nfa engine will actually match given a
text and a pattern may differ:

perl -e 'print "ab" =~ /a|ab/g'
# a

echo ab | egrep -o 'a|ab'
# ab

but none of them will report overlapping matches:

perl -e 'print "aaaaa" =~ /aa/g
# 2 matches

echo aaaaa | egrep -o aa
# 2 matches

(afaik, egrep uses a posix-compliant dfa engine) 

to achieve the effect of overlapping matches, you either need to
manually move the pointer while the global match proceeds, or
sequentially perform single matches on successive substrings of the
input string (which can give you the same match more than once,
though).  it appears that my earlier suggestion was flawed, the
following is a bit cleaner:

$string = # some string
$pattern = # some pattern

@matches = ();
while ($string =~ /$pattern/g) { push @matches, [$-[0], $&];
pos($string) -= (length($&) - 1) }

after each successful match, it moves to the position right after the
start of the successful match.


the following will capture all possible matches for all alternatives in
the regex (or so it seems):

$string =~/(?:$pattern)(??{ push @matches, [$-[0], $&] })/g

so that "aabb" =~ /(?:a|abb?)(??{ push @matches, [$-[0], $&] })/g will
give 4 matches.

again, not sure if this can be done within r.

vQ