-----Original Message-----
From: Wacek Kusnierczyk [mailto:Waclaw.Marcin.Kusnierczyk at idi.ntnu.no]
Sent: Friday, December 12, 2008 4:08 PM
To: Greg Snow
Cc: R help
Subject: Re: [Rd] gregexpr - match overlap mishandled (PR#13391)
Greg Snow wrote:
Where do you get "should" and "expect" from? All the regular
expression tools that I am familiar with only match non-overlapping
patterns unless you do extra to specify otherwise. One of the standard
references for regular expressions if you really want to understand
what is going on is "Mastering Regular Expressions" by Jeffrey Friedl.
You should really read through that book before passing judgment on the
correctness of an implementation.
If you want the overlaps, you need to come up with a regular
expression that will match without consuming all of the string. Here
is one way to do it with your example:
> gregexpr("1122(?=1122)", paste(rep("1122", 10), collapse=""), perl=TRUE)
[[1]]
[1] 1 5 9 13 17 21 25 29 33
attr(,"match.length")
[1] 4 4 4 4 4 4 4 4 4
another option would be to move the anchor backwards after each match,
but i'm not sure if the problem really needs it and if it could be done
from within r.
greg (and another person who answered this post earlier):
while your frustration is understandable, i think reid (and possibly
other users as well) would benefit from a brief explanation instead of
your emotional reactions. you ought to be more patient and less
arrogant with newbies who will often think there is a bug in r when
there isn't.
reid:
when matching is performed, there is a pointer moved through the
string. in global matching, after a match is found the pointer is just
behind the matched substring, and further matching proceeds from there.
for example, suppose you match "aaa" (the string) with "aa" (the
pattern) globally. after the first successful match, the position
pointer is *behind the second a* in the string, and no further match
can be found from there. in this context, 'global' does not mean that
all possible matches are found, rather that matching is performed
iteratively.
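
a small sketch of the pointer behaviour described above, using the same
"aaa" / "aa" example. the variable names are just for illustration; the
second call assumes perl=TRUE so that the zero-width lookahead syntax is
available:

```r
# consuming pattern: after the first match the pointer sits behind the
# second "a", so the remaining single "a" cannot match
consuming <- gregexpr("aa", "aaa")[[1]]

# zero-width lookahead: nothing is consumed, so every start position
# in the string gets tried
overlapping <- gregexpr("(?=aa)", "aaa", perl = TRUE)[[1]]

n_consuming   <- length(as.integer(consuming))
n_overlapping <- length(as.integer(overlapping))
```

with the consuming pattern only the match starting at position 1 is
reported; the lookahead version should also report the one starting at
position 2.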
greg's lookahead pattern is probably a solution to your problem, though
the matches have length 4, not 8. in perl, you could manually move back
the anchor after each match, e.g.:
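$string = "1122" x 10;
$n = 4;    # half the length of the pattern "11221122"
@matches = ();
while ($string =~ /11221122/g) {
    push @matches, [$-[0], $&];    # start offset and matched text
    pos($string) -= $n;            # move the anchor back to allow overlap
}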
now @matches has 9 elements, each a ref to an array with the starting
position and the content (of length 8) of the respective match:
@matches = ([0, "11221122"], [4, "11221122"], ...)
not sure if you can do this within r. not sure if you'll ever need it.
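
one way it might be done in r, as a rough sketch rather than a polished
solution: call regexpr() repeatedly on the remainder of the string and
step the offset back by hand after each hit. the names offset, step and
starts are made up for this example, and fixed=TRUE is used only because
the pattern here is a literal string:

```r
s       <- paste(rep("1122", 10), collapse = "")
pattern <- "11221122"
step    <- 4L          # how far to move the anchor back after a match
offset  <- 0L          # number of characters already skipped
starts  <- integer(0)

repeat {
  # search only the part of the string after the current offset
  m <- regexpr(pattern, substring(s, offset + 1L), fixed = TRUE)
  if (as.integer(m) == -1L) break
  start  <- offset + as.integer(m)
  starts <- c(starts, start)
  # end of this match, moved back by 'step' characters
  offset <- start - 1L + attr(m, "match.length") - step
}

# content of each (overlapping) match
matches <- substring(s, starts, starts + nchar(pattern) - 1L)
```

starts then holds the 1-based start positions (r counts from 1, perl
from 0) and matches the corresponding 8-character substrings.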
for more complex cases, when you need overlapping matches and you also
need their content, greg's solution might not do, but in general it is
the way to go.
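
if the content is needed as well, one possible variation is to put the
whole pattern inside the lookahead and wrap it in a capture group. this
sketch assumes an r version in which gregexpr() with perl=TRUE returns
the "capture.start" and "capture.length" attributes; the variable names
are only illustrative:

```r
s <- paste(rep("1122", 10), collapse = "")

# the match itself is zero-width; the capture group records the content
m <- gregexpr("(?=(11221122))", s, perl = TRUE)[[1]]

starts <- as.vector(attr(m, "capture.start"))
lens   <- as.vector(attr(m, "capture.length"))

# extract the captured (overlapping) substrings
matches <- substring(s, starts, starts + lens - 1)
```

since the lengths come from the capture attributes rather than from a
constant, the same approach should carry over to patterns whose matches
vary in length.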
vQ