Question about regexp edge case

4 messages · Duncan Murdoch, Tomas Kalibera

#
On 7/29/24 09:37, Ivan Krylov via R-devel wrote:
Thanks. It seems that TRE is now maintained again upstream, so it would 
be best to discuss this with TRE maintainers directly (if not already 
solved by https://github.com/laurikari/tre/pull/98).

The same applies to any other open TRE issues.

Best Tomas
#
Thanks Tomas.  Do note that my original post also mentioned a bug or doc 
error in the PCRE docs for this regexp:
Duncan
On 2024-08-01 6:49 a.m., Tomas Kalibera wrote:
7 days later
#
On 8/1/24 20:55, Duncan Murdoch wrote:
This is a change in documented behavior in PCRE. PCRE2 10.43 
(share/man/man3/pcre2pattern.3) says:

"If the first number is omitted, the lower limit is taken as zero; in 
this case the upper limit must be present. X{,4} is interpreted as 
X{0,4}. In earlier versions such a sequence was not interpreted as a 
quantifier. Other regular expression engines may behave either way."

And the changelog:

"29. Perl 5.34.0 changed the meaning of (for example) {,3} which did not 
used to be treated as a quantifier. Now it is interpreted as {0,3} and 
PCRE2 has changed to match. Note that {,} is still not a quantifier."

Sadly the previous behavior was also documented in pcre2pattern.3:

"For example, {,6} is not a quantifier, but a literal string of four 
characters"

I've confirmed this with R built against PCRE2 10.42, 10.43 and 10.44. In 
practice, users would most likely see the new behavior on Windows, where 
Rtools44 has PCRE2 10.43.
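
For anyone wanting to check their own build, here is a minimal sketch (not part of the original message) that prints the PCRE version R reports and probes how "{,4}" is treated with perl = TRUE. The variable names are made up for illustration, and no output is shown because the result depends on the PCRE version in use.

```r
# PCRE version string reported by this R build.
pcre_version <- extSoftVersion()[["PCRE"]]
print(pcre_version)

# Probe how this build treats "{,4}" in a Perl-compatible pattern.
# TRUE here means "X{,4}" was read as the quantifier X{0,4}.
as_quantifier <- grepl("^X{,4}$", "XXX", perl = TRUE)

# TRUE here means "{,4}" was read as four literal characters after "X".
as_literal <- grepl("^X{,4}$", "X{,4}", perl = TRUE)

print(c(quantifier = as_quantifier, literal = as_literal))
```

Exactly one of the two probes should come out TRUE on any given build, which shows which of the two documented interpretations applies.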

The R documentation (?regex) refers to the PCRE2 documentation for 
"complete details" and mentions how to find out which version of 
PCRE(2) is in use. I've now added a warning that PCRE behavior may 
change between versions, with {,m} as an example. I don't think we 
can do much more - I don't think we should be replicating the PCRE 
documentation/changelog - but we could add more examples if any 
important ones appear. Also, we don't want to write R programs that depend 
on concrete versions of PCRE.

It is a good thing that ?regex doesn't document "{,m}", because it 
cannot be used reliably/portably. One should use one of the documented 
forms instead, e.g. "{0,m}". Admittedly, it is hard to use only the 
documented subset of behavior (in ?regex), because one also needs to 
avoid accidentally running into undocumented expressions with special 
meaning, as in this case. Still, authors could try to defensively avoid 
risky literal text in patterns, such as text involving "{}" or otherwise 
resembling documented expressions with a special meaning.
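
As an illustration of that defensive style (a sketch added here, not code from the original message), the documented "{0,m}" form can be used for the quantifier, and braces that are meant literally can be escaped or matched with fixed = TRUE. The variable names are hypothetical.

```r
# Documented form: zero to four repetitions of "X".
bounded <- grepl("^X{0,4}$", c("", "XXX", "XXXXX"), perl = TRUE)

# To match the literal characters "{,4}", escape the braces so they
# cannot be taken for a quantifier ...
escaped <- grepl("^X\\{,4\\}$", "X{,4}", perl = TRUE)

# ... or bypass regular-expression interpretation altogether.
literal_fixed <- grepl("X{,4}", "X{,4}", fixed = TRUE)

print(bounded)
print(c(escaped = escaped, fixed = literal_fixed))
```

Neither form relies on how a particular PCRE version reads "{,m}", so the results should not change when PCRE is upgraded.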

Best
Tomas
#
Thanks!  I think your suggested additions to the docs are perfect.

Duncan Murdoch
On 2024-08-09 5:01 a.m., Tomas Kalibera wrote: