On StackOverflow (here: https://stackoverflow.com/questions/78803652/why-does-gsub-in-r-match-one-character-too-many) there was a question about this result:

 > gsub("^([0-9]{,5}).*","\\1","123456789")
[1] "123456"

The OP expected "12345" as the result. Several points were raised:

- The R docs don't mention the case of {,5} for the default perl = FALSE which uses TRE.
- perl = TRUE gives the OP's expected result of "12345".
- perl = TRUE does *not* give the documented result on at least one system (which is "123456789", because "{,5}" is documented to not be a quantifier, so it should only match the literal string "{,5}").
- Some regexp engines (including Perl and Awk) document that "12345" is correct.

Is any of this worth fixing?

Duncan Murdoch
Question about regexp edge case
2 messages · Duncan Murdoch, Ivan Krylov
On Sun, 28 Jul 2024 20:02:21 -0400, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
gsub("^([0-9]{,5}).*","\\1","123456789")
[1] "123456"
This is in TRE itself: for "^([0-9]{,1})" tre_regexecb returns {.rm_so
= 0, .rm_eo = 1}, matching "1", but for "^([0-9]{,2})" and above it
returns an off-by-one result, {.rm_so = 0, .rm_eo = 3}.
Compiling with TRE_DEBUG, I see it parsed correctly:
catenation, sub 0, 0 tags
  assertions: bol
  iteration {-1, 2}, sub -1, 0 tags, greedy
    literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
...but after tre_expand_ast I see
catenation, sub 0, 1 tags
  assertions: bol
  catenation, sub -1, 1 tags
    tag 0
    union, sub -1, 0 tags
      literal empty
      catenation, sub -1, 0 tags
        literal (0, 9) (48, 57), pos 2, sub -1, 0 tags
        union, sub -1, 0 tags
          literal empty
          catenation, sub -1, 0 tags
            literal (0, 9) (48, 57), pos 1, sub -1, 0 tags
            union, sub -1, 0 tags
              literal empty
              literal (0, 9) (48, 57), pos 0, sub -1, 0 tags
...which has one too many copies of "literal (0,9)". I think it's due
to the expansion loop on line 942 of src/extra/tre/tre-compile.c being
    for (j = iter->min; j < iter->max; j++)
...where 'min' is -1 to denote no minimum. This is further confirmed by
"{0,3}", "{1,3}", "{2,3}", "{3,3}" all working correctly.
Neither TRE documentation [1] nor POSIX [2] specifies the {,n} syntax:
from my reading, it looks like if the upper boundary is specified, the
lower boundary must be specified too. But if we do want to fix this, it
will have to be a special case for iter->min == -1.