Minimal match to regexp?

7 messages · Andrew Simmons, Jeff Newmiller, Duncan Murdoch

#
The docs for ?regexp say this:  "By default repetition is greedy, so the 
maximal possible number of repeats is used. This can be changed to 
'minimal' by appending ? to the quantifier. (There are further 
quantifiers that allow approximate matching: see the TRE documentation.)"

I want the minimal match, but I don't seem to be getting it.  For example,

x <- "abaca"
grep("a.*?a", x, value = TRUE)
#> [1] "abaca"

Shouldn't I have gotten "aba", which is the first match to "a.*a"?  If 
not, what would be the regexp that would give me the first match to 
"a.*a", without greedy expansion of the .*?

Duncan Murdoch
#
grep(value = TRUE) just returns the strings which match the pattern. You
have to use regexpr() or gregexpr() if you want to know where the matches
are:

```
x <- "abaca"

# extract only the first match with regexpr()
m <- regexpr("a.*?a", x)
regmatches(x, m)

# or

# extract every match with gregexpr()
m <- gregexpr("a.*?a", x)
regmatches(x, m)
```

You could also use sub() to remove the rest of the string:
`sub("^.*(a.*?a).*$", "\\1", x)`
which keeps only the text matched inside the parentheses.
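
A caveat on the sub() approach, as a small sketch rather than a definitive statement about the default engine: the leading `.*` is itself greedy, so it competes with the capture group for the start of the string. Under perl = TRUE, where PCRE backtracking rules apply, the greedy prefix pushes the group as far right as possible; making the prefix lazy as well returns the first match. The variable names `greedy_lead` and `lazy_lead` are only for illustration, and results with the default (TRE) engine may differ.

```r
x <- "abaca"

# Greedy prefix: ".*" swallows "ab", so the group captures the last candidate.
greedy_lead <- sub("^.*(a.*?a).*$", "\\1", x, perl = TRUE)
greedy_lead
#> [1] "aca"

# Lazy prefix: ".*?" matches nothing, so the group starts at the first "a".
lazy_lead <- sub("^.*?(a.*?a).*$", "\\1", x, perl = TRUE)
lazy_lead
#> [1] "aba"
```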
On Wed, Jan 25, 2023, 19:19 Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
#
On 25/01/2023 7:19 p.m., Duncan Murdoch wrote:
Sorry, that was a dumb question.  Of course grep returned the whole 
thing.  I should be using regexpr() or some related function to extract 
the match.

Duncan Murdoch
#
Perhaps

sub( "^.*(a.*?a).*$", "\\1", x )
On January 25, 2023 4:19:01 PM PST, Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
#
Thanks for pointing out my mistake.  I oversimplified the real problem.

I'll try to post a version of it that comes closer:  Suppose I have a 
string like this:

x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

If I cat() it, I see that it is really markdown source:

   ```html
   blah blah
   ```

   ```r
   blah blah
   ```

I want to find the part that includes the html block, but not the r 
block.  So I want to match "```html", followed by a minimal number of 
characters, then "```".  Then this pattern works:

   pattern <- "\n```html\n.*?\n```\n"

and we get the right answer:

   cat(regmatches(x, regexpr(pattern, x)))

   ```html
   blah blah
   ```

Okay, but this flavour of markdown says there can be more backticks, not 
just 3.  So the block might look like

   ````html
   blah blah
   ````

I need to have the same number of backticks in the opening and closing 
marker.  So I make the pattern more complicated, and it doesn't work:

   pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"

This matches all of x:

   > pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"
   > cat(regmatches(x, regexpr(pattern2, x)))

   ```html
   blah blah
   ```

   ```r
   blah blah
   ```


Is that a bug, or am I making a silly mistake again?
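
Independently of whether this is a bug, the "same number of backticks" requirement can be met without a backreference or a lazy quantifier. The following is a minimal sketch, assuming the block always opens with a line of three or more backticks followed by `html`; the helper name `extract_html_block_fixed` and the test string `x4` are hypothetical, not from this thread. It finds the opening fence with an ordinary regexp, counts its backticks, and then looks for the first identical closing fence with a fixed-string search.

```r
# Hypothetical helper (not from the thread): find the html block without
# relying on a backreference or on lazy quantifiers.
extract_html_block_fixed <- function(text) {
  # Opening fence: newline, three or more backticks, "html", newline.
  open <- regexpr("\n`{3,}html\n", text)
  if (open[1] == -1L) return(character(0))
  open_end <- open[1] + attr(open, "match.length") - 1L

  # Backtick count = match length minus the 6 characters of "\nhtml\n".
  n_ticks <- attr(open, "match.length") - 6L
  closer <- paste0("\n", strrep("`", n_ticks), "\n")

  # A fixed-string search returns the first occurrence, i.e. the shortest block.
  rest <- substr(text, open_end + 1L, nchar(text))
  close <- regexpr(closer, rest, fixed = TRUE)
  if (close[1] == -1L) return(character(0))

  substr(text, open[1], open_end + close[1] + nchar(closer) - 1L)
}

x3 <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"
x4 <- "\n````html\nline one\nline two\n````\n\n```r\nblah blah\n```\n"

cat(extract_html_block_fixed(x3))
cat(extract_html_block_fixed(x4))
```

Because the closing fence is searched for as a literal string, a three-backtick line inside a four-backtick block is not mistaken for the end of the block.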

Duncan Murdoch
On 25/01/2023 7:34 p.m., Andrew Simmons wrote:
#
It seems like a bug to me. Using perl = TRUE, I see the desired result:

```
x <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"

pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"

cat(regmatches(x, regexpr(pattern2, x, perl = TRUE)))
```

If you change it to something like:

```
x <- c(
    "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n",
    "\n```html\nblah blah \n```\n"
)

pattern2 <- "\n([`]{3,})html\n.*?\n\\1\n"

print(regmatches(x, regexpr(pattern2, x)), width = 10)
```

you can see that it does find the match, so the combination of *? and
\\1 must be messing up regexpr(). They seem to work perfectly fine on
their own.
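
One caveat if perl = TRUE is used as the workaround: under PCRE, `.` does not match a newline unless the `(?s)` flag is set, so the pattern as written only covers blocks whose body is a single line. A rough sketch with `(?s)` added follows; the helper name `extract_html_block` and the string `x4` are hypothetical and only there to exercise the four-backtick, multi-line case.

```r
# Hypothetical helper (not from the thread): first html fenced block whose
# closing fence repeats the backticks of the opening fence.
extract_html_block <- function(text) {
  # (?s) lets "." match newlines under PCRE, so multi-line bodies are covered;
  # \\1 must repeat exactly the backticks captured by the first group.
  pattern <- "(?s)\n(`{3,})html\n.*?\n\\1\n"
  regmatches(text, regexpr(pattern, text, perl = TRUE))
}

x3 <- "\n```html\nblah blah \n```\n\n```r\nblah blah\n```\n"
x4 <- "\n````html\nline one\nline two\n````\n\n```r\nblah blah\n```\n"

cat(extract_html_block(x3))
cat(extract_html_block(x4))
```

In both cases the r block is left out, because the lazy `.*?` stops at the first closing fence that matches the captured run of backticks.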
On Wed, Jan 25, 2023 at 7:57 PM Duncan Murdoch <murdoch.duncan at gmail.com> wrote:
#
I'll submit a bug report.
On 25/01/2023 8:38 p.m., Andrew Simmons wrote: