gregexpr (PR#9965)

4 messages · dolanp at science.oregonstate.edu, Greg Snow, Brian Ripley +1 more

#
Full_Name: Peter Dolan
Version: 2.5.1
OS: Windows
Submission from: (NULL) (128.193.227.43)


gregexpr does not find all matching substrings if the substrings overlap:
[[1]]
[1] 1
attr(,"match.length")
[1] 4

It does work correctly in Version 2.3.1 under linux.
#
If you want all the matches (including overlaps) then you could try one
of these:
[[1]]
[1] 1 3
attr(,"match.length")
[1] 0 0

[[1]]
[1] 1 3
attr(,"match.length")
[1] 2 2

The book "Mastering Regular Expressions" by Jeffrey Friedl has a lot of
detail on the hows and whys of regular expression matching.
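
The calls that produced the output above did not survive in this archive. As a rough sketch only, assuming a hypothetical input string "ababab" and pattern "abab" (neither is taken from the original post), zero-width lookahead with perl = TRUE is one way to get results of that shape:

```r
# Hypothetical input: the original post's pattern and string are not shown.
x <- "ababab"

# Default behaviour: the search resumes after the end of the previous match,
# so the overlapping occurrence starting at position 3 is skipped.
plain <- gregexpr("abab", x)[[1]]

# Zero-width lookahead: a match consumes no characters, so every starting
# position at which "abab" occurs is reported.
starts <- gregexpr("(?=abab)", x, perl = TRUE)[[1]]

# Consume "ab", then require that another "ab" follows without consuming it.
partial <- gregexpr("ab(?=ab)", x, perl = TRUE)[[1]]

as.integer(plain)              # 1
as.integer(starts)             # 1 3
attr(starts, "match.length")   # 0 0
as.integer(partial)            # 1 3
attr(partial, "match.length")  # 2 2
```

With the lookahead forms the reported match.length no longer equals the length of the full pattern, so the matched text has to be recovered from the start positions and the known pattern length.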
#
This was a deliberate change for R 2.4.0 with SVN log:

r38145 | rgentlem | 2006-05-20 23:58:14 +0100 (Sat, 20 May 2006) | 2 lines
fixing gregexpr infelicity

So it seems the author of gregexpr believed that the bug was in 2.3.1, not 
2.5.1.

On Wed, 10 Oct 2007, dolanp at science.oregonstate.edu wrote:

'correctly' is a matter of definition, I believe: this could be considered 
to be vaguely worded in the help.

#
Yes, we had originally wanted it to find all matches, but user 
complaints that it did not behave as Perl does prevailed. 
There are different ways to do this, but the convention that one 
should not start looking for the next match until after the previous 
one ends seems to be more common.  I consciously decided not to add a 
switch; instead we wrote something that does what we wanted and put it 
in the Biostrings package (from Bioconductor) as gregexpr2 (sorry, but 
only fixed = TRUE is supported, since that is all we needed).

best wishes
   Robert
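
For readers without Biostrings installed, the idea of overlapping literal (fixed = TRUE style) matching can be sketched in base R. The helper name find_overlapping_fixed is hypothetical and this is not the Biostrings implementation; it only illustrates checking every start position instead of resuming after the previous match:

```r
# Hypothetical helper (not the Biostrings code): return every start position
# at which the literal string `pattern` occurs in `text`, overlaps included.
find_overlapping_fixed <- function(pattern, text) {
  n <- nchar(pattern)
  last_start <- nchar(text) - n + 1L
  # An empty pattern or a pattern longer than the text has no matches.
  if (n == 0L || last_start < 1L) return(integer(0))
  starts <- seq_len(last_start)
  # Extract the length-n window at each start and compare it literally.
  windows <- substring(text, starts, starts + n - 1L)
  starts[windows == pattern]
}

find_overlapping_fixed("abab", "ababab")  # 1 3
find_overlapping_fixed("aa", "aaaa")      # 1 2 3
```

This compares one window per start position, which is fine for short strings; it makes no attempt at the efficiency a dedicated package function would aim for.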