
extracting a matched string using regexpr Possible BUG

6 messages · David Winsemius, steven mosher, Simon Urbanek

#
On May 6, 2010, at 2:21 AM, steven mosher wrote:

            
Except we both were using \\d rather than //d.

I believe that Steve is using R 2.11.0 but I am still using R 2.10.1  
(but with the release of an Hmisc upgrade I will convert soon.)
#
FWIW I don't think \d is a basic regexp, so I would expect the perl mode to work, and it does:
[1] "12345"
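
A minimal sketch of the kind of perl-mode call being described. The input string `x` is made up for illustration and is not the string from the original report:

```r
# Hypothetical input; not the string from the original report.
x <- "id: 12345 end"

# perl = TRUE switches sub() to PCRE, where \d is a digit class.
# The greedy .* backtracks until five consecutive digits can be captured.
res <- sub(".*(\\d{5}).*", "\\1", x, perl = TRUE)
print(res)
```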

Yet I agree that it should either fail (i.e. return the unmodified string) or return 12345.

Also note that the bug is locale-specific:

LANG=C R
[1] "12345"
[1] "12345"
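
One way to compare locales from inside a single R session, rather than restarting with LANG=C, might look like the sketch below. The input `x` is hypothetical, and the results are printed rather than assumed, since they depend on the R version and platform:

```r
# Hypothetical input; not the string from the original report.
x <- "id: 12345 end"

# Remember the current character-type locale so it can be restored.
old <- Sys.getlocale("LC_CTYPE")

# Default (non-perl) mode in the current locale.
res_current <- sub(".*(\\d{5}).*", "\\1", x)

# Same call after switching to the C locale.
invisible(Sys.setlocale("LC_CTYPE", "C"))
res_c <- sub(".*(\\d{5}).*", "\\1", x)

# Restore the original locale.
invisible(Sys.setlocale("LC_CTYPE", old))

print(c(current = res_current, C = res_c))
```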

Also note that this is not Mac-specific:
[1] "WWWWW"
Linux 2.6.32-trunk-amd64
[1] "en_US.UTF-8"


Cheers,
Simon
On May 6, 2010, at 6:54 AM, David Winsemius wrote:

            
#
Two Q's:
A) Is this supposed to happen with perl-mode?:

 > test<-"</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"
 >
 > sub(".*(\\d{5}).*", "\\1", test, perl=TRUE)
[1] "88958\nW</th><th>26m</th>"
 >
 > sub(".*([0-9]{5}).*", "\\1", test, perl=TRUE)
[1] "88958\nW</th><th>26m</th>"

Looks to me that a period is being improperly recognized.
On May 6, 2010, at 11:28 AM, Simon Urbanek wrote:

            
B) With regard to the default (which I read to be extended rather
than basic) vs. perl-like, the Extended section of the regex
documentation contains:

" Symbols \d, \s, \D and \S denote the digit and space classes and  
their negations."
David Winsemius, MD
West Hartford, CT
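
For comparison, digits can also be written in the default mode without the \d escape, using a bracket range or the POSIX class. This is only a sketch on a made-up string `x`; whether these forms sidestep the behaviour reported in this thread is an assumption, not something the thread confirms:

```r
# Hypothetical input; not the string from the original report.
x <- "id: 12345 end"

# Explicit character range.
res_range <- sub(".*([0-9]{5}).*", "\\1", x)

# POSIX character class, which must appear inside a bracket expression.
res_class <- sub(".*([[:digit:]]{5}).*", "\\1", x)

print(c(range = res_range, class = res_class))
```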
#
On May 6, 2010, at 11:50 AM, David Winsemius wrote:

            
Nope - perl does take EOL into account, so .* will match only up to the end of the line. For your purposes you want to enable the (?s) option, so you probably meant:
[1] "88958"
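
The exact call is not preserved in this archive. A sketch of what the (?s) version could look like, reusing the `test` string from the question (the variable names `no_dotall` and `with_dotall` are made up here):

```r
test <- "</tr><tr><th>88958</th><th>Abcdsef</th><th>67.8S</th><th>68.9\nW</th><th>26m</th>"

# Without (?s), "." does not match "\n" in perl mode, so the match stops
# at the newline and everything from there on is left in place.
no_dotall <- sub(".*(\\d{5}).*", "\\1", test, perl = TRUE)

# With the inline (?s) flag, "." also matches "\n", so the pattern
# consumes the whole string and only the captured digits remain.
with_dotall <- sub("(?s).*(\\d{5}).*", "\\1", test, perl = TRUE)

print(no_dotall)
print(with_dotall)
```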
Yes, you're right - extended is the default.

Cheers,
Simon