Do grep() and strsplit() use different regex engines?

Sat, Jul 11, 2015 3:07 PM

David/Jeff:

Thank you both.

You seem to confirm that my observation of an "infelicity" in
strsplit() is real. That is most helpful.

I found nothing surprising in the code from David's second message.
That is, the splits shown there conform to what I would expect from
"\\b", but not to what I originally showed and David enlarged upon in
his first message. I still don't really get why a split should occur
at every letter.

Jeff may very well have found the explanation, but I have not gone
through his code.

If the infelicities noted (are there more?) by David and me are not
really bugs -- and I would be frankly surprised if they were -- I
would suggest that perhaps they deserve mention in the strsplit() man
page. Something to the effect that "\b and \< should not be used as
split characters..." .
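
For anyone who wants the word-and-gap pieces without using a zero-width pattern as the split, here is a minimal sketch of one alternative: match the pieces themselves with gregexpr() and extract them with regmatches(). This is not something proposed in the thread. The helper name split_words_and_gaps is hypothetical, and the bracket classes are the spelled-out equivalents of \w and \W.

```r
# Hypothetical helper (not from the thread): return runs of word characters
# and runs of non-word characters as separate pieces, in their original order.
split_words_and_gaps <- function(x) {
  # Every match consumes at least one character, so no zero-length
  # matches are involved.
  pieces <- gregexpr("[[:alnum:]_]+|[^[:alnum:]_]+", x)
  regmatches(x, pieces)
}

res <- split_words_and_gaps("red green")
print(res)
```

For "red green" this yields the three pieces "red", " " and "green", with the separator kept as its own element.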

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
   -- Clifford Stoll


On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius <dwinsemius at comcast.net> wrote:

On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:

I noticed the following:

strsplit("red green","\\b")

[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"

After reading the ?regex help page, I didn't understand why `\b` would split within sequences of "word"-characters, either. I expected this to be the result:

[[1]]
[1] "red"  " "  "green"

There is a warning in that paragraph: "(The interpretation of 'word' depends on the locale and implementation.)"

I got the expected result with only one of "\\>" and "\\<":

strsplit("red green","\\<")

[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"

strsplit("red green","\\>")

[[1]]
[1] "red"    " green"

The result with "\\<" seems decidedly unexpected.

I wondered whether the "original" regex documentation uses the same language as the R help page, so I went to the cited website and found:
=======
An assertion-character can be any of the following:

        <    Beginning of word
        >    End of word
        b    Word boundary
        B    Non-word boundary
        d    Digit character (equivalent to [[:digit:]])
        D    Non-digit character (equivalent to [^[:digit:]])
        s    Space character (equivalent to [[:space:]])
        S    Non-space character (equivalent to [^[:space:]])
        w    Word character (equivalent to [[:alnum:]_])
        W    Non-word character (equivalent to [^[:alnum:]_])
========

The word "word" appears nowhere else on that page.

strsplit("red green","\\W")

[[1]]
[1] "red"   "green"

Unlike the zero-width assertions, `\W` matches an actual non-word character (one byte wide here), so the " " character is consumed as the delimiter and discarded.
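
To make the behaviour of a consuming split pattern concrete, here is a small sketch. The second input string, "red, green", is made up for illustration and does not appear in the thread.

```r
# "\\W" consumes the single space, which is then dropped from the result.
one_gap <- strsplit("red green", "\\W")[[1]]

# With two adjacent non-word characters (comma and space), each one is a
# separate delimiter, so an empty string appears between them.
two_gaps <- strsplit("red, green", "\\W")[[1]]

# Adding "+" treats the whole run of non-word characters as one delimiter.
run_gap <- strsplit("red, green", "\\W+")[[1]]

print(one_gap)
print(two_gaps)
print(run_gap)
```

In each case the delimiter characters themselves are absent from the output, which is the difference from the result expected of "\\b".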

I would have thought that "\\b" should give what "\\W" did. Note that:

grep("\\bred\\b","red green")

[1] 1
## as expected

Does strsplit use a different regex engine than grep()? Or more
likely, what am I misunderstanding?
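
One hedged way to look into this without reading the C source is to ask regexpr() where the first match starts and how long it is. grepl() only has to report that a match exists, whereas a splitter also has to decide what to do with a match that is zero characters long. The sketch below only inspects matches; it does not claim to reproduce the internals of strsplit().

```r
x <- "red green"

# grepl() answers only "is there a match?"
has_red <- grepl("\\bred\\b", x)

# regexpr() also reports the start and length of the first match.
m_word <- regexpr("\\bred\\b", x)  # the three characters of "red"
m_gap  <- regexpr("\\W", x)        # the single space
m_edge <- regexpr("\\b", x)        # a bare word-boundary assertion

print(has_red)
print(c(start = as.integer(m_word), length = attr(m_word, "match.length")))
print(c(start = as.integer(m_gap),  length = attr(m_gap,  "match.length")))
print(c(start = as.integer(m_edge), length = attr(m_edge, "match.length")))
```

An assertion such as \b cannot consume characters, so any match for it on its own has length zero. That is the case a splitting routine has to handle specially.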

Thanks.

Bert


David Winsemius
Alameda, CA, USA
