Do grep() and strsplit() use different regex engines?
David/Jeff:

Thank you both. You seem to confirm that my observation of an "infelicity" in strsplit() is real. That is most helpful.

I found nothing surprising in the code in David's second message. That is, the splits shown conform to what I would expect from "\\b", but not to what I originally showed and David enlarged upon in his first message. I still don't really get why a split should occur at every letter. Jeff may very well have found the explanation, but I have not gone through his code.

If the infelicities noted (are there more?) by David and me are not really bugs -- and I would be frankly surprised if they were -- I would suggest that perhaps they deserve mention in the strsplit() man page. Something to the effect that "\b and \< should not be used as split characters...".

Bert Gunter

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom." -- Clifford Stoll

On Sat, Jul 11, 2015 at 11:05 AM, David Winsemius <dwinsemius at comcast.net> wrote:
On Jul 11, 2015, at 7:47 AM, Bert Gunter wrote:
I noticed the following:
strsplit("red green","\\b")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"
After reading the ?regex help page, I didn't understand why `\b` would split within sequences of "word" characters, either. I expected this to be the result:
[[1]]
[1] "red" " " "green"
There is a warning in that paragraph: "(The interpretation of 'word' depends on the locale and implementation.)"
I got the expected result with only one of "\\>" and "\\<":
strsplit("red green","\\<")
[[1]]
[1] "r" "e" "d" " " "g" "r" "e" "e" "n"
strsplit("red green","\\>")
[[1]]
[1] "red" " green"
The result with "\\<" seems decidedly unexpected.
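One possible explanation, offered as a sketch and not as a statement about the actual C source: the ?strsplit help page describes the algorithm as repeatedly finding a match, adding the text to its left to the output, and removing the match together with everything to its left. If one additionally assumes that an empty match at the very start of the remaining string causes a single character to be emitted (an assumption made here for illustration), the outputs above are reproduced. The function name split_sketch() is hypothetical.

```r
# Hypothetical emulation of the splitting loop described in ?strsplit.
# Assumption: an empty match at the start of the remaining string
# yields exactly one character as a token.
split_sketch <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x)
    if (m == -1) {
      # No match left: the rest of the string is the final token.
      out <- c(out, x)
      break
    }
    start <- as.integer(m)             # 1-based start of the match
    len   <- attr(m, "match.length")   # 0 for zero-width assertions
    end   <- start + len - 1           # equals start - 1 for an empty match
    if (end > 0) {
      # Emit the text left of the match, then drop the match and
      # everything to its left.
      out <- c(out, substr(x, 1, start - 1))
      x <- substr(x, end + 1, nchar(x))
    } else {
      # Empty match at the very start: emit a single character.
      out <- c(out, substr(x, 1, 1))
      x <- substr(x, 2, nchar(x))
    }
  }
  out
}

# Compare the sketch with strsplit() for the patterns discussed here.
for (p in c("\\b", "\\<", "\\>", "\\W")) {
  print(identical(split_sketch("red green", p),
                  strsplit("red green", p)[[1]]))
}
```

Under this reading, "\\b" and "\\<" both match with zero width at the start of whatever remains of the string, so one letter is peeled off per iteration, whereas "\\>" first matches after "red" and therefore splits off the whole word.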
I wondered whether the "original" regex documentation uses the same language as the R help page, so I went to the cited website and found:
=======
An assertion-character can be any of the following:
< - Beginning of word
> - End of word
b - Word boundary
B - Non-word boundary
d - Digit character (equivalent to [[:digit:]])
D - Non-digit character (equivalent to [^[:digit:]])
s - Space character (equivalent to [[:space:]])
S - Non-space character (equivalent to [^[:space:]])
w - Word character (equivalent to [[:alnum:]_])
W - Non-word character (equivalent to [^[:alnum:]_])
========
The word "word" appears nowhere else on that page.
strsplit("red green","\\W")
[[1]]
[1] "red" "green"
`\W` matches a non-word character of nonzero width (unlike the zero-width assertions above), so the " " character is consumed as the separator and discarded.
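If the aim is to keep the separator as its own element, one possible workaround (a sketch, not something proposed in this thread) is to extract alternating runs of word and non-word characters instead of splitting on a zero-width boundary. The variable names x and pieces are illustrative only.

```r
x <- "red green"

# Extract maximal runs of word characters or of non-word characters,
# using the POSIX classes that the table above gives for \w and \W.
pieces <- regmatches(x, gregexpr("[[:alnum:]_]+|[^[:alnum:]_]+", x))[[1]]
print(pieces)
```

This avoids zero-length matches altogether, so it does not depend on how strsplit() treats them.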
I would have thought that "\\b" should give what "\\W" did. Note that:
grep("\\bred\\b","red green")
[1] 1 ## as expected
Does strsplit() use a different regex engine than grep()? Or, more likely, what am I misunderstanding?
Thanks.
Bert

David Winsemius
Alameda, CA, USA