Regular expressions & large strings (PR#6617)
Prof Brian Ripley writes:
I was able to confirm the error on RH8.0 Linux and the segfault on Windows. Note that PCRE is not being used, and if you add perl=TRUE to your [g]sub calls you get correct results extremely fast.
Thanks for clarifying that; I hadn't realised.
The segfault is occurring in regexec, that is in the GNU regex code included in R. I am not clear it is worth spending any time on trying to find the problem in that code as
- you can use perl=TRUE as an alternative
- we will be replacing the GNU regex code in due course to cope with internationalization issues.
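For reference, the perl=TRUE workaround mentioned above can be sketched roughly as follows. The test string and patterns here are made up for illustration and are not the input from the original report.

```r
# Build a large string (1,000,000 characters) to stand in for the
# problem input; the content is illustrative only.
x <- paste(rep("abc def ", 125000), collapse = "")

# perl = TRUE routes the match through PCRE instead of the default
# regex engine.
y <- gsub("def", "xyz", x, perl = TRUE)
z <- sub("^abc", "ABC", x, perl = TRUE)
```

The same argument is accepted by both sub() and gsub(), so existing calls only need the extra perl = TRUE.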
Sounds fine. Do you think either of the following is worth
doing in the meantime?
- Add an strsplit() variant with PCRE (perhaps this
problem is related to PR#6601; and the speed might be
nice anyway).
- Add options(pcre) so the potentially bad code can be
avoided without explicitly setting perl=TRUE every time.
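
To make the second suggestion concrete, here is a minimal user-level sketch of what an option-driven default could look like. The wrapper name gsub_opt and the option name "pcre" are hypothetical; R defines neither, and this only approximates the idea without changing [g]sub itself.

```r
# Hypothetical wrapper: 'gsub_opt' and the option name "pcre" are
# illustrative only; R itself defines neither.
gsub_opt <- function(pattern, replacement, x,
                     perl = isTRUE(getOption("pcre")), ...) {
  # Fall back to the default engine unless the option is set to TRUE.
  gsub(pattern, replacement, x, perl = perl, ...)
}

old <- options(pcre = TRUE)               # opt in once for the session
res <- gsub_opt("(?<=a)b", "X", "abab")   # lookbehind requires PCRE
options(old)                              # restore the previous setting
```

An explicit perl argument still overrides the option, so individual calls can opt out.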
Mark <><