On 10/10/23 01:57, Michael Chirico via R-devel wrote:
This would be useful to package authors trying to validate input that is
supposed to be a valid regular expression.
As near as I can tell, the only way we can do so now is to run any
regex function and check for the warning and/or condition to bubble
up:
valid_regex <- function(str) {
  stopifnot(is.character(str), length(str) == 1L)
  !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
}
That's pretty hefty/inscrutable for such a simple validation. I see a
variety of similar approaches in CRAN packages [1], all slightly
different. It would be good for R to expose a "canonical" way to run
this validation.
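A minimal sketch of how such a helper could forward the engine options, assuming the pattern will later be used with the same perl/fixed settings. The name valid_regex2 is hypothetical and not part of base R; it only illustrates that the answer depends on how the pattern will be used.

```r
# Hypothetical helper (not part of base R): same trick as above, but the
# engine arguments are forwarded so the check matches the intended use.
valid_regex2 <- function(str, perl = FALSE, fixed = FALSE) {
  stopifnot(is.character(str), length(str) == 1L, !is.na(str))
  res <- tryCatch(
    grepl(str, "", perl = perl, fixed = fixed),
    warning = function(w) w,
    error = function(e) e
  )
  !inherits(res, "condition")
}

valid_regex2("a+")                # TRUE
valid_regex2("(")                 # FALSE: unbalanced parenthesis
valid_regex2("(", perl = TRUE)    # FALSE under PCRE as well
valid_regex2("(", fixed = TRUE)   # TRUE: fixed patterns are matched literally
```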
At root, the problem is that R does not expose the regex compilation
routines like 'tre_regcomp', so from the R side we have to resort to
hacky approaches.
Hi Michael,
I don't think you need compilation functions for that. If a regular
expression is found invalid by a specific third party library R uses, the
library should return an error to R and R should return an error to you,
and you should probably propagate that to your users. Grepping an empty
string might work in many cases as a test, but it is probably more portable
to simply be prepared to propagate such errors from the actual use on real
inputs. In theory, there could be some optimization for a particular case,
so the checking may not be the same as in the actual use, but the same
caveat would apply to a separate compilation step.
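The "propagate the error from the actual use" approach could look roughly like the sketch below. The function filter_names and its error wording are made up for illustration; the only assumption is that the regex function signals an ordinary R error for an invalid pattern.

```r
# Hypothetical package function: no separate validation step. The pattern
# is simply used on the real input, and any failure is re-signalled with
# some context for the user.
filter_names <- function(x, pattern) {
  stopifnot(is.character(x), is.character(pattern), length(pattern) == 1L)
  keep <- tryCatch(
    grepl(pattern, x),
    error = function(e) {
      stop("`pattern` could not be used as a regular expression: ",
           conditionMessage(e), call. = FALSE)
    }
  )
  x[keep]
}

filter_names(c("alpha", "beta", "gamma"), "^[ab]")   # "alpha" "beta"
try(filter_names(c("alpha", "beta"), "("))           # error mentioning `pattern`
```

Any warning the regex engine emits alongside the error is left alone here, so nothing from the library is hidden from the user.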
Things get slightly complicated by encoding/useBytes modes
(tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb,
tre_regncompb; all in tre.h), but all are already present in other
regex routines, so this is doable.
Re encodings: R strings should simply be valid in their encoding. This is
not just for regular expressions but also for anything else. You shouldn't
assume that R can handle invalid strings in any reasonable way. Definitely
you shouldn't try adding invalid strings in tests - behavior with invalid
strings is unspecified. To test whether a string is valid, there is
validEnc() (or validUTF8()). But, again, it is probably safest to propagate
errors from the regular expression R functions (in case the checks differ,
particularly for non-UTF-8). Also, duplicating the encoding checks can add
non-trivial overhead.
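For reference, a small sketch of the two validity checks mentioned above. The example strings are made up; the result of validEnc() for a string without a declared encoding depends on the session's native encoding, so no output is shown for it.

```r
# Example strings (made up): plain ASCII, a UTF-8 string, and a string
# containing the byte 0xff, which can never occur in valid UTF-8.
x <- c("plain ascii", "caf\u00e9", "bad \xff byte")

validUTF8(x)   # TRUE TRUE FALSE
validEnc(x)    # checks each string against its declared (or native) encoding
```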
If there was a strong need to have an automated way to classify errors
coming specifically from the regex libraries, perhaps R could attach some
classes to them when the library reports an error.
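Until something like that exists, a package could attach its own class, roughly as sketched below. The class name invalid_regex_error and the wrapper with_regex_class are hypothetical; base R does not signal such a class. Note that the wrapper classifies every error raised by the wrapped call, so it is only reasonable around a single regex function call.

```r
# Hypothetical wrapper: re-signal failures of a regex call as a classed
# error so that callers can handle them specifically.
with_regex_class <- function(expr) {
  tryCatch(expr, error = function(e) {
    cond <- structure(
      class = c("invalid_regex_error", "error", "condition"),
      list(message = conditionMessage(e), call = conditionCall(e))
    )
    stop(cond)
  })
}

res <- tryCatch(
  with_regex_class(grepl("(", "some input")),
  invalid_regex_error = function(e) "caught a regex error"
)
res   # "caught a regex error"
```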
Tomas
Exposing a function to compile regular expressions is common in other
languages, e.g. Go [2], Python [3], JavaScript [4].
[1] https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code
[2] https://pkg.go.dev/regexp#Compile
[3] https://docs.python.org/3/library/re.html#re.compile
[4] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp