FR: valid_regex() to test string validity as a regular expression

> Grepping an empty string might work in many cases...

That's precisely why a base R offering is important, as a surer way of
validating in all cases. To be clear I am trying to directly access the
results of tre_regcomp().

> it is probably more portable to simply be prepared to propagate such
> errors from the actual use on real inputs

That works best in self-contained calls -- foo(re) and we execute re inside
foo().

But the specific context where I found myself looking for a regex validator
is more complicated (https://github.com/r-lib/lintr/pull/2225). The user
supplies a regular expression in a configuration file; only "later" is it
actually passed to grepl().

Until now, we've followed your suggestion -- just surface the regex error at
run time. But our goal is to make it friendlier and fail earlier, at "compile
time" as the config is loaded, "long" before any regex is actually executed.
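
For illustration, a minimal sketch of what failing at config-load time could
look like with today's tools. The helper name validate_config_regexes() and
the assumption that the config yields a character vector of patterns are
hypothetical, not lintr's actual code; the check itself is the same
grepl()-on-an-empty-string workaround from the original post.

```r
# Hypothetical helper: check every pattern from a loaded config up front.
# `patterns` is assumed to be a character vector read from the config file.
validate_config_regexes <- function(patterns) {
  stopifnot(is.character(patterns))
  # A pattern counts as invalid if grepl() signals any condition for it.
  is_bad <- vapply(
    patterns,
    function(p) inherits(tryCatch(grepl(p, ""), condition = identity), "condition"),
    logical(1L)
  )
  if (any(is_bad)) {
    stop(
      "Invalid regular expression(s) in config: ",
      paste0("'", patterns[is_bad], "'", collapse = ", "),
      call. = FALSE
    )
  }
  invisible(patterns)
}

# Valid patterns are returned invisibly; an invalid one stops at load time.
validate_config_regexes(c(todo = "^TODO", fixme = "FIXME:?"))
```

This still relies on the workaround rather than on the compilation result
itself, which is the gap the feature request is about.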

At a bare minimum this is a good place to return a classed warning (say
invalid_regex_warning) to allow finer control than tryCatch(condition=).
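
To make the idea concrete, here is a rough sketch of how such a classed
warning might be emulated from package code today. The wrapper name
check_regex() is hypothetical, and base R does not currently signal an
invalid_regex_warning class; the sketch only shows the kind of targeted
handler the class would enable.

```r
# Hypothetical wrapper: re-signal any condition from grepl() as a warning
# carrying the (assumed) class "invalid_regex_warning".
check_regex <- function(pattern) {
  stopifnot(is.character(pattern), length(pattern) == 1L)
  cond <- tryCatch(grepl(pattern, ""), condition = identity)
  if (inherits(cond, "condition")) {
    warning(structure(
      class = c("invalid_regex_warning", "warning", "condition"),
      list(message = conditionMessage(cond), call = sys.call())
    ))
    return(invisible(FALSE))
  }
  invisible(TRUE)
}

# Callers can then handle just this class, leaving other warnings alone.
res <- tryCatch(
  check_regex("("),
  invalid_regex_warning = function(w) "caught"
)
```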

On Mon, Oct 9, 2023, 11:30 PM Tomas Kalibera <tomas.kalibera at gmail.com>
wrote:
On 10/10/23 01:57, Michael Chirico via R-devel wrote:

It will be useful to package authors trying to validate input which is
supposed to be a valid regular expression.

As near as I can tell, the only way we can do so now is to run any
regex function and check for the warning and/or condition to bubble
up:

valid_regex <- function(str) {
  stopifnot(is.character(str), length(str) == 1L)
  !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
}

That's pretty hefty/inscrutable for such a simple validation. I see a
variety of similar approaches in CRAN packages [1], all slightly
different. It would be good for R to expose a "canonical" way to run
this validation.

At root, the problem is that R does not expose the regex compilation
routines like 'tre_regcomp', so from the R side we have to resort to
hacky approaches.

Hi Michael,

I don't think you need compilation functions for that. If a regular
expression is found invalid by a specific third party library R uses, the
library should return an error to R and R should return an error to you,
and you should probably propagate that to your users. Grepping an empty
string might work in many cases as a test, but it is probably more portable
to simply be prepared to propagate such errors from the actual use on real
inputs. In theory, there could be some optimization for a particular case,
so the checking may not be the same - but the same goes for compilation
and checking.

Things get slightly complicated by encoding/useBytes modes
(tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb,
tre_regncompb; all in tre.h), but all are already present in other
regex routines, so this is doable.

Re encodings, R strings should simply be valid in their encoding. This is
not just for regular expressions but also for anything else. You shouldn't
assume that R can handle invalid strings in any reasonable way. Definitely
you shouldn't try adding invalid strings in tests - behavior with invalid
strings is unspecified. To test whether a string is valid, there is
validEnc() (or validUTF8()). But, again, it is probably safest to propagate
errors from the regular expression R functions (in case the checks differ,
particularly for non-UTF-8). Also, duplicating the encoding checks can be a
non-trivial overhead.
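
As a small illustration of the two validity checks mentioned above: the
byte sequence below is an assumed example of a string that is not valid
UTF-8. validUTF8() checks the bytes as UTF-8 regardless of the declared
encoding, whereas the result of validEnc() depends on each string's
declared encoding (and, for unmarked strings, on the session locale), so
no output is claimed for it here.

```r
# "caf" followed by a lone 0xE9 byte (Latin-1 e-acute): not valid UTF-8.
bad <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
good <- "cafe"

# Checks the raw bytes as UTF-8.
utf8_ok <- validUTF8(c(good, bad))

# Checks against the declared encoding; locale-dependent for unmarked strings.
enc_ok <- validEnc(c(good, bad))
```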

If there was a strong need to have an automated way to somehow classify
specifically errors from the regex libraries, perhaps R could attach some
classes to them when the library tells.

Tomas

Exposing a function to compile regular expressions is common in other
languages, e.g. Go [2], Python [3], JavaScript [4].

[1] https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code
[2] https://pkg.go.dev/regexp#Compile
[3] https://docs.python.org/3/library/re.html#re.compile
[4] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel