
[R-pkg-devel] [External] Re: UTF-8 and raw strings in package code

4 messages · Mark Bravington, Peter Dalgaard, luke-tierney at uiowa.edu

#
On Sun, Nov 30, 2025, at 13:10, luke-tierney at uiowa.edu wrote:
Fair enough. It might be easier than you suspect, though, since the parser already does the heavy lifting--- code below.

(i) If the file doesn't even parse, that's a more serious problem! 

(ii) If the file does parse OK, then AFAICS the only places that non-ASCII characters might be lurking are: (a) in comments, where they are somewhat grudgingly allowed IIRC; (b) in string literals, where we would like to allow them;  and of course (c) in symbols (variable names; see notes below), where we DON'T want them if it's a package. And this can all be checked easily from $parseData. My specimen function below does it in ~20 lines of "real" code.
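
To illustrate the three cases in (ii), here is a small sketch using the public utils::getParseData() rather than $parseData directly. The snippet text and the variable names (code, pd, leaf) are made up for the example, and the input is kept all-ASCII so that it behaves the same in any locale.

```r
# A one-line snippet containing a symbol, a string literal and a comment
code <- 'lingo <- "some text"  # a comment'

# keep.source = TRUE is needed for parse data to be recorded
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Terminal ("leaf") tokens only: these carry the actual source text
leaf <- pd[pd$terminal, c("token", "text")]
print(leaf)

# The three places where non-ASCII characters could turn up
is_symbol  <- grepl("SYMBOL", leaf$token, fixed = TRUE)
is_string  <- leaf$token == "STR_CONST"
is_comment <- leaf$token == "COMMENT"
```

Only the rows flagged by is_symbol would need to be ASCII under the proposal; the string and comment rows would be left alone.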

A couple of notes:

#1 I didn't realize that it is even possible to have a "normal" (ie non-backticked) variable name with non-ASCII letters (see ?Quotes, "Names and Identifiers"). And indeed I can run the following in my (Anglo) Windows RGUI:

français <- 'bon'

Crikey, that's actually scary... Anyway,  the intention is clearly to NOT allow that in package code, at least not yet.

#2 Should packages nevertheless be allowed to use backticked identifiers containing non-ASCII characters? (IME backticks are often used for funny names with all-ASCII characters but in the wrong places.) Personally I'd vote no, but it's well above my pay grade--- and there's no voting in R. Anyhow, my code below has an option to check/not-check backticked symbols.
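
For what it's worth, a backticked name can be recognised in the parse data because its text keeps the backticks. A minimal sketch, again with made-up all-ASCII input and hypothetical variable names:

```r
# Two assignments: one to a backticked name, one to a plain name
code <- "`odd name` <- 1; plain <- 2"
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Text of all terminal tokens whose type contains "SYMBOL"
symbols <- pd$text[pd$terminal & grepl("SYMBOL", pd$token, fixed = TRUE)]

# Backticked names appear with their backticks, plain names without
backy <- startsWith(symbols, "`") & endsWith(symbols, "`")
print(data.frame(symbols, backy))
```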

Is this likely to be acceptable? If so I'll try to submit a formal patch.

cheers
Mark


## My function:

check_ASCII_code_MVB <- function( 
    file, pp= NULL, check_backticks= FALSE
){
  # Checks that any non-ASCII UTF-8 characters are confined to 
  # string-literals & comments
  
  # Can directly supply results of previous parse(), for speed
  if( is.null( pp)){ # ... or, if not:
    pp <- try( parse( file=file, keep.source=TRUE, encoding='UTF-8'))
    if( inherits( pp, 'try-error')){
      warning( "Can't even parse, let alone check for non-ASCII")
      return( FALSE)
    }
  }
  
  # Get tokens of "leaf" (terminal) elements, and associated text
  # This mimics utils::getParseData()
  ppd <- pp |> attr( 'wholeSrcref') |> attr( 'srcfile') |>
    _$parseData |> attributes() |> _[ c( 'tokens', 'text')]
  
  symbols <- with( ppd, 
      text[ grepl( 'SYMBOL', tokens, fixed=TRUE)])
    
  if( !check_backticks){
    # Not obvious whether to allow UTF-8 in backticked names
    
    # AFAICS backticks can only occur both at start and end of a parsable symbol
    backy <- startsWith( symbols, r"{`}") & endsWith( symbols, r"{`}")
    symbols <- symbols[ !backy]
  }
  
  non_ASCII <- .Call( tools:::C_nonASCII, symbols)
  
  OK <- !any( non_ASCII)
  if( !OK){
    attr( OK, 'offending_symbols') <- unique( symbols[ non_ASCII])
  }
  return( OK)
}

## A snippet to save into a file, for testing. Note the raw string: irrelevant, but useful.

nonASCII_R <- r"--{
  français <- 'bon'
  `français` <- 'bon'
  lingo <- "français"
  # Nothing wrong with a bit of français in comments
}--" |> strsplit( '\n') |> _[[1]]

writeLines( nonASCII_R, <file of your choice>)


## Possible patch of tools::.check_package_ASCII_code :

.check_package_ASCII_code_patch <- function (
  dir, respect_quotes = FALSE
){
    if (!dir.exists(dir)) 
        stop(gettextf("directory '%s' does not exist", dir), 
            domain = NA)
    dir <- file_path_as_absolute(dir)
    wrong_things <- character()
    for (f in c(file.path(dir, "NAMESPACE"), list_files_with_type(file.path(dir, 
        "R"), "code", OS_subdirs = c("unix", "windows")))) {
## OLD        
        #text <- readLines(f, warn = FALSE)
        # if (.Call(C_check_nonASCII, text, respect_quotes)) 
## NEW        
        if( !check_ASCII_code_MVB( f))
            wrong_things <- c(wrong_things, f)
    }
    if (length(wrong_things)) {
        wrong_things <- substring(wrong_things, nchar(dir) + 
            2L)
        cat(wrong_things, sep = "\n")
    }
    invisible(wrong_things)
}
#
Maybe scary, but part of the R idiom is that plots, etc get auto-labeled with the name of the variables. If I want to do a child-vs-parents' income chart in Danish, it becomes "børn" and "forældre". And such names can be column names in datasets, etc. You can work around it but why should you?

So, for local usage, it is quite sensible to allow extended character sets. 

For packages (and other distributed materials) probably not so. It is probably the language and not actually the character set you want to restrict, though. 

-pd

2 days later
#
On Mon, 1 Dec 2025, Mark Bravington wrote:

You still have to handle it in a way that is consistent with the rest
of the checking process, which I believe means catching the error and
returning FALSE. I would use tryCatch for that.

It isn't quite right though: symbols can appear in a few other
places. Look at

     function(x = y) g(z = w)

I believe you are only picking up two of the five symbols you want.

Also you can simplify your code by using getParseData. I would also
avoid using the pipe operator since it isn't consistent with the
coding style in the file you are proposing to change.
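
A rough sketch of how those suggestions might combine: tryCatch around parse(), getParseData() for the tokens, and no pipes. The function name check_symbols_ascii is hypothetical, and the byte-level test is only a stand-in for the internal tools:::C_nonASCII routine used in the original, so treat this as an illustration, not as the proposed patch.

```r
# Hypothetical helper, not part of 'tools':
# TRUE if every checked symbol is pure ASCII, FALSE otherwise
check_symbols_ascii <- function(file, check_backticks = FALSE) {
  # Catch parse errors and report failure, rather than stopping
  pp <- tryCatch(
    parse(file = file, keep.source = TRUE, encoding = "UTF-8"),
    error = function(e) NULL
  )
  if (is.null(pp)) return(FALSE)

  pd <- utils::getParseData(pp)
  if (is.null(pd)) return(TRUE)  # no parse data available, nothing to check

  # Text of all terminal tokens whose type contains "SYMBOL"
  symbols <- pd$text[pd$terminal & grepl("SYMBOL", pd$token, fixed = TRUE)]
  if (!check_backticks) {
    backy <- startsWith(symbols, "`") & endsWith(symbols, "`")
    symbols <- symbols[!backy]
  }

  # Assumption: "non-ASCII" is taken to mean "contains a byte above 0x7F"
  non_ascii <- vapply(
    symbols,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1)
  )
  !any(non_ascii)
}
```

It would be called per file, e.g. check_symbols_ascii(f) inside the loop over package files, in place of the existing readLines() check.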

It is worth putting together a clean and well-tested patch that can be
easily reviewed and tested by others. There are folks who spend much
more time than I do on the QC code and may see reasons why going down
this road is a bad idea, or how to do this better, but we'll see.

Best,

luke

3 days later
#
Thanks, that's helpful. I'll tidy up and submit the patch and que sera, sera.

FWIW my code does already trap parse-errors, and in fact it will check all 5 "symboloids" in your snippet function(x = y) g(z = w). The $parseData reports SYMBOL, SYMBOL_SUB, SYMBOL_FUNCTION_CALL, SYMBOL_FORMALS for the different cases, which are all caught by grepl().
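
The point about token types can be checked directly. A small sketch using utils::getParseData() on the snippet in question (the variable names code, pd, leaf and sym are only for the example):

```r
code <- "function(x = y) g(z = w)"

# keep.source = TRUE is needed for parse data to be recorded
pd <- utils::getParseData(parse(text = code, keep.source = TRUE))

# Keep terminal tokens, then those whose type contains "SYMBOL"
leaf <- pd[pd$terminal, c("token", "text")]
sym <- leaf[grepl("SYMBOL", leaf$token, fixed = TRUE), ]
print(sym, row.names = FALSE)
```

All five names should show up, with the token column distinguishing the formal argument, the function call, the named argument and the two plain symbols.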

I initially did use getParseData() but then realized it wasn't adding anything (except time). Anyway, I'll leave both options in there, de-piped, for people to consider.

cheers
Mark
On Thu, Dec 4, 2025, at 09:20, luke-tierney at uiowa.edu wrote: