Non-ASCII citation keys prevent compiling with LC_ALL=C

3 messages · Kurt Hornik, Ivan Krylov

#
Hello R-devel,

I've been watching the development of automatic Rd bibliography
generation with great interest and I'm looking forward to using
\bibcitet{...} and \bibshow{*} in my packages. Currently, non-ASCII
characters used in the citation keys prevent R from successfully
compiling when the current locale encoding is unable to represent them:

% touch src/library/stats/man/factanal.Rd && LC_ALL=C make
...
installing parsed Rd
make[3]: Entering directory '.../src/library'
  base
Error: factanal.Rd:99: (converted from warning) Could not find
bibentries for the following keys: %s
  'R:J<U+00F6>reskog:1963'
Execution halted
make[3]: *** [Makefile:76: stats.Rdts] Error 1

But as long as the locale encoding can represent the key, it's fine:

% touch src/library/stats/man/factanal.Rd && \
 LC_ALL=en_GB.iso885915 luit make
(works well without a UTF-8 locale)
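The distinction between the two runs above can be checked outside of R. What follows is only a rough sketch: representable_in is a hypothetical helper (not part of R or its build system), and it assumes an iconv utility that exits with a non-zero status when a character cannot be converted to the target charset.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the UTF-8 text in $1 can be converted
# to the charset named in $2. Assumes iconv fails on unconvertible input.
representable_in() {
    printf '%s' "$1" | iconv -f UTF-8 -t "$2" >/dev/null 2>&1
}

# The key from the error message above, written with octal escapes for
# the UTF-8 bytes of o-umlaut so that this script itself stays ASCII.
key=$(printf 'R:J\303\266reskog:1963')

# Report, per charset, whether the key survives the conversion.
for charset in ASCII ISO-8859-15 UTF-8; do
    if representable_in "$key" "$charset"; then
        echo "$charset: representable"
    else
        echo "$charset: not representable"
    fi
done
```

This only mirrors the symptom (the key fits ISO-8859-15 but not ASCII); it says nothing about how R itself performs the conversion.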

I think this can be made to work by telling tools:::process_Rd() ->
tools:::processRdChunk() to parse character strings in R code as UTF-8:

Index: src/library/tools/R/RdConv2.R
===================================================================
--- src/library/tools/R/RdConv2.R	(revision 88617)
+++ src/library/tools/R/RdConv2.R	(working copy)
@@ -229,8 +229,8 @@
 	code <- structure(code[tags != "COMMENT"],
 	                  srcref = codesrcref) # retain for error locations
 	chunkexps <- tryCatch(
-	    parse(text = sub("\n$", "", as.character(code)),
-	          keep.source = options$keep.source),
+	    parse(text = sub("\n$", "", enc2utf8(as.character(code))),
+	          keep.source = options$keep.source, encoding = "UTF-8"),
 	    error = function (e) stopRd(code, Rdfile, conditionMessage(e))
 	)
 
That enc2utf8() may be extraneous, since tools::parse_Rd() is
documented to convert text to UTF-8 while parsing. The downsides are,
of course, parse(encoding=...) not working with MBCS locales and the
ever-present danger of breaking some user code that depends on the
current behaviour (this was tested using 'make check-devel', not on
CRAN packages).

Should R compile under LC_ALL=C? Maybe it's time for people whose
builds are failing to switch the continuous integration containers from
C to C.UTF-8?
#
Thanks! :-)
Oh dear.  I thought we had coverage for this from building daily
snapshots with LC_ALL=C, but apparently not.  There were 10 non-ASCII
keys so far: I have for now changed them to all ASCII.

Clearly, when a package declares its Rd files to be in UTF-8, one
would expect that Sexpr macros can also take UTF-8, but that's not so
simple given that it involves calling the R parser.  Your suggested
change looks good to me: non-UTF-8 MBCS locales have a problem with
parse(encoding = "UTF-8"), but I don't think we have real coverage for
these.

(Afaic, in principle, it might be nice to make these "work" via writing
to a tempfile, parsing from there with re-encoding, and at the end running
enc2utf8() on all strings obtained, but that's not so simple ...)

Anyway, need to discuss this a bit more within R Core.  For now, things
"work" again with LC_ALL=C.  

(My regular checks use C.UTF-8, but I am not sure how universally
available this is?)

Best

        
1 day later
#
On Sun, 17 Aug 2025 08:01:04 +0200
Kurt Hornik <Kurt.Hornik at wu.ac.at> wrote:

            
Thank you very much, this fixes my problem!
'locale -a' says that C.UTF-8 is available with glibc and musl on
Linux, also FreeBSD and OpenBSD, but not macOS (and setlocale(LC_ALL,
"C.UTF-8") indeed fails on the latter).
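
Given that availability, a build script that has to run on several platforms could probe for a UTF-8 locale instead of hard-coding one. The following is a minimal sketch: pick_build_locale is a hypothetical helper, and falling back to en_US.UTF-8 is an assumption for illustration, not something reported in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read a `locale -a` style listing on stdin and print
# the first preferred UTF-8 locale it contains. Both spellings of the
# codeset (UTF-8 and utf8) are tried. The en_US fallback is an assumption.
pick_build_locale() {
    listing=$(cat)
    for want in C.UTF-8 C.utf8 en_US.UTF-8 en_US.utf8; do
        # -x matches whole lines only, -F treats the name as a fixed string
        if printf '%s\n' "$listing" | grep -qxF "$want"; then
            printf '%s\n' "$want"
            return 0
        fi
    done
    return 1   # nothing suitable was listed
}

# Show what would be picked on the current system.
locale -a 2>/dev/null | pick_build_locale ||
    echo "no UTF-8 locale found" >&2
```

A build could then run something like `loc=$(locale -a | pick_build_locale) && LC_ALL=$loc make`, which leaves the environment untouched when no candidate is found.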