
Non-ASCII chars in R code

2 messages · Brian Ripley

The report on R-help about problems loading package irr (apparently in a 
UTF-8 locale) prompted me to look a little deeper.  There are 
quite a few packages with Latin-1 chars in their .R files, and a couple 
with UTF-8 chars.

Except where the non-ASCII chars are confined to comments, this is a 
problem, as the code concerned cannot be represented in some locales R 
runs in (for example Japanese on Windows).  It happens that irr is so 
small that lazy-loading is not used, but when lazy-loading or a saved 
image is used, the locale in use when the package is installed determines 
how the code is parsed.  That locale may not be the same as the one in use 
when the package is loaded: indeed it is not uncommon on Linux/Unix 
systems for different users to use different locales.
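
To illustrate why the locale matters, here is a minimal sketch (not taken 
from any of the packages concerned) showing that the same bytes are a 
valid word in Latin-1 but not valid UTF-8.  The example string is made up, 
and validUTF8() needs a much more recent R than the one under discussion.

```r
# The four bytes of "caf<e9>": e-acute encoded as Latin-1 (illustrative only)
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))

# Read as UTF-8, the lone byte 0xe9 is malformed
is_valid_utf8 <- validUTF8(x)

# Read as Latin-1 and re-encoded, e-acute becomes the two bytes c3 a9
utf8_bytes <- charToRaw(iconv(x, from = "latin1", to = "UTF-8"))
```

So a .R file written in Latin-1 is parsed correctly only if the locale at 
install time treats those bytes as Latin-1.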

This means that using non-ASCII chars is not portable, and I've added code 
to R CMD check in R-devel to warn about such usage.  In the examples I 
have investigated the usages have been

- messages in a non-English language, typically French.
- startup messages with people's names.
- use of characters that I can only guess are intended to be in the
   WinAnsi encoding, e.g. a copyright symbol.

The only reason I have not made this an error is that people might want to 
produce packages for a known locale, e.g. a student class, but perhaps it 
should be an error for packages submitted to CRAN.
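
For anyone wanting to check their own code, a test of this kind can be 
sketched as below.  This is not the code added to R CMD check; 
has_non_ascii() is a hypothetical helper and the sample lines are made up. 
It just flags lines containing any byte above 0x7f.

```r
# Hypothetical helper: TRUE for each line containing a byte >= 0x80
has_non_ascii <- function(lines) {
  vapply(lines,
         function(line) any(as.integer(charToRaw(line)) > 127L),
         logical(1), USE.NAMES = FALSE)
}

code <- c(
  'x <- 1',
  'msg <- "caf\\u00e9"',                                # escape is all ASCII
  rawToChar(as.raw(c(0x23, 0x20, 0x63, 0x61, 0x66, 0xe9)))  # "# caf<e9>"
)

flags <- has_non_ascii(code)   # FALSE FALSE TRUE
```

Note that the second line passes: a \u escape written out in the source is 
itself pure ASCII, whatever it expands to when parsed.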

I do not believe there is much we can do about this: messages which are 
not entirely in ASCII cannot be displayed on many R platforms and it seems 
incorrect to allow French messages and not Japanese ones.

The packages currently throwing warnings are

FactoMineR FunCluster JointGLM LoopAnalyst Sciviews ade4 adehabitat ape 
climatol crossdes deal grasper irr lsa mvrpart pastecs sn surveillance 
truncgof
1 day later
A little more digging revealed a Unix/Windows discrepancy here.

On Unix, saving images and preparing for lazyloading/lazydata is done with 
LC_ALL=C: on Windows with LC_COLLATE=C.  I will change Windows to match.
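
The difference can be seen from within R; the following is only a sketch 
of the two settings, not the code used by the install scripts.  Setting 
LC_COLLATE alone leaves LC_CTYPE (which governs how bytes are interpreted 
as characters) untouched, whereas LC_ALL changes it too.

```r
# Remember the current settings so they can be restored afterwards
cats <- c("LC_COLLATE", "LC_CTYPE", "LC_MONETARY", "LC_TIME")
old  <- vapply(cats, Sys.getlocale, character(1))

# The Windows behaviour described above: only collation is changed
invisible(Sys.setlocale("LC_COLLATE", "C"))
ctype_after_collate <- Sys.getlocale("LC_CTYPE")   # same as before

# The Unix behaviour: every category, including LC_CTYPE, becomes "C"
invisible(Sys.setlocale("LC_ALL", "C"))
ctype_after_all <- Sys.getlocale("LC_CTYPE")

# Restore the original settings category by category
for (cat in cats) invisible(Sys.setlocale(cat, old[[cat]]))
```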

Unfortunately, how the C locale is implemented is OS-dependent.  Strictly 
it should not allow bytes 0x80 to 0xff, but it does on some OSes (including 
Windows).  So the strict consequences of this should be that, when using
lazy-loading or a saved image:

- all names have to be ASCII alphanumeric
- \uxxxx sequences are not allowed except \u007f and lower (they are not
   valid at all in a C locale prior to 2.3.1 so I would not expect to see
   them in a package).
- bytes in character strings are copied byte for byte.

This leaves an inconsistency between packages which use lazy-loading / 
save image and those which do not.  We could resolve that by switching to 
the C locale when loading R code in packages (or, better, R code that was 
not a loader stub): I didn't think that would be worthwhile but in fact 5 
of the packages listed are small enough not to be lazy-loaded.

The other consequence is that the only way for a package to have 
object names which are not ASCII alphanumeric is to disable lazy-loading.
One possibility is to allow a package to specify its required locale for 
loading in the DESCRIPTION file, and make use of that.

I am inclined to do nothing about these issues unless people have an 
actual need to have packages tailored to a non-English locale.
On Wed, 17 May 2006, Prof Brian Ripley wrote: