Small encoding question

6 messages · Vincent Goulet, Kurt Hornik, Brian Ripley +1 more

#
Dear developeRs,

Compilation of the latest version (0.9-5) of my actuar package fails  
with r-release MacOS_X ix86 on CRAN; see

	http://www.R-project.org/nosvn/R.check/r-release-macosx-ix86/actuar-00check.html

All errors come from accented letters in comments in latin-1 encoded  
files (except hierarc.R which is in UTF-8, my bad). Encoding is  
declared as latin-1 in DESCRIPTION.

The package checks and compiles fine on Windows, Linux and,  
ironically, my MacOS X main development machine. I realize using non- 
ASCII characters in source files is not a good idea and I removed  
them, but I would appreciate any clue as to what went wrong with the  
compilation on CRAN.
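
For readers who want to locate such characters in their own sources, here is a minimal sketch in R. It is not part of the original message: `find_non_ascii` is a hypothetical helper name, and the three-line `src` vector stands in for the contents of a real file. It works on raw bytes, so the result does not depend on the locale R is running in.

```r
# Hypothetical helper: return the indices of lines that contain
# at least one byte outside the 7-bit ASCII range.
find_non_ascii <- function(lines) {
  has_high_byte <- vapply(
    lines,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  which(has_high_byte)
}

# Made-up example: the second line has a latin-1 e-grave (byte 0xE8) in a comment.
src <- c("x <- 1", "# param\xe8tre", "y <- 2")
find_non_ascii(src)  # returns 2
```

On a real file one would presumably read the lines with readLines() first; that step is left out to keep the sketch self-contained.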

FWIW,

 > sessionInfo()
R version 2.6.2 (2008-02-08)
i386-apple-darwin8.10.1

locale:
fr_CA.UTF-8/fr_CA.UTF-8/fr_CA.UTF-8/C/fr_CA.UTF-8/fr_CA.UTF-8

attached base packages:
[1] stats     utils     datasets  grDevices graphics  methods   base

other attached packages:
[1] CarbonEL_0.1-4

loaded via a namespace (and not attached):
[1] tools_2.6.2

Thanks in advance!

---
   Vincent Goulet, Associate Professor
   École d'actuariat
   Université Laval, Québec
   Vincent.Goulet at act.ulaval.ca   http://vgoulet.act.ulaval.ca
#
I assume that the MacOS X builds are done in a C locale?

Best
-k

        
#
On Feb 14, 2008, at 2:45 PM, Kurt Hornik wrote:

            
Yes - but isn't this very similar to the problem we were talking
about a while back? The check analyses were reporting an error
although the code was fine. (I think it boiled down to text connection
I/O in the check scripts failing mysteriously because it was using
the wrong encoding.) I'll have to check later today ...

Cheers,
S
#
On Thu, 14 Feb 2008, Simon Urbanek wrote:

            
That was my first thought, but it worked in a C locale for me, even on Mac 
OS X.  But then we know there are C locales and C locales ....

I think R-devel is somewhat less prone to such issues, and it was R-devel 
I checked.

  
    
#
I think I found the cause, but fixing it may be more complicated  
(other than a hot fix for this particular case).

What it boils down to is that the code for .check_package_code_syntax  
is trying to change the locale in a manner that doesn't work. In  
addition to that, the output of l10n_info() is wrong (for some  
definition of wrong), which complicates things even further.

To top it all, if run in a UTF-8 locale, everything is just fine -
that's why the package will pass check on "regular" OS X, because
a UTF-8 locale has been the default since Leopard.

.check_package_code_syntax() sees that the source requires Latin1, so
it checks whether the locale is UTF-8. It is not (because we force C),
so it uses en_US. This may be the first problem, because en_US is not
necessarily a latin1 locale at all (en_US.ISO8859-1 would be latin1 on
OS X). However, the next problem is that l10n_info() reports
`Latin-1` as FALSE even for the (correct) latin1 locale and
consequently(?) the reading fails.

ginaz:~$ echo 'Sys.getlocale(); l10n_info()'|LANG=en_US.ISO8859-1 R --vanilla --slave
[1] "en_US.ISO8859-1/en_US.ISO8859-1/en_US.ISO8859-1/C/en_US.ISO8859-1/en_US.ISO8859-1"
$MBCS
[1] FALSE

$`UTF-8`
[1] FALSE

$`Latin-1`
[1] FALSE

en_US.ISO8859-1 *is* a latin-1 locale ... I looked hard and found
no way to link (installed) locales to encodings - there is no
official mapping and POSIX allows arbitrary locales (and names).
Hence all locale names are merely loose conventions, so I'm not sure
how R can even make such a decision (other than by parsing the name?).
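
To illustrate what "parsing the name" could look like, here is a rough sketch. It assumes the common language[_territory][.codeset][@modifier] naming convention, which (as noted above) nothing guarantees, and `locale_codeset` is a hypothetical helper, not something R provides.

```r
# Hypothetical helper: pull the codeset part out of a locale name,
# assuming the conventional language[_territory][.codeset][@modifier] layout.
locale_codeset <- function(locale) {
  has_codeset <- grepl(".", locale, fixed = TRUE)
  codeset <- sub("^[^.]*\\.", "", locale)  # drop everything up to the first dot
  codeset <- sub("@.*$", "", codeset)      # drop an optional @modifier
  # Names without a dot carry no codeset information at all
  ifelse(has_codeset, codeset, NA_character_)
}

locale_codeset(c("en_US.ISO8859-1", "fr_CA.UTF-8", "C", "en_US"))
# returns "ISO8859-1" "UTF-8" NA NA
```

The NA results for "C" and "en_US" show the limitation: a bare name says nothing about the encoding, which is the en_US problem described above.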

Anyway - a quick fix would be to force the en_US.UTF-8 locale in that
check for Mac OS X, but I think that doesn't fix the underlying
problems ...

Cheers,
Simon
On Feb 14, 2008, at 3:09 PM, Simon Urbanek wrote:

            
#
Have you set R_ENCODING_LOCALES?  That's how you tell R what locale to use 
for latin1 and UTF-8 when checking.  Details in R-exts.texi.
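
As a hedged illustration only: the sketch below assumes the variable holds a colon-separated list of encoding=locale pairs. That format is an assumption and should be checked against R-exts. The locale names in `spec` are likewise just examples taken from this thread. The code only splits such a string into a lookup table; it does not set anything.

```r
# Assumed format (verify against R-exts): "encoding=locale" pairs joined by ":".
# The value below is an example, not a recommended setting.
spec <- "latin1=en_US.ISO8859-1:UTF-8=en_US.UTF-8"

# Split into pairs, then into encoding / locale halves
pairs <- strsplit(strsplit(spec, ":", fixed = TRUE)[[1]], "=", fixed = TRUE)
locales <- setNames(
  vapply(pairs, `[`, "", 2),  # second half of each pair: the locale name
  vapply(pairs, `[`, "", 1)   # first half of each pair: the encoding
)

locales[["latin1"]]  # "en_US.ISO8859-1"
```

The variable would have to be set in the environment of the process running the check, for example in the shell before R CMD check is started.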

As it works for me in 'C' on Leopard with R-devel without setting this, I 
can't reproduce the problem to check if setting works.

For l10n_info, it is asking the nl_langinfo system.  Looks like Darwin 
is using unusual charset names: it reports ISO8859-1 and we are 
looking for (the more correct) ISO-8859-1: I've 'hot fixed' that.
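
The kind of mismatch described here can be sketched in R as follows. This is not the actual fix, which presumably lives in R's C code that reads the nl_langinfo result. `is_latin1_codeset` and its list of accepted spellings are hypothetical, chosen only to show a comparison that tolerates both spellings.

```r
# Hypothetical helper: decide whether a codeset name denotes Latin-1,
# accepting both the "ISO8859-1" and "ISO-8859-1" spellings.
is_latin1_codeset <- function(codeset) {
  # Remove hyphens and underscores and fold case before comparing
  key <- toupper(gsub("[-_]", "", codeset))
  key %in% c("ISO88591", "LATIN1")
}

is_latin1_codeset(c("ISO8859-1", "ISO-8859-1", "UTF-8"))
# returns TRUE TRUE FALSE
```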
On Thu, 14 Feb 2008, Simon Urbanek wrote: