
encoding question again

Brian,
On Dec 29, 2007, at 12:28 PM, Prof Brian Ripley wrote:

            
I was thinking about this before, but I don't have a good solution.  
The problem is that there are many places that may be affected.
In particular, all callbacks assume UTF-8, and since in R they are passed
as char *, they cannot be flagged. It is unfortunate, because JGR
actually facilitates the use of UTF-8 nicely (e.g. you can create  
Japanese annotated plots regardless of the Windows locale), but it  
cannot pass that ability to R (except silently and sort of  
incorrectly). It is, however, surprising how far you can get despite  
this conflict (basically it works nicely as long as you don't talk to  
the system). Once we force some conversion on callbacks, we lose that  
advantage, so I'm still not sure what the best solution is. One semi-fix
would be to take care of the latin1 locales and perform all
conversions there, because they are so limited that users
working in latin1 locales don't expect anything fancy to work anyway :).

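
As a rough illustration of why the latin1 case is the easy one: every Latin-1 byte has the same value as its Unicode code point, so converting to UTF-8 needs no lookup tables. The sketch below shows this in C. The name latin1_to_utf8 is a hypothetical helper, not a function from R or JGR.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of R or JGR): convert a NUL-terminated
   Latin-1 string to UTF-8. Returns a malloc'd buffer that the caller
   must free, or NULL if allocation fails. */
static char *latin1_to_utf8(const char *in)
{
    size_t n = strlen(in);
    /* Each Latin-1 byte expands to at most two UTF-8 bytes. */
    char *out = malloc(2 * n + 1);
    if (!out)
        return NULL;

    char *p = out;
    for (size_t i = 0; i < n; i++) {
        unsigned char c = (unsigned char) in[i];
        if (c < 0x80) {
            /* ASCII is identical in both encodings. */
            *p++ = (char) c;
        } else {
            /* Code points 0x80..0xFF become a two-byte sequence. */
            *p++ = (char) (0xC0 | (c >> 6));
            *p++ = (char) (0x80 | (c & 0x3F));
        }
    }
    *p = '\0';
    return out;
}
```

For example, the Latin-1 byte 0xE9 (e with acute accent) becomes the two bytes 0xC3 0xA9. Other native encodings have no such shortcut and would need a real converter such as iconv.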
I agree. On the other hand, ideally there should be very little direct  
I/O in packages and even if it doesn't work in UTF-8, it won't make it  
unusable, just limited.

Most projects have adopted UTF-8 or Unicode as the native encoding. I think
we are on the right track (strings flagged with a known encoding), and in
the end we may end up using, let's say, UTF-8 internally and converting
only for system calls.
We may also end up supporting a similar concept (string+encoding) on  
the "edges" sooner or later: something like  
WriteConsoleWithEncoding(...) which could flag if possible instead of  
converting. Given that the embedding API needs some more  
consolidation, it may be a good time to tackle this as well. I'm  
hoping to do some cleanup and propose something as a part of the new  
ObjC API for R 2.7 and Mac GUI 2.0, so any input is welcome.
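
To make the string+encoding idea concrete, here is a minimal sketch in C of what a flagged console callback might look like. Only the name WriteConsoleWithEncoding comes from the text above. The signature, the console_enc_t enum, and the buffer-backed front end are all assumptions for illustration, not an existing R API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical encoding tags; not taken from any R header. */
typedef enum { ENC_UNKNOWN = 0, ENC_LATIN1, ENC_UTF8 } console_enc_t;

/* Hypothetical callback type: the text arrives together with its
   encoding flag instead of as a bare char *. */
typedef void (*write_console_with_encoding_fn)(const char *buf, size_t len,
                                               console_enc_t enc);

/* Stand-in for a UTF-8 native front end: it collects output in a buffer. */
static char console_buf[256];
static size_t console_len = 0;

static void append_byte(unsigned char c)
{
    /* Silently drop output once the buffer is full (sketch only). */
    if (console_len + 1 < sizeof console_buf)
        console_buf[console_len++] = (char) c;
    console_buf[console_len] = '\0';
}

/* A front end that displays UTF-8. Flagged UTF-8 passes through
   untouched; flagged Latin-1 is expanded to UTF-8. Unflagged text is
   passed through as-is, which is only correct if it happens to be
   ASCII or UTF-8 already. */
static void utf8_console_write(const char *buf, size_t len, console_enc_t enc)
{
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char) buf[i];
        if (enc == ENC_LATIN1 && c >= 0x80) {
            append_byte((unsigned char) (0xC0 | (c >> 6)));
            append_byte((unsigned char) (0x80 | (c & 0x3F)));
        } else {
            append_byte(c);
        }
    }
}

/* What an embedding application might register with the engine. */
static write_console_with_encoding_fn console_hook = utf8_console_write;
```

The point of the design is that the sender only flags and never converts, so a UTF-8 front end keeps the full text while a more limited front end can decide for itself how to degrade.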

Thanks,
Simon