
A question about the API mkchar()

2 messages · Fán Lóng, Simon Urbanek

Hi guys,


I've got a question about the API mkChar(). I have run into some
difficulty passing a UTF-8 string to mkChar() in R-2.7.0.



I was intending to pass a UTF-8 string str_jan (containing Japanese
characters such as ふ, whose UTF-8 encoding is E3 81 B5) to the R API
SEXP mkChar(const char *name); we only need to create the SEXP from
the string that we passed in.



Unfortunately, I found that when I pass the variable str_jan, R
automatically converts str_jan according to the current locale
setting, so the function works correctly only in an English locale;
under other locales, such as Japanese or Chinese, the string is
converted incorrectly. In fact, those UTF-8 bytes already represent a
Unicode string and should not need to be converted at all.



I also tried to use SEXP Rf_mkCharCE(const char *, cetype_t);,
passing CE_UTF8 as the cetype_t argument, but the result was worse:
it returned the result as a UCS code, a kind of Unicode representation
used on the Windows platform.



All I want is a SEXP object containing the original UTF-8 string, no
matter what locale is currently set. What should I normally do?





Thanks,

Long
On Oct 28, 2008, at 6:26, Fán Lóng wrote:

Hey guy :)
There is no mkchar() in R. Did you perhaps mean mkChar()?
There is no such "UTF-8" code. I'm not sure if you meant Unicode, but  
that would be \u3075 (Hiragana hu) for that character. The UTF-8  
encoding of that character is a three-byte sequence 0xe3 0x81 0xb5 if  
that's what you meant.
That is not true - it will be kept as-is regardless of the encoding.  
Note that mkChar(x) is equivalent to mkCharCE(x, CE_NATIVE); No  
conversion takes place when the string is created, but you have told R  
that it is in the native encoding. If that is not true (which in your
case it probably isn't), all bets are off since you're lying to R ;).
That is clearly nonsense, since the encoding has nothing to do with
the locale language itself (Japanese, Chinese, ..). We are talking
about the encoding (note that both English and Japanese locales can
use UTF-8 encoding, but don't have to). I think you'll need to get the
concepts right here - for each string you must define the encoding in
order to be able to reproduce the Unicode sequence that the string
represents. At this point it has nothing to do with the language.
Well, that's exactly what you want, isn't it? The string is correctly  
flagged as UTF-8 so R is finally able to find out what exactly is  
represented by that string. However, your locale apparently doesn't  
support such characters so it cannot be displayed. If you use a locale  
that supports it, it works just fine; for example, if you use a locale
with SJIS encoding, R will still know how to convert it from UTF-8 to
SJIS *for display*. The actual string is not touched.

Here is a small piece of code that shows you the difference between  
native encoding and UTF8-strings:

#include <R.h>
#include <Rinternals.h>

SEXP me(void) {
   const char c[] = "\xe3\x81\xb5";  /* UTF-8 bytes of U+3075 (Hiragana "hu") */
   SEXP a = PROTECT(allocVector(STRSXP, 2));
   SET_STRING_ELT(a, 0, mkCharCE(c, CE_NATIVE)); /* claims native encoding */
   SET_STRING_ELT(a, 1, mkCharCE(c, CE_UTF8));   /* correctly flagged UTF-8 */
   UNPROTECT(1);
   return a;
}

In a UTF-8 locale it doesn't matter:

ginaz:sandbox$ LANG=ja_JP.UTF-8 R
 > .Call("me")
[1] "ふ" "ふ"

But in any other, let's say SJIS, it does:

ginaz:sandbox$ LANG=ja_JP.SJIS R
 > .Call("me")
[1] "??" "ふ"

Note that the first string is wrong, because we have supplied UTF-8  
encoding but the current one is SJIS. The second one is correct since  
we told R that it's UTF-8 encoded.

Finally, if the character cannot be displayed in the given encoding:

ginaz:sandbox$ LANG=en_US.US-ASCII R
 > .Call("me")
[1] "\343\201\265" "<U+3075>"

The first one is wrong again, since it's not flagged as UTF-8, but the  
second one is exactly as expected - Unicode U+3075, which is the  
Hiragana "hu". It doesn't exist in US-ASCII, so the Unicode  
designation is all you can display.
mkCharCE(x, CE_UTF8);
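Putting it together, a minimal sketch of what the original question asked for - a SEXP holding the original UTF-8 bytes regardless of locale - might look like this (assuming the usual R C-extension build setup; the function name utf8_string is illustrative):

```c
#include <Rinternals.h>

/* Return a length-1 character vector holding U+3075, flagged as UTF-8
   so R preserves the original bytes no matter what the locale is. */
SEXP utf8_string(void) {
    const char c[] = "\xe3\x81\xb5";        /* UTF-8 bytes of U+3075 */
    SEXP a = PROTECT(allocVector(STRSXP, 1));
    SET_STRING_ELT(a, 0, mkCharCE(c, CE_UTF8));
    UNPROTECT(1);
    return a;
}
```

From R this would be called via .Call("utf8_string") after loading the compiled shared object; the encoding flag travels with the CHARSXP, so display conversion happens only when printing.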

Cheers,
Simon