[Rcpp-devel] String encoding (UTF-8 conversion)

5 messages · Jeroen Ooms, Dirk Eddelbuettel, Romain Francois

#
I'm interfacing a c++ library which assumes strings are UTF-8. However
strings from R can have various encodings. It's not clear to me how I
need to account for that in Rcpp. For example:

// [[Rcpp::export]]
std::string echo(std::string src){
  return src;
}

This program does not work on Windows for non-ASCII strings:

> test = "??"
> echo(test)
[1] "? ????

In C programs I always use translateCharUTF8 on all input to make sure
it is UTF8 before I start working with it:

  translateCharUTF8(STRING_ELT(x, i));

Similarly on the output, I explicitly set the encoding to let R know
that this is UTF8:

  SET_STRING_ELT(out, 0, mkCharCE(olds, CE_UTF8));

This ensures that code works across platforms and locales. How do we
go about this in Rcpp?
#
On 11 December 2014 at 12:24, Jeroen Ooms wrote:
| I'm interfacing a c++ library which assumes strings are UTF-8. However
| strings from R can have various encodings. It's not clear to me how I
| need to account for that in Rcpp. For example:
| 
| // [[Rcpp::export]]
| std::string echo(std::string src){
|   return src;
| }
| 
| This program does not work on windows for non-ascii strings:
| 
| > test = "??"
| > echo(test)
| [1] "? ????
| 
| In C programs I always use translateCharUTF8 on all input to make sure
| it is UTF8 before I start working with it:
| 
|   translateCharUTF8(STRING_ELT(x, i));
| 
| Similarly on the output, I explicitly set the encoding to let R know
| that this is UTF8:
| 
|   SET_STRING_ELT(out, 0, mkCharCE(olds, CE_UTF8));
| 
| This ensures that code works across platforms and locales. How do we
| go about this in Rcpp?

Maybe the same way?  ;-) 

A valid C expression is almost always a valid C++ expression. I haven't
needed this.  But as I recall, Romain did work with wchar for some project so
he may have a hint or two for you.

Dirk
#
On Thu, Dec 11, 2014 at 2:16 PM, Dirk Eddelbuettel <edd at debian.org> wrote:
OK I am now completely back at the C API, and it still doesn't work
when called via Rcpp:

// [[Rcpp::export]]
SEXP echo(SEXP src){
  const char* utf8str = Rf_translateCharUTF8(Rf_asChar(src));
  SEXP out = PROTECT(Rf_allocVector(STRSXP, 1));
  SET_STRING_ELT(out, 0, Rf_mkCharCE(utf8str, CE_UTF8));
  UNPROTECT(1);
  return out;
}
[1] "??"
[1] "? ????"
[1] "? ????"

Also:
[1] "Zürich"
[1] "ZÃ¼rich"

The same function works perfectly fine when invoked via .Call(). Does
Rcpp somehow override the CE or attempt to recode strings before
giving them back to R?
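
One plausible reading of the second output line above, where a single non-ASCII character comes back as two, is that bytes which are already UTF-8 get treated as native (Latin-1 or CP1252) and are converted once more. That is an assumption, not something confirmed in this thread. The standalone sketch below reproduces the effect without R; latin1_to_utf8 and demo_double_encoding are hypothetical helper names, not part of R or Rcpp.

```cpp
#include <cassert>
#include <string>

// Re-encode a Latin-1 (ISO-8859-1) byte string as UTF-8. Each Latin-1 byte
// is the code point of the same value, so bytes >= 0x80 become two bytes.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                  // ASCII, unchanged
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // lead byte
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // continuation byte
        }
    }
    return out;
}

// Hypothetical demo, not code from this thread.
void demo_double_encoding() {
    const std::string latin1 = "Z" "\xFC" "rich";     // "Zürich" in Latin-1
    const std::string utf8 = latin1_to_utf8(latin1);   // correct UTF-8
    assert(utf8 == std::string("Z" "\xC3\xBC" "rich"));

    // Mistake: treat bytes that are already UTF-8 as Latin-1 and convert again.
    const std::string twice = latin1_to_utf8(utf8);
    assert(twice == std::string("Z" "\xC3\x83\xC2\xBC" "rich"));  // renders as "ZÃ¼rich"
}
```

If this is what happens, the string is correct when it leaves the function and is damaged by a later conversion that ignores the CE_UTF8 mark.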
4 days later
#
On Thu, Dec 11, 2014 at 12:24 PM, Jeroen Ooms <jeroen.ooms at stat.ucla.edu> wrote:
Follow-up on this: from what I have found, there is currently no
string type that is unambiguous across platforms and locales (other
than the actual STRSXP). If the native locale uses UTF8 then all is
fine, but we cannot assume that in R. Here is a little script that
illustrates the various combinations I tried and the results on
Windows: https://gist.github.com/jeroenooms/9edf97f873f17a4ce5d3.

Assuming that each of these cases is intended behavior, perhaps we
could introduce an additional string type e.g. Rcpp::UTF8String. The
mapping from STRSXP to Rcpp::UTF8String would use
translateCharUTF8(STRING_ELT(x, 0)) and the mapping Rcpp::UTF8String
back to STRSXP would use SET_STRING_ELT(out, 0, mkCharCE(olds,
CE_UTF8)). That would allow for defining c++ functions operating on
UTF8 strings which will work as expected across platforms and locales.
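
To make the proposal concrete, here is a rough standalone sketch of the shape such a type could have. It does not use R or Rcpp: Encoding and MarkedString are hypothetical stand-ins for the encoding mark and the CHARSXP, Utf8String is only an illustration of the idea, and only Latin-1 input is handled (a real version would call translateCharUTF8 and so cover the native encoding too).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for R's per-string encoding mark.
enum class Encoding { Latin1, Utf8 };

// Hypothetical stand-in for a CHARSXP: raw bytes plus an encoding mark.
struct MarkedString {
    std::string bytes;
    Encoding enc;
};

// Sketch of the proposed type: always holds UTF-8 internally.
class Utf8String {
public:
    // Input mapping: plays the role of translateCharUTF8(STRING_ELT(x, 0)).
    explicit Utf8String(const MarkedString& s) : data_(to_utf8(s)) {}

    const std::string& str() const { return data_; }

    // Output mapping: plays the role of mkCharCE(olds, CE_UTF8).
    MarkedString to_marked() const { return {data_, Encoding::Utf8}; }

private:
    static std::string to_utf8(const MarkedString& s) {
        if (s.enc == Encoding::Utf8) return s.bytes;  // already UTF-8, keep as-is
        std::string out;
        for (unsigned char c : s.bytes) {             // Latin-1 to UTF-8
            if (c < 0x80) {
                out.push_back(static_cast<char>(c));
            } else {
                out.push_back(static_cast<char>(0xC0 | (c >> 6)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            }
        }
        return out;
    }

    std::string data_;
};

// Hypothetical usage: the library code only ever sees UTF-8.
void demo_utf8_string() {
    Utf8String in(MarkedString{"Z" "\xFC" "rich", Encoding::Latin1});
    assert(in.str() == std::string("Z" "\xC3\xBC" "rich"));
    assert(in.to_marked().enc == Encoding::Utf8);
}
```

The point of the design is that conversion happens exactly once at each boundary, so code in between never has to ask which encoding it is holding.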
#
That is similar to a path I've followed in Rcpp11/Rcpp14.

What's really missing in R is API access to strings, e.g. testing for equality of two CHARSXP, comparing them, ...

This causes all sorts of problems with dplyr. 

Romain