Writing escaped unicode - R-help

David Kulp

Mon, Dec 10, 2012 8:46 PM

I'd like to write unicode strings using the "\u" escape syntax.  According to the documentation, print.default or encodeString will escape unicode using the \u convention.  In practice, I can't make it work.

[1] "Unicode character: ?"

[1] "Unicode character: ?"

I want to write the string back out in the same escape formatting as I read it in.  This is because I'm interfacing with some Ruby code that requires unicode to be in this escaped format.

Thanks in advance!

Jan T. Kim

Tue, Dec 11, 2012 2:49 AM

On Mon, Dec 10, 2012 at 11:46:40PM -0500, David Kulp wrote:

as I read the documentation, encodeString escapes control characters,
but not "unicode characters". The notion of a "unicode character" is
not entirely well defined, considering that the very mission of the
unicode consortium is to make sure that there are no non-unicode
characters...  ;-)

From this it follows that replacing all characters with their \uxxxx
representation, e.g. by

    paste(sprintf("\\u%04x", utf8ToInt(b)), collapse = "");

should work with the Ruby client you try to talk to. Obviously, this
bloats the string rather more than necessary (particularly if most of
the characters are in the ASCII range), but if the volume you're
piping into the client is small, this may be good enough.
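
As a rough illustration of the size cost, here is a minimal sketch of the all-characters approach applied to a short string. The variable names `s` and `escaped` and the sample text are assumptions made for this example, and `enc2utf8()` is added only as a precaution so that `utf8ToInt()` sees UTF-8 input.

```r
# Sample input (hypothetical): "\u00e9" is e with an acute accent
s <- "caf\u00e9"

# Escape every code point as \uXXXX, ASCII characters included
escaped <- paste(sprintf("\\u%04x", utf8ToInt(enc2utf8(s))), collapse = "")

cat(escaped, "\n")   # prints \u0063\u0061\u0066\u00e9
nchar(s)             # 4 characters in
nchar(escaped)       # 24 characters out (6 per input character)
```

Each input character costs six output characters, which is the bloat mentioned above.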

Best regards, Jan

+- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*

Duncan Murdoch

Tue, Dec 11, 2012 4:24 AM

On 12-12-11 5:49 AM, Jan T Kim wrote:> On Mon, Dec 10, 2012 at

11:46:40PM -0500, David Kulp wrote:

>> I'd like to write unicode strings using the "\u" escape syntax. 
According to the documentation, print.default or encodeString will 
escape unicode using the \u convention.  In practice, I can't make it work.
 >>
 >>> b="Unicode character: \ufffd"
 >>> print.default(b)
 >> [1] "Unicode character: ???"
 >>> encodeString(b)
 >> [1] "Unicode character: ???"
 >>
 >> I want to write the string back out in the same escape formatting as 
I read it in.  This is because I'm interfacing with some Ruby code that 
requires unicode to be in this escaped format.
 >
 > as I read the documentation, encodeString escapes control characters,
 > but not "unicode characters". The notion of a "unicode character" is
 > not entirely well defined, considering that the very mission of the
 > unicode consortium is to make sure that there are no non-unicode
 > characters...  ;-)
 >
 >>From this it follows that replacing all characters with their \uxxxx
 > representation, e.g. by
 >
 >      paste(sprintf("\\u%04x", utf8ToInt(b)), collapse = "");
 >
 > should work with the Ruby client you try to talk to. Obviously, this
 > bloats the string rather more than necessary (particularly if most of
 > the characters are in the ASCII range), but if the volume you're
 > piping into the client is small, this may be good enough.

It's not too hard to do this only for the ones that need escaping.  If 
you want to convert control characters, this works:

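code <- utf8ToInt(b)   # integer code points of the string
# Keep printable ASCII (codes 32..127) as-is; escape everything else as \uXXXX
paste( ifelse(31 < code & code < 128, intToUtf8(code, multiple=TRUE),
                                       sprintf("\\u%04x", code)),
        collapse="")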

(And David should remember to use cat() or similar to print it, or the 
backslashes in the strings will appear to be doubled.)
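
To make the doubled-backslash point concrete, here is a minimal sketch that wraps the selective escaping in a helper and compares print() with cat(). The function name `escape_non_ascii` and the variable `e` are hypothetical, and `enc2utf8()` is added only as a precaution so that `utf8ToInt()` sees UTF-8 input.

```r
# Hypothetical helper: keep printable ASCII (codes 32..127) as-is,
# escape every other code point as \uXXXX.
escape_non_ascii <- function(x) {
  code <- utf8ToInt(enc2utf8(x))
  keep <- code > 31 & code < 128
  out <- ifelse(keep,
                intToUtf8(code, multiple = TRUE),
                sprintf("\\u%04x", code))
  paste(out, collapse = "")
}

b <- "Unicode character: \ufffd"
e <- escape_non_ascii(b)

print(e)      # [1] "Unicode character: \\ufffd"  (print escapes the backslash)
cat(e, "\n")  # Unicode character: \ufffd
nchar(e)      # 25: only one backslash is actually stored
```

The string itself holds a single backslash; the second one appears only in print()'s quoted display, so writing with cat() or writeLines() should give the Ruby side what it expects.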

Duncan Murdoch