
Plotting the ASCII character set.

On Sun, 4 Jul 2021 13:59:49 +1200
Rolf Turner <r.turner at auckland.ac.nz> wrote:

            
Interesting. I didn't pay attention to it at first, but now I see that
a range of code points, U+0080 to U+009F, corresponds to control
characters (also, U+00A0 is a no-break space), not anything
printable. Also, Latin-1 doesn't define any meaning for bytes
0x80..0x9f, but here they are decoded to same-valued Unicode code
points. And the actual code point for € is U+20AC, not even close to
what we're working with.

You are right. I didn't know that, but my reading of the function
translateToNative in src/main/sysutils.c suggests that R decodes
strings marked as 'latin1' as Windows-1252 (if it's available for the
system iconv()) and uses the actual Latin-1 as a fallback.
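
To make the difference concrete, here is a minimal sketch that decodes
byte 0x80 both ways with an explicit iconv() call. It assumes the
system iconv() knows both encoding names; code_point is a helper name
made up for this illustration, not anything from R itself.

```r
# Byte 0x80 as a one-character string with no declared encoding.
x <- rawToChar(as.raw(0x80))

# Hypothetical helper: decode `s` from encoding `from` and format the
# resulting Unicode code point.
code_point <- function(s, from) {
  sprintf("U+%04X", utf8ToInt(iconv(s, from, "UTF-8")))
}

code_point(x, "ISO-8859-1")  # "U+0080", a C1 control character
code_point(x, "CP1252")      # "U+20AC", the euro sign
```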

?Encoding does warn that 'latin1' is ambiguous and system-dependent
with regards to bytes 0x80..0x9f, so text() seems to be right to use
Latin-1 and not Windows-1252 when trying to plot byte 0x80 encoded as
CE_LATIN1 as U+0080. Although there's a /* FIXME: allow CP1252? */
comment in src/main/sysutils.c, function reEnc, which is used by text().
I think that iconv(a, 'CP1252', '', '\ufffd') should work for you. At
least it seems to work for the ? sign. It does leave the following
bytes undefined, represented as ? U+FFFD REPLACEMENT CHARACTER:

as.raw(which(is.na(
 iconv(sapply(as.raw(1:255), rawToChar), 'CP1252', '')
)))
# [1] 81 8d 8f 90 9d

Not sure what can be done about those. With Latin-1, they would
correspond to unprintable control characters anyway.
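
For illustration, here is a rough sketch of what the sub argument of
iconv() does with a byte that has no CP1252 mapping. It assumes a
UTF-8 locale and targets "UTF-8" explicitly instead of the native
encoding. Whether 0x81 counts as unconvertible depends on the system
iconv(), so no output is shown.

```r
# Bytes: 0x80 (euro sign in CP1252), 0x81 (one of the bytes reported
# as undefined above), 0xE9 (e with acute accent).
x <- rawToChar(as.raw(c(0x80, 0x81, 0xe9)))

# Without `sub`, a single unconvertible byte makes the result NA.
iconv(x, "CP1252", "UTF-8")

# With `sub`, only the unconvertible byte is replaced and the rest of
# the string is converted as usual.
y <- iconv(x, "CP1252", "UTF-8", sub = "\ufffd")
utf8ToInt(y)  # code points of the converted string
```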