I have a file that includes Japanese characters encoded using the "JIS_X0208-1997" encoding. According to iconvlist(), an earlier revision "JIS_X0208-1990" is supported, so I'd like to try that to decode them. However, I can't seem to find how to provide input to iconv() to do it.

This is a two-byte encoding, so one character has bytes

> as.raw(result[[1]]$kanji)
[1] b0 a1

But this is being interpreted as two characters by iconv():

> iconv(as.raw(result[[1]]$kanji), from = "JIS_X0208-1990", to = "UTF-8")
[1] "?" "?"

I can't seem to find any input that iconv() will accept to treat this as a single character. (I believe the answer should be 亜, if that helps.)

How do I tell it to use 0xb0a1 (or 0xa1b0, if that's the right byte order)? I just see NA:

> iconv(0xb0a1, from = "JIS_X0208-1990", to = "UTF-8")
[1] NA
> iconv(0xa1b0, from = "JIS_X0208-1990", to = "UTF-8")
[1] NA

Duncan Murdoch
Converting two byte encoding to UTF-8
2 messages · Duncan Murdoch
I have solved it!

First, the bytes I have are offset by 0x80 from what they should contain. The actual encoding of 亜 is 0x30 0x21.

But subtracting 0x80 isn't enough; they are still treated as two characters:

> iconv(as.raw(result[[1]]$kanji-0x80), from = "JIS_X0208-1990", to="UTF-8")
[1] "?" "?"

However, if I put those bytes in a list entry, it works:

> iconv(list(as.raw(result[[1]]$kanji-0x80)), from = "JIS_X0208-1990", to="UTF-8")
[1] "亜"

Duncan Murdoch
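For anyone finding this thread later, the steps above can be put together as a small self-contained sketch. The variable kanji is a hypothetical stand-in for result[[1]]$kanji, and enc, as_strings, jis_bytes and out are names made up for illustration. The likely reason the list() wrapper matters is that iconv() coerces a non-list argument with as.character(), so a raw vector of length two becomes two strings, while a list of raw vectors is treated as one string per list element. Whether "JIS_X0208-1990" is available depends on the iconv implementation R was built against, so the conversion is guarded.

```r
# Hypothetical stand-in for result[[1]]$kanji: the two bytes as numbers
kanji <- c(0xb0, 0xa1)

# A bare raw vector is coerced with as.character(), giving one string
# per byte; this is why iconv() reported two "characters"
as_strings <- as.character(as.raw(kanji))
print(as_strings)

# Subtract 0x80 from each byte to get the JIS X 0208 code 0x30 0x21
jis_bytes <- as.raw(kanji - 0x80)
print(jis_bytes)

# Wrap the raw vector in a list so both bytes are converted together.
# The encoding name may not exist on every platform, so check first.
enc <- "JIS_X0208-1990"
out <- NA_character_
if (enc %in% iconvlist()) {
  out <- iconv(list(jis_bytes), from = enc, to = "UTF-8")
}
print(out)
```

If the encoding is not listed by iconvlist() on a given system, out simply stays NA in this sketch rather than raising an error.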