[R-pkg-devel] Warning... unable to translate 'Ekstr<f8>m' to a wide string; Error... input string 1 is invalid
On Tue, 19 Jul 2022 13:23:11 -0500
Spencer Graves <spencer.graves at effectivedefense.org> wrote:
> So what's the recommended fix?
Is subNonStandardCharacters() supposed to work with strings with Encoding(.) == 'unknown' that are also invalid in current locale encoding? (I think it's fair to not support Encoding(.) == 'bytes' for such a function, because such strings aren't supposed to be text.)

If yes, the function itself needs to be fixed. I think that useBytes=TRUE may help, as long as the standardCharacters argument is limited to characters representable in ASCII. Alternatively, find a way to transform the 'x' argument into something that is guaranteed to be valid in its declared encoding. enc2utf8() could be an option, but any invalid bytes are replaced by their <hexadecimal codes>, which defeats the purpose of subNonStandardCharacters(). Find a way to feed the output of Encoding(x) to iconv() as its "from" argument?

If not, it's enough to fix the example.
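To make these options a bit more concrete, here is a minimal sketch. It assumes the offending bytes really are Latin-1, as in the "Ekstr\xf8m" string from example(iconv). The variable names are made up for illustration, and nothing here is taken from subNonStandardCharacters() itself.

```r
# A string built from raw bytes: no declared encoding, and not valid UTF-8.
x <- "Ekstr\xf8m"
Encoding(x)      # "unknown"
validUTF8(x)     # FALSE

# Option 1: work on bytes. With useBytes = TRUE the regex engine never
# tries to decode the string, so the invalid byte cannot cause an error.
# This only makes sense if the pattern and replacement are plain ASCII.
masked <- gsub("[^\\x20-\\x7e]", "_", x, useBytes = TRUE, perl = TRUE)

# Option 2: tell iconv() what the bytes are (ASSUMPTION: Latin-1) and
# convert to UTF-8, which yields a valid, correctly marked string.
fixed <- iconv(x, from = "latin1", to = "UTF-8")
Encoding(fixed)  # "UTF-8"
```

Option 2 only works when the source encoding is known or can be guessed reliably. Encoding(x) returns "unknown" for strings like this one, so it cannot be passed to iconv() as is.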
> If I understand correctly, "\u**" should work with ** being
> f8, f6, df, or fc [all hex digits, I assume?]. However, "\u00**" may
> be preferred over "\u**", and "\u{**}" may be better still.
This is described in ?Quotes, although admittedly harder to find than
desired. The "\u" escape sequences take 1 to 4 hexadecimal digits. As
long as your escape sequence isn't followed by something that looks
like a hexadecimal digit, you can keep it short, like "\uf8m" (m is not
a hex digit). If you want to be 100% unambiguous, either padding the
code point number to 4 digits ("\u00f8m") or wrapping it into braces
("\u{f8}m") is enough. The belt-and-braces approach ("\u{00f8}m") is
not an error, either.
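A quick way to check that the four spellings are interchangeable, and to see what goes wrong when a short escape is followed by a hex digit, is the sketch below. The variable names are only for illustration.

```r
# Four ways to write the same string, "Ekstr" + U+00F8 + "m".
s1 <- "Ekstr\uf8m"       # short form; safe because "m" is not a hex digit
s2 <- "Ekstr\u00f8m"     # padded to 4 digits
s3 <- "Ekstr\u{f8}m"     # braces
s4 <- "Ekstr\u{00f8}m"   # both

all_same <- identical(s1, s2) && identical(s2, s3) && identical(s3, s4)

# The pitfall: "e" IS a hex digit, so "\uf8e" is read as the single
# code point U+0F8E rather than U+00F8 followed by "e".
greedy  <- utf8ToInt("\uf8e")     # one code point: 0x0F8E
guarded <- utf8ToInt("\u{f8}e")   # two code points: 0xF8, then "e"
```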
You can also use the Encoding(x) <- 'latin1' trick to mark the strings
produced from bytes as Latin-1. Then gsub() will work normally, the
same way things happily work in example(iconv).
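For illustration, the Latin-1 marking could look like the sketch below. It assumes the bytes are in fact Latin-1, and the replacement of the letter by "o" is just a stand-in for whatever substitution the package wants to make.

```r
# Same byte string as in example(iconv); initially unmarked.
x <- "Ekstr\xf8m"

# Declare (not convert!) the encoding: the bytes stay the same,
# but R now knows how to interpret them.
Encoding(x) <- "latin1"

# Regular expressions now work without useBytes = TRUE, and the pattern
# can be written with a Unicode escape.
y <- gsub("\u00f8", "o", x)

# Conversion to UTF-8 also succeeds once the encoding is declared.
z <- enc2utf8(x)
```

Note that Encoding<- only changes the mark. If the bytes are actually in some other encoding, the result will be silently wrong rather than an error.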
Best regards, Ivan