Native character set is wrong for Unicode builds for Windows
9 messages · Duncan Murdoch, Winston Chang, maillist at tlink.de
When I send some outlandish characters through enc2native (or format) in
R 3.1.2 on Ubuntu trusty, it works quite well:
> "?????"
[1] "?????"
> enc2native("?????")
[1] "?????"
> Encoding(enc2native("?????"))
[1] "UTF-8"
In Windows the result is different:
> "?????"
[1] "?????"
> enc2native("?????")
[1] "??<U+0394><U+040A><U+05EA>"
> Encoding(enc2native("?????"))
[1] "latin1"
And this is wrong. The native character set of a Unicode application
under Windows is *Unicode*. enc2native should do the same under Windows
as it does on Ubuntu. Also, the "unknown" encoding should be changed to
mean the same as "UTF-8", exactly as it does on Linux.
On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
> And this is wrong. The native character set of a unicode application under Windows is *Unicode*. enc2native should do the same under Windows as it does on Ubuntu. Also the "unknown" encoding should be changed to mean the same as "UTF-8" exactly as it is on Linux.
What is a "unicode application", and what makes you think R is one? R is being told by Windows that your native encoding is latin1. Perhaps Windows 8 supports UTF-8 as a native encoding (I've never used it), but previous versions of Windows didn't.
Duncan Murdoch
On Thu, Feb 26, 2015 at 2:09 PM, maillist at tlink.de <maillist at tlink.de> wrote:
> And this is wrong. The native character set of a unicode application under Windows is *Unicode*. enc2native should do the same under Windows as it does on Ubuntu. Also the "unknown" encoding should be changed to mean the same as "UTF-8" exactly as it is on Linux.
I think you're mixing up the term "character set" with the encoding for a character set. Unicode is a character set. UTF-8 is one of many encodings of Unicode.
UTF-8 may be the native character encoding in Ubuntu, but it's not the native encoding in Windows. According to this, what counts as the native encoding in Windows depends on the code page:
http://stackoverflow.com/a/4649507
So you shouldn't expect enc2native to do the same thing on Linux and Windows. If you really want UTF-8, you can use enc2utf8.
-Winston
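
The distinction between a character set and its encodings can be sketched in a few lines of R. This is only an illustration, not something posted in the thread: it uses U+00E9 as an arbitrary example character and assumes the platform's iconv knows the name "latin1". The same character is stored as different bytes depending on the encoding, and enc2utf8 brings it back to UTF-8.

```r
# One Unicode character (U+00E9), written as an escape so the source is pure ASCII.
x <- "\u00e9"
Encoding(x)        # "UTF-8": strings built from \u escapes are marked as UTF-8
charToRaw(x)       # c3 a9 (two bytes in UTF-8)

# The same character re-encoded as latin1 is a single, different byte.
y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)        # "latin1"
charToRaw(y)       # e9

# enc2utf8() converts a marked string back to UTF-8, whatever the platform.
z <- enc2utf8(y)
identical(z, x)    # TRUE
```

Unlike enc2native, the result of enc2utf8 does not depend on the locale or code page of the machine running the code.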
On 26/02/2015 3:09 PM, maillist at tlink.de wrote:
>> And this is wrong. The native character set of a unicode application under Windows is *Unicode*. enc2native should do the same under Windows as it does on Ubuntu. Also the "unknown" encoding should be changed to mean the same as "UTF-8" exactly as it is on Linux.
> What is a "unicode application", and what makes you think R is one? R is being told by Windows that your native encoding is latin1. Perhaps Windows 8 supports UTF-8 as a native encoding (I've never used it), but previous versions of Windows didn't. Duncan Murdoch
A Unicode application is a program that uses the Unicode API of Windows - the functions with the ending W. For such an application the system code page (native encoding) is completely irrelevant. The system code page is just a compatibility feature that enables Windows NT/Vista/7/8 to run applications that were developed for Windows 95, which didn't have Unicode support. But this line of operating systems has been dead for 10 years now.
R obviously is a Unicode application, because it can print - or read from the clipboard - characters like "???" that are not in my system code page, which is not possible over the legacy API.
Neither the Unicode API nor the legacy API accepts UTF-8. The legacy API needs strings encoded according to the active code page, and the Unicode API wants UTF-16. If you have UTF-8, you need to convert it either to the active code page, which will lose all characters that are not covered by it, or to UTF-16 and use the Unicode functions. But this is not the problem; the Windows interface functions of R are working quite nicely with Unicode already.
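
The two conversion paths described here can be imitated from R with iconv. The following is a rough sketch under stated assumptions, not R's internal code: latin1 stands in for the "active code page", and the three characters are the ones that appear as <U+0394><U+040A><U+05EA> in the Windows output above.

```r
# The three characters shown as <U+....> escapes in the Windows output.
s <- "\u0394\u040a\u05ea"

# Path 1: UTF-8 to UTF-16 (little-endian, as the W functions expect).
# toRaw = TRUE returns the bytes instead of an R string.
utf16 <- iconv(s, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
utf16   # 94 03 0a 04 ea 05

# Path 2: UTF-8 to an 8-bit code page (latin1 here, as an assumption).
# None of the three characters exists in latin1, so they cannot survive;
# sub = "byte" replaces each unconvertible byte with a hex escape.
lossy <- iconv(s, from = "UTF-8", to = "latin1", sub = "byte")
lossy
```

The UTF-16 path keeps every character, while the code page path can only substitute placeholders for characters outside the code page.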
On 26.02.2015 at 23:44, Winston Chang wrote:
> So you shouldn't expect enc2native to do the same thing on Linux and
> Windows. If you really want UTF-8, you can use enc2utf8.
Maybe I'm expecting too much, but I would rather have R not produce garbage like "??<U+0394><U+040A><U+05EA>". And while I can use enc2utf8 to convert from UTF-8 to UTF-8, this does not fix the many places - like "format" - where enc2native is used and that are broken because of this.
On 26/02/2015 6:34 PM, maillist at tlink.de wrote:
> A unicode application is a program that uses the unicode API of Windows
R uses those functions, so I guess it is a "unicode application". But internally it uses an 8-bit encoding (normally the native one for the platform it is running on, which in your case is apparently latin1).
> - the functions with the ending W. For such an application the system code page (native encoding) is completely irrelevant. The system code page is just a compatibility feature that enables Windows NT/Vista/7/8 to run applications that were developed for Windows 95, which didn't have unicode support.
Windows 95 had UCS-2 support, which was pretty close to UTF-16.
> But this line of operating systems has been dead for 10 years now. R obviously is a unicode application because it can print - or read from the clipboard - characters like "???" that are not in my system code page, which is not possible over the legacy API.
So "unicode application" is something you just made up. If you use Windows development tools, they have macros to convert generic functions to either A or W versions. R doesn't use those. It calls the W functions when it has UTF-16 characters, and A functions when it has native characters.
I would love it if R was a UTF-8 application, because it would make life so much simpler, but Windows doesn't support that. So R needs to do tons of conversions. If you don't like that, you probably need to stick with Ubuntu.
Duncan Murdoch
> Neither the unicode API nor the legacy API accepts UTF-8. The legacy API needs strings encoded according to the active code page, and the unicode API wants UTF-16. If you have UTF-8, you need to convert it either to the active code page, which will lose all characters that are not covered by it, or to UTF-16 and use the unicode functions. But this is not the problem; the Windows interface functions of R are working quite nicely with unicode already.
______________________________________________ R-devel at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
On 27.02.2015 at 03:13, Duncan Murdoch wrote:
> I would love it if R was a UTF-8 application, because it would make life so much simpler, but Windows doesn't support that. So R needs to do tons of conversions. If you don't like that, you probably need to stick with Ubuntu.
I am not complaining about those conversions. They work just fine already. I am complaining about enc2native breaking things in the Windows builds. An assignment like
s <- format("?????")
has no interaction with Windows at all, yet "s" contains garbage like
"??<U+0394><U+040A><U+05EA>"
after that. And if a native encoding of UTF-8 - as defined by enc2native - works in Ubuntu, why shouldn't it work in Windows?
On 27/02/2015 2:31 AM, maillist at tlink.de wrote:
> I am complaining about enc2native breaking things in the windows builds. An assignment like
> s <- format("?????")
> has no interaction with windows at all yet "s" contains garbage like
> "??<U+0394><U+040A><U+05EA>"
> after that. And if a native encoding of UTF-8 - as defined by enc2native - works in Ubuntu why shouldn't it work in Windows?
Because in Ubuntu, UTF-8 is the native encoding, and in your Windows system, latin1 is the native encoding.
But I do agree that the format() issue is a problem. I haven't traced through the code, but I think the string "?????" is read using Windows API functions that return a UTF-16 result, then converted by R to UTF-8. So format() should see that it is a UTF-8 string and not convert it to the native encoding. There is nothing wrong with enc2native(), it's doing what you asked for. The problem is that format() is using it.
Duncan Murdoch
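
Whether a particular string is affected by a trip through the native encoding can be checked directly. The sketch below is an assumption-laden illustration: survives_native is a hypothetical helper, not a function in R, and what it returns for non-ASCII input depends on the locale of the session that runs it.

```r
# Hypothetical helper (not part of R): for each element, TRUE if converting to
# the native encoding and back to UTF-8 gives the original string again.
survives_native <- function(x) {
  enc2utf8(enc2native(x)) == enc2utf8(x)
}

# What R believes about the current locale; the "UTF-8" element is TRUE
# when the native encoding is UTF-8.
info <- l10n_info()
info[["UTF-8"]]

# Plain ASCII always survives. The second string survives only if the native
# encoding can represent U+0394, U+040A and U+05EA.
survives_native(c("abc", "\u0394\u040a\u05ea"))
```

A check like this makes it possible to tell in advance which strings a function that calls enc2native internally would alter on a given machine.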
On 27.02.2015 at 11:49, Duncan Murdoch wrote:
> There is nothing wrong with enc2native(), it's doing what you asked for. The problem is that format() is using it.
I would expect that every function that uses enc2native is broken in
this respect, because it will invariably scramble most Unicode characters
in the process, and I can't think of a case where this would actually be
wanted.
Functions that really need something other than UTF-8 are probably using
iconv and getOption("encoding") anyway, as this allows the desired
encoding to be specified much more flexibly.
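
As an illustration of choosing the encoding explicitly instead of going through the native one, the sketch below writes UTF-8 bytes to a file and reads them back. It is a minimal example under stated assumptions: the temporary file and the three sample characters are arbitrary, and getOption("encoding") is assumed to be at its default of "native.enc", which means connections do not re-encode.

```r
x <- "\u0394\u040a\u05ea"

getOption("encoding")   # "native.enc" unless it has been changed

# Write the UTF-8 bytes unchanged: useBytes = TRUE stops writeLines() from
# translating the string to the native encoding first.
tmp <- tempfile()
con <- file(tmp, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the bytes back and declare them to be UTF-8 (no conversion happens).
y <- readLines(tmp, encoding = "UTF-8")
identical(y, x)         # TRUE
unlink(tmp)
```

Because no step converts to the native encoding, the round trip does not depend on the code page of the machine.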