
source(), parse(), and foreign UTF-8 characters

On 09/05/2017 3:42 AM, Kirill Müller wrote:
Those are good long-term goals, though I think the effort is easier than 
you think.  Rather than attempting to do it all at once, you should look 
for ways to do it gradually and submit self-contained patches.  In many 
cases it doesn't matter if strings are left in the local encoding, 
because the encoding doesn't matter.  The problems arise when UTF-8 
strings are converted to the local encoding before it's necessary, 
because that's a lossy conversion.  So a simple way to proceed is to 
identify where these conversions occur, and remove them one-by-one.
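
To make the lossy step concrete, here is a minimal sketch in Python
(not R's own C code).  It assumes cp1252 as the local encoding, and the
helper name via_local_encoding is made up for this illustration.

```python
def via_local_encoding(text, encoding="cp1252"):
    """Round-trip text through an assumed local encoding.

    Characters the encoding cannot represent are replaced by '?',
    which is the information loss described above.
    """
    return text.encode(encoding, errors="replace").decode(encoding)


# "Müller" followed by two CJK characters that cp1252 cannot represent.
original = "M\u00fcller \u4e2d\u6587"

# Early conversion to the local encoding: the CJK characters are lost.
converted = via_local_encoding(original)   # "Müller ??"

# Keeping the string in UTF-8 preserves every character.
kept = via_local_encoding(original, "utf-8")
```

The "ü" survives only because cp1252 happens to contain it; anything
outside the local code page is gone for good once the conversion has
happened, which is why deferring the conversion matters.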

Currently I'm working on bug 16098, "Windows doesn't handle high Unicode 
code points".  It doesn't require many changes at all to handle input of 
those characters; all the remaining issues are avoiding the problems you 
identify above.  The origin of the issue is the fact that in Windows 
wchar_t is only 16 bits (not big enough to hold all Unicode code 
points).  As far as I know, Windows has no standard type to hold a 
Unicode code point; most of the run-time functions still use the 16-bit 
wchar_t.
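
The 16-bit limit can be illustrated with a short sketch in Python,
used here as a stand-in for the C code in question; utf16_units is a
hypothetical helper.  A code point above U+FFFF does not fit in one
16-bit unit, so UTF-16 splits it into a surrogate pair, and a routine
that inspects one wchar_t at a time sees only half a character.

```python
def utf16_units(code_point):
    """Return the 16-bit UTF-16 code units for one Unicode code point."""
    if code_point < 0x10000:
        # Fits in a single 16-bit unit.
        return [code_point]
    # Split the 20 bits above 0x10000 into a high and a low surrogate.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


bmp = utf16_units(0x00FC)      # U+00FC (ü): one unit, [0x00FC]
astral = utf16_units(0x1F600)  # U+1F600: two units, [0xD83D, 0xDE00]

# Cross-check against Python's own UTF-16 encoder.
raw = "\U0001F600".encode("utf-16-le")
from_codec = [int.from_bytes(raw[i:i + 2], "little")
              for i in range(0, len(raw), 2)]
```

Neither 0xD83D nor 0xDE00 is a printable character on its own, so a
per-unit classification of such input cannot give a sensible answer.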

I think once that bug is dealt with, 90+% of the remaining issues could 
be solved by avoiding translateChar on Windows.  This could be done by 
avoiding it everywhere, or by acting as though Windows is running in a 
UTF-8 locale until you actually need to write to a file.  Other systems 
tend to have UTF-8 locales in common use, so they're already fine.

You offered to spend time on this.  I'd appreciate some checks of the 
patch I'm developing for 16098, and also some research into how certain 
things (e.g. the iswprint function) are handled on Windows.

Duncan Murdoch