[PATCH] Improve utf8clen and remove utf8_table4 - R-devel

Sat, Mar 18, 2017 11:31 PM

Given a char `c' which should be the start byte of a utf8 character,
the utf8clen function returns the byte length of the utf8 character.

Before this patch, the utf8clen function would return either:
     * 1 if `c' was an ASCII character or a UTF-8 continuation byte
     * An int in the range [2, 6] indicating the byte length of the
       UTF-8 character
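
For readers without the source at hand, the old behaviour described above
can be sketched roughly as follows. This is an illustration only, not the
code from R or from the attached patch: the names `trailing_bytes` and
`utf8clen_old_sketch` are hypothetical, and the table merely plays the role
that a lookup table such as utf8_table4 would play.

```c
/* Hypothetical lookup table (assumption, not copied from R): number of
   bytes that follow a lead byte of the form 11xxxxxx, indexed by the
   low six bits of that byte. */
static const unsigned char trailing_bytes[64] = {
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,   /* 110xxxxx: 2-byte sequence */
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
    2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,   /* 1110xxxx: 3-byte sequence */
    3,3,3,3,3,3,3,3,                   /* 11110xxx: 4-byte sequence */
    4,4,4,4,                           /* 111110xx: 5-byte sequence */
    5,5,5,5                            /* 1111110x (and 0xFE, 0xFF): 6 */
};

/* Old semantics as described in the message: never reports an error. */
static int utf8clen_old_sketch(char c)
{
    /* ASCII (0xxxxxxx) and continuation bytes (10xxxxxx) both yield 1 */
    if ((c & 0xC0) != 0xC0) return 1;
    return 1 + trailing_bytes[c & 0x3F];
}
```

Note that a continuation byte such as 0x80 is indistinguishable from an
ASCII character in the return value, which is the ambiguity the patch is
aimed at.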

With this patch, the utf8clen function will now return either:
     * -1 if `c' is not a valid UTF-8 start byte
     * The byte length of the UTF-8 character: 1 for an ASCII byte,
       otherwise the number of leading 1 bits in `c'
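
A minimal sketch of the new behaviour, counting leading 1 bits without any
table. It is not the attached patch: the name `utf8clen_new_sketch` is
hypothetical, and rejecting 0xFE and 0xFF (seven or eight leading 1 bits)
is an assumption made here, since the message does not say how those bytes
are handled.

```c
/* New semantics as described in the message: -1 for anything that
   cannot start a UTF-8 character. */
static int utf8clen_new_sketch(char c)
{
    unsigned char b = (unsigned char) c;
    int ones = 0;

    /* Count the leading 1 bits of the byte */
    while (ones < 8 && (b & (0x80 >> ones)))
        ones++;

    if (ones == 0) return 1;   /* 0xxxxxxx: ASCII, one byte */
    if (ones == 1) return -1;  /* 10xxxxxx: continuation byte */
    if (ones > 6)  return -1;  /* 0xFE, 0xFF: assumed invalid here */
    return ones;               /* 2..6: length of the sequence */
}
```

With this shape, `utf8clen_new_sketch('a')` gives 1, a lead byte such as
0xC3 gives 2, and a continuation byte such as 0x80 gives -1.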

I believe returning -1 for continuation bytes makes utf8clen less
error-prone.
The utf8_table4 array is no longer needed and has been removed.

Sahil
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.diff
Type: text/x-patch
Size: 1709 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170318/e8e82a14/attachment.bin>

Duncan Murdoch

Sun, Mar 19, 2017 5:38 AM

utf8clen is used internally by R in more than a dozen places, and is 
likely used in packages as well.  Have you checked that this change in 
semantics won't break any of those uses?

Duncan Murdoch

Sahil Kang

Sun, Mar 19, 2017 1:24 PM

Some of the code that uses utf8clen checks the validity of the utf8 
string before making the call.
However, there were some hairy areas where I felt that the new semantics 
may cause issues (if not now, then in future changes).
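
To make the concern concrete, here is a sketch of the kind of call site
that would need attention. It is not taken from R or from either attached
patch; `start_len` and `count_utf8_chars` are hypothetical names. A loop
that blindly does `s += len` is safe when the length is always at least 1,
but with a -1 return it would step backwards unless the caller checks for
the error first.

```c
#include <stddef.h>

/* Hypothetical stand-in for a utf8clen with the new semantics:
   byte length for a valid start byte, -1 otherwise. */
static int start_len(char c)
{
    unsigned char b = (unsigned char) c;
    if (b < 0x80) return 1;    /* ASCII */
    if (b < 0xC0) return -1;   /* continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return -1;                 /* 0xFE, 0xFF: assumed invalid */
}

/* Counts characters in a NUL-terminated string, or returns -1 if the
   string is malformed. */
static long count_utf8_chars(const char *s)
{
    long n = 0;
    while (*s) {
        int len = start_len(*s);
        if (len < 0) return -1;          /* must check before advancing */
        /* Every following byte must be a continuation byte; this also
           stops us from walking past the terminating NUL. */
        for (int i = 1; i < len; i++)
            if (((unsigned char) s[i] & 0xC0) != 0x80) return -1;
        s += len;
        n++;
    }
    return n;
}
```

Call sites that validate the string beforehand never see the -1, which is
why only the unvalidated ones need updating.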

I've attached two patches:
     * new_semantics.diff keeps the new semantics and updates those
       hairy areas above.
     * old_semantics.diff maintains the old semantics (return 1 even for
       continuation bytes).

I don't think the new semantics will cause issues, especially with the 
updates, but we can err on the side of caution and keep the old 
semantics. I feel that the new semantics provide a clearer interface 
though (the function expects a start byte and should return an error if 
a start byte is not supplied).
In either case, the utf8_table4 array has been removed.

Sahil

-------------- next part --------------
A non-text attachment was scrubbed...
Name: old_semantics.diff
Type: text/x-patch
Size: 1707 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170319/477b61e9/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: new_semantics.diff
Type: text/x-patch
Size: 6332 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170319/477b61e9/attachment-0001.bin>