Given a char `c' which should be the start byte of a utf8 character,
the utf8clen function returns the byte length of the utf8 character.
Before this patch, the utf8clen function would return either:
* 1 if `c' was an ascii character or a utf8 continuation byte
* An int in the range [2, 6] indicating the byte length of the utf8
character
With this patch, the utf8clen function will now return either:
* -1 if `c' is not a valid utf8 start byte
* The byte length of the utf8 character (the number of leading 1's,
really)
I believe returning -1 for continuation bytes makes utf8clen less error
prone.
The utf8_table4 array is no longer needed and has been removed.
Sahil
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.diff
Type: text/x-patch
Size: 1709 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170318/e8e82a14/attachment.bin>
[PATCH] Improve utf8clen and remove utf8_table4
3 messages · Duncan Murdoch, Sahil Kang
On 19/03/2017 2:31 AM, Sahil Kang wrote:
Given a char `c' which should be the start byte of a utf8 character,
the utf8clen function returns the byte length of the utf8 character.
Before this patch, the utf8clen function would return either:
* 1 if `c' was an ascii character or a utf8 continuation byte
* An int in the range [2, 6] indicating the byte length of the utf8
character
With this patch, the utf8clen function will now return either:
* -1 if `c' is not a valid utf8 start byte
* The byte length of the utf8 character (the number of leading 1's,
really)
I believe returning -1 for continuation bytes makes utf8clen less error
prone.
The utf8_table4 array is no longer needed and has been removed.
utf8clen is used internally by R in more than a dozen places, and is likely used in packages as well. Have you checked that this change in semantics won't break any of those uses? Duncan Murdoch
Some of the code that uses utf8clen checks the validity of the utf8
string before making the call.
However, there were some hairy areas where I felt that the new semantics
may cause issues (if not now, then in future changes).
I've attached two patches:
* new_semantics.diff keeps the new semantics and updates those
hairy areas above.
* old_semantics.diff maintains the old semantics (return 1 even for
continuation bytes).
I don't think the new semantics will cause issues, especially with the
updates, but we can err on the side of caution and keep the old
semantics. I feel that the new semantics provide a clearer interface
though (the function expects a start byte and should return an error if
a start byte is not supplied).
In either case, the utf8_table4 array has been removed.
Sahil
On 03/19/2017 05:38 AM, Duncan Murdoch wrote:
On 19/03/2017 2:31 AM, Sahil Kang wrote:
Given a char `c' which should be the start byte of a utf8 character,
the utf8clen function returns the byte length of the utf8 character.
Before this patch, the utf8clen function would return either:
* 1 if `c' was an ascii character or a utf8 continuation byte
* An int in the range [2, 6] indicating the byte length of the utf8
character
With this patch, the utf8clen function will now return either:
* -1 if `c' is not a valid utf8 start byte
* The byte length of the utf8 character (the number of leading 1's,
really)
I believe returning -1 for continuation bytes makes utf8clen less error
prone.
The utf8_table4 array is no longer needed and has been removed.
utf8clen is used internally by R in more than a dozen places, and is likely used in packages as well. Have you checked that this change in semantics won't break any of those uses? Duncan Murdoch
-------------- next part -------------- A non-text attachment was scrubbed... Name: old_semantics.diff Type: text/x-patch Size: 1707 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170319/477b61e9/attachment.bin> -------------- next part -------------- A non-text attachment was scrubbed... Name: new_semantics.diff Type: text/x-patch Size: 6332 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170319/477b61e9/attachment-0001.bin>