[PATCH] Improve utf8clen and remove utf8_table4
Some of the code that uses utf8clen checks the validity of the utf8
string before making the call.
However, there were some hairy areas where I felt that the new semantics
may cause issues (if not now, then in future changes).
I've attached two patches:
* new_semantics.diff keeps the new semantics and updates those
hairy areas above.
* old_semantics.diff maintains the old semantics (return 1 even for
continuation bytes).
I don't think the new semantics will cause issues, especially with the
updates, but we can err on the side of caution and keep the old
semantics. I feel that the new semantics provide a clearer interface
though (the function expects a start byte and should return an error if
a start byte is not supplied).
In either case, the utf8_table4 array has been removed.
Sahil
On 03/19/2017 05:38 AM, Duncan Murdoch wrote:
On 19/03/2017 2:31 AM, Sahil Kang wrote:
Given a char `c' which should be the start byte of a utf8 character,
the utf8clen function returns the byte length of the utf8 character.
Before this patch, the utf8clen function would return either:
* 1 if `c' was an ascii character or a utf8 continuation byte
* An int in the range [2, 6] indicating the byte length of the utf8
character
With this patch, the utf8clen function will now return either:
* -1 if `c' is not a valid utf8 start byte
* The byte length of the utf8 character (the number of leading 1's,
really)
I believe returning -1 for continuation bytes makes utf8clen less error
prone.
The utf8_table4 array is no longer needed and has been removed.
utf8clen is used internally by R in more than a dozen places, and is likely used in packages as well. Have you checked that this change in semantics won't break any of those uses? Duncan Murdoch
-------------- next part -------------- A non-text attachment was scrubbed... Name: old_semantics.diff Type: text/x-patch Size: 1707 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170319/477b61e9/attachment.bin> -------------- next part -------------- A non-text attachment was scrubbed... Name: new_semantics.diff Type: text/x-patch Size: 6332 bytes Desc: not available URL: <https://stat.ethz.ch/pipermail/r-devel/attachments/20170319/477b61e9/attachment-0001.bin>