Are uppercase utf8 characters always the same number of bytes as their lowercase variants?

Submitted by 这一生的挚爱 on 2019-12-12 08:35:20

Question


Obviously it is true for the Latin alphabet. But I'm asking this in a conceptual sense, across languages and the Unicode spec.

In practice, this came up when comparing two strings. If you already know they differ in byte length, can you treat that as a guarantee, across all languages, that they are not differently "cased" versions of the same string?


Answer 1:


No.

Consider U+0069 "i", which is the single octet 69 in UTF-8. In Turkish, its uppercase form is U+0130 "İ", which is encoded in UTF-8 as the two-octet sequence C4 B0.

Obligatory note: case is locale-sensitive.
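
A quick way to check the byte counts is the Python sketch below. It is only an illustration: `utf8_len` is a helper name made up here, and Python's `str.upper()` and `str.lower()` are not locale-aware, so the Turkish pair is written out by code point rather than produced by a case conversion. The last three characters are extra examples, not from the answer, where even the locale-independent mappings change the encoded length.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# The pair from the answer: "i" (U+0069) and the Turkish capital "İ" (U+0130).
small_i = "\u0069"
dotted_capital_i = "\u0130"
print(small_i.encode("utf-8").hex())           # 69
print(dotted_capital_i.encode("utf-8").hex())  # c4b0

# Extra examples (assumptions of this sketch, not part of the answer) where
# Python's locale-independent case mappings cross the ASCII boundary.
dotless_i = "\u0131"  # "ı" uppercases to ASCII "I"
long_s = "\u017f"     # "ſ" uppercases to ASCII "S"
kelvin = "\u212a"     # Kelvin sign lowercases to ASCII "k"

for original, converted in [
    (dotless_i, dotless_i.upper()),
    (long_s, long_s.upper()),
    (kelvin, kelvin.lower()),
]:
    print(f"U+{ord(original):04X}: {utf8_len(original)} byte(s) -> "
          f"{converted!r}: {utf8_len(converted)} byte(s)")
```

So a difference in byte length does not rule out that two strings differ only in case.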




Answer 2:


There is no principle or invariant in the Unicode standard that guarantees this. I would be particularly concerned about accented capitals, where there may be a mismatch between precomposition and non-precomposition across cases. However, I can't cite an example of a problem for you.
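
One case that seems to fit this concern is U+01F0 "ǰ", which has no precomposed capital, so its uppercase form is "J" followed by the combining caron U+030C. The Python sketch below shows the resulting length change, followed by one possible way to compare strings without relying on byte length. `same_ignoring_case` is a hypothetical helper name, and the normalize, case-fold, normalize sequence is an assumption about what "same string" should mean, not something the answer prescribes.

```python
import unicodedata

small_j_caron = "\u01f0"       # "ǰ", a single precomposed code point
upper = small_j_caron.upper()  # becomes "J" + combining caron (U+030C)

print([f"U+{ord(ch):04X}" for ch in upper])
print(len(small_j_caron.encode("utf-8")), len(upper.encode("utf-8")))


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after normalization and case folding.

    Hypothetical helper: decomposes (NFD), case-folds, then decomposes
    again, so precomposed and decomposed spellings compare equal.
    """
    def fold(s: str) -> str:
        return unicodedata.normalize(
            "NFD", unicodedata.normalize("NFD", s).casefold()
        )

    return fold(a) == fold(b)


print(same_ignoring_case(small_j_caron, upper))
```

Note that `str.casefold()` is locale-independent, so it will not apply the Turkish rules mentioned in the first answer.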



Source: https://stackoverflow.com/questions/14792841/are-uppercase-utf8-characters-always-the-same-number-of-bytes-as-their-lowercase
