Normalization does not preserve code point

Submitted by 爷,独闯天下 on 2020-01-03 02:21:12

Question


Can anyone please explain to me why NFD normalization of U+2126 (Ω) and U+03A9 (Ω) yields the same representation and does not preserve the code point? I would have expected this behaviour only for NFKD and NFKC (and for characters with diacritics).

import unicodedata

# U+2126 is OHM SIGN, U+03A9 is GREEK CAPITAL LETTER OMEGA
result1 = unicodedata.normalize("NFD", u"\u2126")
result2 = unicodedata.normalize("NFD", u"\u03A9")
print("NFD: " + repr(result1))
print("NFD: " + repr(result2))

Output:

NFD: u'\u03a9'
NFD: u'\u03a9'

Answer 1:


These are known as "singleton decompositions", and exist for characters like U+2126 (Ω) that are present in Unicode for compatibility with existing standards. They are not "compatibility decompositions" (like U+1D6C0 𝛀) because they are both visually and semantically identical to another code point (in this case, U+03A9 Ω).
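The difference between a singleton (canonical) decomposition and a compatibility decomposition can be inspected in the Unicode character database. The following is a minimal sketch, assuming Python 3; the helper name decomposition_kind and the constant names are made up for illustration. It relies on the fact that unicodedata.decomposition() returns the raw mapping, where compatibility mappings carry a leading tag such as <font>.

```python
import unicodedata

# Sample characters (constant names are illustrative only).
OHM_SIGN = "\u2126"        # OHM SIGN
GREEK_OMEGA = "\u03A9"     # GREEK CAPITAL LETTER OMEGA
BOLD_OMEGA = "\U0001D6C0"  # MATHEMATICAL BOLD CAPITAL OMEGA


def decomposition_kind(ch):
    """Classify a character's decomposition mapping (hypothetical helper)."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings start with a tag such as "<font>" or "<compat>";
    # canonical mappings (including singletons) have no tag.
    return "compatibility" if mapping.startswith("<") else "canonical"


for ch in (OHM_SIGN, GREEK_OMEGA, BOLD_OMEGA):
    print("U+%04X" % ord(ch), repr(unicodedata.decomposition(ch)), decomposition_kind(ch))
```

U+2126 should report a canonical mapping to the single code point 03A9, which is what makes it a singleton, while U+1D6C0 should report a tagged compatibility mapping.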

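Because they essentially duplicate another code point, one is chosen as the "preferred form" (here U+03A9) and the other (here U+2126) is always replaced by it when normalised, whichever of the four forms (NFD, NFC, NFKD, NFKC) is used. The replaced code point is in effect discouraged for new text, although it remains a valid character.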
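To see that the replacement happens in every normalization form, and not only in the compatibility forms, the sketch below (assuming Python 3; normalize_all is a hypothetical helper, not part of unicodedata) runs all four forms over U+2126 and, for contrast, over U+1D6C0, which only has a compatibility decomposition.

```python
import unicodedata

FORMS = ("NFD", "NFC", "NFKD", "NFKC")


def normalize_all(text):
    """Return a dict mapping each normalization form to the normalized text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


ohm = normalize_all("\u2126")       # OHM SIGN (singleton decomposition)
bold = normalize_all("\U0001D6C0")  # MATHEMATICAL BOLD CAPITAL OMEGA

for form in FORMS:
    # ascii() prints escapes, so visually identical glyphs stay distinguishable.
    print(form, ascii(ohm[form]), ascii(bold[form]))
```

The Ohm sign should come out as U+03A9 under all four forms, whereas the bold omega should be left unchanged by NFD and NFC and mapped to U+03A9 only by NFKD and NFKC.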



Source: https://stackoverflow.com/questions/31899371/normalization-does-not-preserve-code-point
