Question
Can anyone please explain why NFD normalization of U+2126 (Ω) and U+03A9 (Ω) results in the same representation and does not preserve the code point? I would have expected this behaviour only for NFKD and NFKC (and for characters with diacritics).
import unicodedata

# U+2126 is OHM SIGN, U+03A9 is GREEK CAPITAL LETTER OMEGA
result1 = unicodedata.normalize("NFD", u"\u2126")
result2 = unicodedata.normalize("NFD", u"\u03A9")
print("NFD: " + repr(result1))
print("NFD: " + repr(result2))
Output (Python 2):
NFD: u'\u03a9'
NFD: u'\u03a9'
Answer 1:
These are known as "singleton decompositions". They exist for characters like U+2126 (Ω) that are present in Unicode for compatibility with existing standards. They are canonical decompositions, not "compatibility decompositions" (like that of U+1D6C0 𝛀), because the character is both visually and semantically identical to another code point (in this case, U+03A9 Ω).
Because they essentially duplicate another code point, one is chosen as the "preferred form" and the other is always replaced by it when normalised (into any form, including NFD and NFC). The replaced form (here U+2126) is essentially deprecated.
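The difference between a singleton (canonical) decomposition and a compatibility decomposition can be checked with the standard `unicodedata` module. The following is a minimal sketch, assuming Python 3; the variable names (`OHM`, `OMEGA`, `BOLD_OMEGA`, `ohm_results`, and so on) are only illustrative and are not part of the original answer.

```python
import unicodedata

OHM = "\u2126"             # OHM SIGN
OMEGA = "\u03A9"           # GREEK CAPITAL LETTER OMEGA
BOLD_OMEGA = "\U0001D6C0"  # MATHEMATICAL BOLD CAPITAL OMEGA

# Raw decomposition mappings from the Unicode Character Database.
# A mapping without a <tag> is canonical; one with a tag such as
# <font> is a compatibility mapping.
ohm_mapping = unicodedata.decomposition(OHM)          # '03A9'
bold_mapping = unicodedata.decomposition(BOLD_OMEGA)  # '<font> 03A9'

forms = ("NFD", "NFC", "NFKD", "NFKC")

# Normalise both characters into every form.
ohm_results = {form: unicodedata.normalize(form, OHM) for form in forms}
bold_results = {form: unicodedata.normalize(form, BOLD_OMEGA) for form in forms}

for form in forms:
    print(form,
          "ohm -> U+%04X" % ord(ohm_results[form]),
          "bold omega -> U+%04X" % ord(bold_results[form]))
```

Under these assumptions, the ohm sign should map to U+03A9 in all four forms, while the bold omega should stay at U+1D6C0 under NFD and NFC and only become U+03A9 under NFKD and NFKC.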
Source: https://stackoverflow.com/questions/31899371/normalization-does-not-preserve-code-point