Unicode subscripts and superscripts in identifiers, why does Python consider Xu == Xᵘ == Xᵤ?

Submitted by 偶尔善良 on 2020-06-10 08:33:28

Question


Python allows Unicode identifiers. I defined Xᵘ = 42, expecting Xu and Xᵤ to result in a NameError. But in reality, when I define Xᵘ, Python (silently?) turns Xᵘ into Xu, which strikes me as a somewhat unpythonic thing to do. Why is this happening?

>>> Xᵘ = 42
>>> print((Xu, Xᵘ, Xᵤ))
(42, 42, 42)

Answer 1:


Python converts all identifiers to their NFKC normal form; from the Identifiers section of the reference documentation:

All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

The NFKC form of both the superscript and the subscript character is the lowercase u:

>>> import unicodedata
>>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
'Xu Xu'

So in the end, all you have is a single identifier, Xu:

>>> import dis
>>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
  1           0 LOAD_CONST               0 (42)
              2 STORE_NAME               0 (Xu)

  2           4 LOAD_NAME                1 (print)
              6 LOAD_NAME                0 (Xu)
              8 LOAD_NAME                0 (Xu)
             10 LOAD_NAME                0 (Xu)
             12 BUILD_TUPLE              3
             14 CALL_FUNCTION            1
             16 POP_TOP
             18 LOAD_CONST               1 (None)
             20 RETURN_VALUE

The disassembly above shows that the identifiers have already been normalised by the time bytecode is produced. The normalisation happens during parsing: every identifier is normalised when the AST (Abstract Syntax Tree) is created, and the compiler then uses that tree to produce bytecode. The exact opcodes vary between Python versions, but the name is Xu in every case.
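
The same effect can be seen one step earlier, in the AST itself. The following is a minimal sketch using the standard ast module; the variable names source and names are only for illustration and are not part of the original answer.

```python
import ast

# Same source as in the disassembly, with three spellings of the name.
source = "Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))"
tree = ast.parse(source)

# Collect the identifier stored on every Name node in the tree.
names = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]

# Only the NFKC-normalised spelling (plus print) survives parsing.
print(set(names) == {"Xu", "print"})
```

No bytecode is involved here, so the check does not depend on which opcodes a given Python version emits.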

Identifiers are normalized to avoid many potential 'look-alike' bugs, where you could otherwise end up using both ﬁnd() (written with the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII characters nd) and find(), and wonder why your code has a bug.
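
A short sketch of that ligature case, assuming nothing beyond the standard library. The names ligature_name, ascii_name and namespace are illustrative only.

```python
import unicodedata

ligature_name = "\ufb01nd"  # U+FB01 LATIN SMALL LIGATURE FI followed by ASCII "nd"
ascii_name = "find"

# As plain strings, the two spellings are different.
strings_differ = ligature_name != ascii_name

# NFKC replaces the ligature with the two ASCII letters "fi".
normalized = unicodedata.normalize("NFKC", ligature_name)

# Assigning through the ligature spelling binds the ASCII name.
namespace = {}
exec(ligature_name + " = 1", namespace)

print(strings_differ, normalized == ascii_name, ascii_name in namespace)
```

Without normalization, the two spellings would be two separate names that look identical in most fonts.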




Answer 2:


Python, as of version 3.0, supports non-ASCII identifiers. During parsing, identifiers are converted using NFKC normalization, and any identifiers whose normalized values are equal are considered the same identifier.

See PEP 3131 for more details. https://www.python.org/dev/peps/pep-3131/
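
One consequence worth keeping in mind is that the normalization applies to identifiers in source code, not to ordinary strings. The sketch below illustrates this; same_identifier is a hypothetical helper written for this example, not something provided by Python or PEP 3131, and it assumes both arguments are valid identifiers.

```python
import unicodedata


def same_identifier(a: str, b: str) -> bool:
    """Hypothetical helper: compare two identifier spellings by their NFKC forms."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# Run the assignment from the question in a fresh namespace.
namespace = {}
exec("Xᵘ = 42", namespace)

print(same_identifier("Xᵘ", "Xᵤ"))  # both spellings normalize to "Xu"
print("Xu" in namespace)            # the name is stored in normalized form
print("Xᵘ" in namespace)            # a plain string key is not normalized
```

So a lookup by string, for example through a namespace dictionary, has to use the normalized spelling Xu.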



Source: https://stackoverflow.com/questions/48404881/unicode-subscripts-and-superscripts-in-identifiers-why-does-python-consider-xu