Unicode subscripts and superscripts in identifiers, why does Python consider XU == Xᵘ == Xᵤ?

刺人心 2020-12-31 02:06

Python allows unicode identifiers. I defined Xᵘ = 42, expecting XU and Xᵤ to result in a NameError. But in reality, whe

2 answers
  •  旧巷少年郎
    2020-12-31 02:49

    Python converts all identifiers to their NFKC normal form; from the Identifiers section of the reference documentation:

    All identifiers are converted into the normal form NFKC while parsing; comparison of identifiers is based on NFKC.

    The NFKC form of both the superscript and the subscript character is the lowercase u:

    >>> import unicodedata
    >>> unicodedata.normalize('NFKC', 'Xᵘ Xᵤ')
    'Xu Xu'
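
    To see why NFKC in particular folds these characters, it can help to look at their compatibility decompositions. The following is a minimal sketch using only the standard unicodedata module; the variable names are illustrative and not part of the original answer:

```python
import unicodedata

superscript_u = "\u1d58"  # ᵘ MODIFIER LETTER SMALL U
subscript_u = "\u1d64"    # ᵤ LATIN SUBSCRIPT SMALL LETTER U

# Both characters carry a *compatibility* decomposition to U+0075 (u).
for ch in (superscript_u, subscript_u):
    print(unicodedata.name(ch), "|", unicodedata.decomposition(ch))

# NFC applies canonical mappings only, so the superscript survives;
# NFKC also applies compatibility mappings, so it becomes a plain "u".
nfc = unicodedata.normalize("NFC", "X" + superscript_u)
nfkc = unicodedata.normalize("NFKC", "X" + superscript_u)
print(nfc == "Xu", nfkc == "Xu")
```

    Under a canonical-only form such as NFC the three spellings would stay distinct; they collapse only because the compatibility form is used.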
    

    So in the end, all you have is a single identifier, Xu:

    >>> import dis
    >>> dis.dis(compile('Xᵘ = 42\nprint((Xu, Xᵘ, Xᵤ))', '', 'exec'))
      1           0 LOAD_CONST               0 (42)
                  2 STORE_NAME               0 (Xu)
    
      2           4 LOAD_NAME                1 (print)
                  6 LOAD_NAME                0 (Xu)
                  8 LOAD_NAME                0 (Xu)
                 10 LOAD_NAME                0 (Xu)
                 12 BUILD_TUPLE              3
                 14 CALL_FUNCTION            1
                 16 POP_TOP
                 18 LOAD_CONST               1 (None)
                 20 RETURN_VALUE
    

    The disassembly above shows that the identifiers were normalised during compilation. This happens during parsing: identifiers are normalised when the AST (Abstract Syntax Tree) is created, and the compiler then produces bytecode from that tree.
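
    The same effect can be observed one step earlier, on the tree itself, without looking at bytecode. This is a small sketch assuming CPython's standard ast module; the source string is just an example:

```python
import ast

# Assign through the superscript spelling, read through the subscript one.
tree = ast.parse("Xᵘ = 42\nprint(Xᵤ)")

# Collect the identifier stored on every Name node in the tree.
names = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]

# Only the normalised spelling appears; the original spellings are gone.
print(sorted(set(names)))
```

    Because the original spelling is not kept on the Name nodes, tools that work from the AST see only Xu.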

    Identifiers are normalized to avoid many potential 'look-alike' bugs, where you could otherwise end up using both ﬁnd() (using the U+FB01 LATIN SMALL LIGATURE FI character followed by the ASCII nd characters) and find() and wonder why your code has a bug.
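
    The ligature case can be tried directly. The sketch below is illustrative only (the variable names and the assigned string are made up for the example); it defines a name using the U+FB01 spelling and reads it back using plain ASCII:

```python
import unicodedata

ligature_name = "\ufb01nd"  # U+FB01 LATIN SMALL LIGATURE FI + ASCII "nd"
ascii_name = "find"

# As ordinary strings, the two spellings are different.
print(ligature_name == ascii_name)

# After NFKC normalisation they are the same.
print(unicodedata.normalize("NFKC", ligature_name) == ascii_name)

# Source code that assigns to the ligature spelling binds the ASCII name.
namespace = {}
exec(ligature_name + " = 'defined with ligature'", namespace)
print(namespace[ascii_name])

# The normalisation applies to identifiers in source code only;
# the namespace dict holds just the normalised key.
print(ligature_name in namespace)
```

    Note that string-based lookups such as indexing globals() or calling getattr() are not normalised, so passing the un-normalised spelling as a string will not find the name.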
