Why doesn't unicodedata recognise certain characters?

陌清茗 2020-12-30 04:33

In Python 2 at least, unicodedata.name() doesn't recognise certain characters.

ActivePython 2.7.0.2 (ActiveState Software Inc.) based on
Python         


        
1 Answer
  • 2020-12-30 05:07

    The unicodedata.name() lookup relies on column 2 (the Name field) of the UnicodeData.txt database in the Unicode standard (Python 2.7 uses Unicode 5.2.0).

    If that name starts with <, it is ignored. All control codes, including newlines, are in that category; their name column holds nothing but the placeholder <control>:

    000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;
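
A minimal sketch of how that record breaks down when split on semicolons. The variable names are illustrative only, the indices count from zero, and the startswith check mirrors the rule described above rather than CPython's actual build script:

```python
# The UnicodeData.txt record for U+000A, copied from the line above.
record = "000A;<control>;Cc;0;B;;;;;N;LINE FEED (LF);;;;"

fields = record.split(";")

code_point = int(fields[0], 16)   # hexadecimal code point
name = fields[1]                  # Name field (column 2 when counting from 1)
category = fields[2]              # general category, Cc = control
old_name = fields[10]             # Unicode 1.0 name field

# Names that start with "<" are placeholders, not unique character names.
has_usable_name = not name.startswith("<")

print(code_point, name, category, old_name, has_usable_name)
```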
    

    Column 11 (field 10 when counting from zero) is the old Unicode 1.0 name and should not be used, according to the standard. In other words, \n has no name other than the generic <control>, which the Python database ignores (as it is not unique).
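
The effect can be seen directly from the interpreter. This is a small sketch written for Python 3 (in Python 2 the arguments would need to be unicode literals such as u"\n"); safe_name is a hypothetical helper, not part of the unicodedata module:

```python
import unicodedata

def safe_name(ch, default="<no name>"):
    """Return the character's name, or a default if it has none."""
    # unicodedata.name() accepts an optional default instead of raising.
    return unicodedata.name(ch, default)

# Without a default, a nameless character raises ValueError.
try:
    unicodedata.name("\n")
    newline_has_name = True
except ValueError:
    newline_has_name = False

print(newline_has_name)
print(safe_name("\n"))
print(safe_name("a"))
```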

    Python 3.3 added support for NameAliases.txt, which lets you look up characters by alias, so lookup('LINE FEED'), lookup('new line'), lookup('eol') and so on all return \n. However, the unicodedata.name() function does not support aliases, nor could it (which one would it pick?). The Python 3.3 release notes say:

    • Added support for Unicode name aliases and named sequences. Both unicodedata.lookup() and '\N{...}' now resolve name aliases, and unicodedata.lookup() resolves named sequences too.
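
A short sketch of the alias lookup, assuming Python 3.3 or later. The list of spellings is just a sample chosen for illustration, and the reverse direction still yields no name:

```python
import unicodedata

# A few alias spellings for U+000A; lookup() is case-insensitive.
aliases = ["LINE FEED", "new line", "END OF LINE"]

# Each alias resolves to the same character.
resolved = {alias: unicodedata.lookup(alias) for alias in aliases}

# name() does not consult aliases, so the default (None) comes back.
reverse = unicodedata.name("\n", None)

print(resolved)
print(reverse)
```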

    TL;DR: LINE FEED is not the official name for \n, it is but an alias for it. Python 3.3 and up let you look up characters by alias.
