Two seemingly identical unicode strings turn out to be different when using repr(), but how can I fix this?

南笙 2021-01-03 11:25

I have two lists of unicode strings, one containing words picked up from a text file, another containing a list of sound file names from a directory, stripped from their ext

2 answers
  •  谎友^ (original poster)
    2021-01-03 11:59

    Some Unicode characters can be written in more than one way, as you've discovered: either as a single precomposed code point, or as a base code point followed by a combining code point. The character \u0300 is COMBINING GRAVE ACCENT, which adds a grave accent to the preceding character.
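To see the difference concretely, here is a small sketch that lists the code points in each spelling. The variable names and the `describe` helper are made up for illustration; only `unicodedata.name` comes from the standard library.

```python
import unicodedata

# Two spellings that render identically (names are illustrative only).
composed = "ch\xe0o"        # 'à' as the single code point U+00E0
decomposed = "cha\u0300o"   # 'a' followed by the combining code point U+0300


def describe(text):
    """Return the Unicode name of each code point in text."""
    return [unicodedata.name(ch) for ch in text]


print(composed == decomposed)           # the raw strings are not equal
print(len(composed), len(decomposed))   # they differ in length by one
print(describe(composed))
print(describe(decomposed))
```

The length difference is what repr() is exposing: the second string carries one extra code point for the accent.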

    The process of fixing a string to a common representation is called normalization. You can use the unicodedata module to do this:

    import unicodedata

    def n(text):
        # Normalize to NFKC so that equivalent sequences compare equal
        return unicodedata.normalize('NFKC', text)
    
    >>> n(u'ch\xe0o') == n(u'cha\u0300o')
    True
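Applied to the question's setup (a list of words and a list of file names without extensions), the same idea might look like the sketch below. `match_words`, `normalize`, and the sample data are hypothetical, not taken from the question. The default form follows the NFKC used above; NFKC also folds compatibility characters such as ligatures, so NFC may be the better choice if only composed/decomposed differences should be ignored.

```python
import unicodedata


def normalize(text, form="NFKC"):
    """Convert text to a single canonical representation."""
    return unicodedata.normalize(form, text)


def match_words(words, file_stems):
    """Return the words that have a matching file stem after normalization."""
    # Normalize the stems once and keep them in a set for fast lookup.
    stems = {normalize(stem) for stem in file_stems}
    return [word for word in words if normalize(word) in stems]


# Hypothetical sample data: same word, two different encodings of 'à'.
words = ["ch\xe0o", "xin"]
file_stems = ["cha\u0300o", "other"]

matches = match_words(words, file_stems)
print(matches)
```

Normalizing both sides at comparison time leaves the original strings untouched, so the file names can still be used as-is to open the files.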
    
