Two seemingly identical unicode strings turn out to be different when using repr(), but how can I fix this?

南笙 2021-01-03 11:25

I have two lists of unicode strings, one containing words picked up from a text file, another containing a list of sound file names from a directory, stripped from their ext

2 answers
  •  醉酒成梦
    2021-01-03 11:52

    The problem is that Unicode has two ways to represent an 'a' with a grave accent. One is the single precomposed code point LATIN SMALL LETTER A WITH GRAVE (U+00E0). The other is a plain 'a' followed by COMBINING GRAVE ACCENT (U+0300), which renders as more or less the exact same character. The two strings look identical on screen but contain different code points, so they compare as unequal and repr() shows the difference. Unicode has a term for this: Unicode equivalence.

    To handle this in Python, call unicodedata.normalize on both strings before comparing them. I tried the 'NFC' form, which returns u'ch\xe0o' for both strings.
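
    As a minimal sketch of that approach: the two sample strings and the helper name nfc below are made up for illustration (they are not taken from the question), and the only library call used is unicodedata.normalize from the standard library.

```python
import unicodedata

# Hypothetical sample data: the same word spelled two ways.
precomposed = u'ch\xe0o'     # 'a with grave' as the single code point U+00E0
decomposed = u'cha\u0300o'   # plain 'a' followed by COMBINING GRAVE ACCENT U+0300


def nfc(text):
    """Return text in NFC form (combining sequences composed where possible)."""
    return unicodedata.normalize('NFC', text)


# The raw strings differ, even though they render the same.
print(precomposed == decomposed)             # False
print(len(precomposed), len(decomposed))     # 4 5

# After normalizing both sides, they compare equal.
print(nfc(precomposed) == nfc(decomposed))   # True
```

    If the goal is to match words against file names, one option is to normalize every entry of both lists once, up front, and then compare or look up the normalized values.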
