Two seemingly identical Unicode strings turn out to be different under repr(). How can I fix this?

南笙 2021-01-03 11:25

I have two lists of unicode strings, one containing words picked up from a text file, another containing a list of sound file names from a directory, stripped from their ext

2 answers

谎友^ (OP)

2021-01-03 11:59
Some Unicode characters can be represented in more than one way, as you've discovered: either as a single precomposed code point, or as a base code point followed by a combining code point. The character \u0300 is COMBINING GRAVE ACCENT, which adds a grave accent to the preceding character.
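To see the difference concretely, here is a small sketch that inspects the two spellings from the question. The variable names `precomposed` and `decomposed` are only illustrative.

```python
import unicodedata

precomposed = u'ch\xe0o'     # 'à' stored as the single code point U+00E0
decomposed = u'cha\u0300o'   # 'a' followed by U+0300 COMBINING GRAVE ACCENT

# The strings render the same but compare unequal and differ in length.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))

# Listing each code point by name shows where the two spellings diverge.
for ch in decomposed:
    print(repr(ch), unicodedata.name(ch))
```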

The process of converting strings to a common representation is called normalization. You can use the unicodedata module to do this:
```
import unicodedata

def n(s):
    # Normalize so precomposed and combining spellings compare equal.
    return unicodedata.normalize('NFKC', s)

>>> n(u'ch\xe0o') == n(u'cha\u0300o')
True
```
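Applied to the situation in the question, one possible approach is to normalize both lists before comparing them. The sketch below uses made-up sample data (`words`, `sound_names`) and a hypothetical helper `normalize_all`. It defaults to the `'NFC'` form as an assumption: NFC only unifies canonically equivalent spellings, whereas `'NFKC'` also folds compatibility characters such as ligatures. Either form handles the accent case here.

```python
import unicodedata

def normalize_all(strings, form='NFC'):
    """Return a new list with every string normalized to the given form."""
    return [unicodedata.normalize(form, s) for s in strings]

# Hypothetical sample data standing in for the two lists in the question.
words = [u'ch\xe0o', u'xin']            # words read from a text file
sound_names = [u'cha\u0300o', u'xin']   # file names with extensions removed

# Normalize once, then use a set for fast membership tests.
available = set(normalize_all(sound_names))
matched = [w for w in normalize_all(words) if w in available]

print(matched == normalize_all(words))
```

Normalizing at the point where the strings enter the program (when reading the text file and when listing the directory) keeps every later comparison simple. File names read from disk may come back in decomposed form on some platforms, which is one way a mismatch like this can arise.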