Are there any standalone-ish solutions for normalizing international Unicode text to safe IDs and filenames in Python?
E.g. turn My International Text: åäö into something like my-international-text-aao.
I'll throw my own (partial) solution here too:
import unicodedata

def deaccent(some_unicode_string):
    # Decompose characters (NFD), then drop all combining marks (category 'Mn')
    return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                    if unicodedata.category(c) != 'Mn')
This does not do all you want, but it wraps a few nice tricks up in a convenience method: unicodedata.normalize('NFD', some_unicode_string) performs a canonical decomposition of Unicode characters; for example, it breaks 'ä' into the two codepoints U+0061 and U+0308.
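You can see the decomposition for yourself with a quick check (a small illustrative snippet, not part of the solution above):

import unicodedata

for c in unicodedata.normalize('NFD', u'ä'):
    print('U+%04X %s' % (ord(c), unicodedata.name(c)))
# U+0061 LATIN SMALL LETTER A
# U+0308 COMBINING DIAERESIS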
The other method, unicodedata.category(char), returns the Unicode character category for that particular char. Category 'Mn' (Mark, nonspacing) contains all combining accents, so deaccent removes all accents from the words.
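A quick check of the categories involved (again, just illustrative):

import unicodedata

print(unicodedata.category(u'a'))       # 'Ll' (Letter, lowercase)
print(unicodedata.category(u'\u0308'))  # 'Mn' (Mark, nonspacing)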
But note that this is just a partial solution: it only gets rid of accents. You still need some sort of whitelist of the characters you want to allow after this.
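For completeness, here is a minimal sketch of what that whitelist step might look like on top of deaccent; the slugify name and the ASCII-only [a-z0-9] whitelist are my own assumptions, not a standard API:

import re
import unicodedata

def deaccent(some_unicode_string):
    return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
                    if unicodedata.category(c) != 'Mn')

def slugify(text):
    # Assumed whitelist: keep lowercase ASCII letters and digits,
    # collapse every other run of characters into a single hyphen.
    text = deaccent(text).lower()
    return re.sub(r'[^a-z0-9]+', '-', text).strip('-')

print(slugify(u'My International Text: åäö'))  # my-international-text-aao

Anything outside the whitelist (including characters that don't decompose to ASCII at all, like CJK) simply collapses into hyphens here; if you need transliteration rather than removal, that's a separate problem.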