Normalizing unicode text to filenames, etc. in Python

情书的邮戳 2021-02-01 05:42

Are there any standalone-ish solutions for normalizing international Unicode text to safe IDs and filenames in Python?

E.g. turn My International Text: åäö

5 answers

野性不改 (OP)

2021-02-01 06:39
The following removes accents from any character that Unicode can decompose into a base character plus combining marks, discards any remaining non-ASCII characters it can't decompose, and replaces whitespace with underscores:
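```
# encoding: utf-8
import re
from unicodedata import normalize

original = u'ľ š č ť ž ý á í é'
# NFKD splits each accented letter into a base letter plus combining marks.
decomposed = normalize("NFKD", original)
# Keep ASCII only, which drops the combining marks and anything undecomposable.
no_accent = ''.join(c for c in decomposed if ord(c) < 0x7f)
# Replace each whitespace character with an underscore.
no_spaces = re.sub(r'\s', '_', no_accent)

print(no_spaces)
# output: l_s_c_t_z_y_a_i_e
```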
It doesn't try to get rid of characters disallowed on filesystems, but you can steal DANGEROUS_CHARS_REGEX from the file you linked for that.
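
If you'd rather not dig out that regex, here is a rough sketch of the same steps wrapped in a function, with a whitelist filter added at the end. Note that `to_safe_name` and `UNSAFE_CHARS` are names I made up for illustration; `UNSAFE_CHARS` is only a stand-in for `DANGEROUS_CHARS_REGEX`, whose actual pattern may well differ, and the whitelist (letters, digits, `.`, `_`, `-`) is an assumption about what counts as "safe".

```python
import re
from unicodedata import normalize

# Hypothetical stand-in for DANGEROUS_CHARS_REGEX; the real pattern may differ.
# Assumption: only ASCII letters, digits, '.', '_' and '-' are considered safe.
UNSAFE_CHARS = re.compile(r'[^A-Za-z0-9._-]+')


def to_safe_name(text):
    """Return an ASCII-only name built from `text` (illustrative helper)."""
    # Split accented letters into base letter + combining marks.
    decomposed = normalize("NFKD", text)
    # Drop everything that is not ASCII (combining marks included).
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Collapse each run of whitespace into a single underscore.
    underscored = re.sub(r'\s+', '_', ascii_only.strip())
    # Remove whatever is left outside the whitelist.
    return UNSAFE_CHARS.sub('', underscored)


print(to_safe_name(u'My International Text: åäö'))
# output: My_International_Text_aao
```

Characters with no ASCII decomposition (e.g. `ß` or CJK text) simply vanish, so an input made up entirely of such characters comes back as an empty string; you would probably want a fallback for that case.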