Normalizing unicode text to filenames, etc. in Python

前端 未结 5 1082
情书的邮戳
情书的邮戳 2021-02-01 05:42

Are there any standalonenish solutions for normalizing international unicode text to safe ids and filenames in Python?

E.g. turn My International Text: åäö

5条回答
  •  野性不改
    2021-02-01 06:39

    The following will remove accents from whatever characters Unicode can decompose into combining pairs, discard any weird characters it can't, and nuke whitespace:

    # encoding: utf-8
    from unicodedata import normalize
    import re
    
    original = u'ľ š č ť ž ý á í é'
    decomposed = normalize("NFKD", original)
    no_accent = ''.join(c for c in decomposed if ord(c)<0x7f)
    no_spaces = re.sub(r'\s', '_', no_accent)
    
    print no_spaces
    # output: l_s_c_t_z_y_a_i_e
    

    It doesn't try to get rid of characters disallowed on filesystems, but you can steal DANGEROUS_CHARS_REGEX from the file you linked for that.

提交回复
热议问题