What is the best way to remove accents (normalize) in a Python unicode string?

感情败类 2020-11-21 06:11

I have a Unicode string in Python, and I would like to remove all the accents (diacritics).

I found on the web an elegant way to do this (in Java):

  1. conve
8 Answers
  •  梦毁少年i
    2020-11-21 06:46

    How about this:

    import unicodedata

    def strip_accents(s):
        # Decompose to NFD, then drop the nonspacing marks (category 'Mn').
        return ''.join(c for c in unicodedata.normalize('NFD', s)
                       if unicodedata.category(c) != 'Mn')
    

    This works on Greek letters, too:

    >>> strip_accents(u"A \u00c0 \u0394 \u038E")
    u'A A \u0394 \u03a5'
    >>> 
    

    The character category "Mn" stands for Nonspacing_Mark, which is similar to unicodedata.combining in MiniQuark's answer (I didn't think of unicodedata.combining, but it is probably the better solution, because it's more explicit).
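
For comparison, here is a rough sketch of what the unicodedata.combining variant could look like. The function name strip_accents_combining is made up for illustration and is not taken from MiniQuark's answer. unicodedata.combining(c) returns the canonical combining class of c as an integer, and 0 means the character is not a combining character in that sense.

```python
import unicodedata

def strip_accents_combining(s):
    # Hypothetical helper name, used only for this illustration.
    # Decompose to NFD so base letters and accents become separate code points.
    decomposed = unicodedata.normalize('NFD', s)
    # Keep only characters whose canonical combining class is 0.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents_combining("A \u00c0 \u0394 \u038E"))
```

For Latin and Greek diacritics the two filters should give the same result, but the category test and the combining-class test are different Unicode properties, so I would not assume they agree on every code point.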

    Keep in mind that these manipulations may significantly alter the meaning of the text. Accents, umlauts, etc. are not "decoration".
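
A small sketch of that caveat, reusing the strip_accents function from this answer (the sample strings are my own). In German, "schön" (beautiful) turns into "schon" (already), so the result is a different word. Letters such as "Ø" and "ß" have no canonical decomposition, so this approach leaves them as they are.

```python
import unicodedata

def strip_accents(s):
    # Same approach as above: NFD decomposition, then drop 'Mn' characters.
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

# The umlaut is removed, which changes the word itself.
print(strip_accents("sch\u00f6n"))

# These letters are not base letter + combining mark, so nothing is stripped.
print(strip_accents("\u00d8resund stra\u00dfe"))
```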
