How to convert unicode accented characters to pure ascii without accents?

小鲜肉 2020-12-13 21:58

I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t

The problem I'm having is that the original paragr

3 Answers
  • 2020-12-13 22:20

    How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert it into a standard a?

    Assuming you have loaded your Unicode text into a variable called my_unicode, normalizing à into a is this simple:

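    import unicodedata
    # Decompose (NFD), then drop every non-ASCII code point, including combining marks.
    # The result is a byte string (str in Python 2, bytes in Python 3).
    output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')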
    

    Explicit example...

    >>> import unicodedata
    >>> myfoo = u'àà'
    >>> myfoo
    u'\xe0\xe0'
    >>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
    'aa'
    

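    How it works
    unicodedata.normalize('NFD', my_unicode) performs a Canonical Decomposition (NFD) of the Unicode text, splitting à into the base letter a followed by the combining grave accent U+0300. Then .encode('ascii', 'ignore') keeps the ASCII base letters and silently drops everything that is not ASCII, including the combining marks. Characters that have no decomposition (ø or ß, for example) are dropped entirely.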
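To make the two steps visible, here is a minimal sketch in Python 3 (the answer itself targets Python 2). The helper names `to_ascii` and `strip_accents` are hypothetical, introduced only for illustration. The first mirrors the answer's one-liner and decodes the result back to `str`; the second is an alternative that removes only the combining marks and leaves undecomposable characters in place.

```python
import unicodedata


def to_ascii(text):
    """Same idea as the answer's one-liner, decoded back to a Python 3 str."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def strip_accents(text):
    """Remove only combining marks; other non-ASCII characters survive."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks such as U+0300
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# NFD splits the accented letter into base letter + combining mark
print([hex(ord(ch)) for ch in unicodedata.normalize("NFD", "à")])  # ['0x61', '0x300']
print(to_ascii("àà"))           # aa
print(to_ascii("søster"))       # sster   (ø has no decomposition, so it is dropped)
print(strip_accents("søster"))  # søster  (ø is kept)
```

Which variant fits depends on whether losing characters such as ø is acceptable for the downloaded text.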

  • 2020-12-13 22:20

    I needed something like this, but one that replaces only accented letters and leaves other special characters alone, so I wrote this small function:

    # -*- coding: utf-8 -*-
    # Python 2 code: relies on the built-in unicode type.
    import re
    
    def remove_accents(string):
        # Decode byte strings first so the patterns match whole characters
        if not isinstance(string, unicode):
            string = unicode(string, encoding='utf-8')
    
        # Only the lowercase letters listed here are replaced
        string = re.sub(u"[àáâãäå]", u'a', string)
        string = re.sub(u"[èéêë]", u'e', string)
        string = re.sub(u"[ìíîï]", u'i', string)
        string = re.sub(u"[òóôõö]", u'o', string)
        string = re.sub(u"[ùúûü]", u'u', string)
        string = re.sub(u"[ýÿ]", u'y', string)
    
        return string
    

    I like this function because you can customize it if you need to ignore other characters.
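The same customizable idea can be sketched in Python 3 with a single translation table instead of one regex pass per letter. This is an alternative technique (`str.translate`), not the answer's own code, and the names `ACCENT_GROUPS` and `ACCENT_TABLE` are hypothetical.

```python
# Hypothetical mapping: edit these groups to control exactly which characters change.
ACCENT_GROUPS = {
    "a": "àáâãäå",
    "e": "èéêë",
    "i": "ìíîï",
    "o": "òóôõö",
    "u": "ùúûü",
    "y": "ýÿ",
}

# One table for str.translate: code point of each accented letter -> base letter
ACCENT_TABLE = {
    ord(accented): base
    for base, group in ACCENT_GROUPS.items()
    for accented in group
}


def remove_accents(text):
    """Replace only the listed accented letters; everything else is untouched."""
    return text.translate(ACCENT_TABLE)


print(remove_accents("déjà vu, señor"))  # deja vu, señor  (ñ is not in the table)
```

As in the original function, uppercase letters and any character missing from the mapping pass through unchanged.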

  • 2020-12-13 22:30

    The given URL returns UTF-8, as the Content-Type header of the HTTP response clearly indicates:

    wget -S http://dictionary.reference.com/browse/apple?s=t
    --2013-01-02 08:43:40--  http://dictionary.reference.com/browse/apple?s=t
    Resolving dictionary.reference.com (dictionary.reference.com)... 23.14.94.26, 23.14.94.11
    Connecting to dictionary.reference.com (dictionary.reference.com)|23.14.94.26|:80... connected.
    HTTP request sent, awaiting response... 
      HTTP/1.1 200 OK
      Server: Apache
      Cache-Control: private
      Content-Type: text/html;charset=UTF-8
      Date: Wed, 02 Jan 2013 07:43:40 GMT
      Transfer-Encoding:  chunked
      Connection: keep-alive
      Connection: Transfer-Encoding
      Set-Cookie: sid=UOPlLC7t-zl20-k7; Domain=reference.com; Expires=Wed, 02-Jan-2013 08:13:40 GMT; Path=/
      Set-Cookie: cu.wz=0; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
      Set-Cookie: recsrch=apple; Domain=reference.com; Expires=Tue, 02-Apr-2013 07:43:40 GMT; Path=/
      Set-Cookie: dcc=*~*~*~*~*~*~*~*~; Domain=reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
      Set-Cookie: iv_dic=1-0; Domain=reference.com; Expires=Thu, 03-Jan-2013 07:43:40 GMT; Path=/
      Set-Cookie: accepting=1; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
      Set-Cookie: bid=UOPlLC7t-zlrHXne; Domain=reference.com; Expires=Fri, 02-Jan-2015 07:43:40 GMT; Path=/
    Length: unspecified [text/html]
    

    Inspecting the saved file in vim also shows that the data is correctly UTF-8 encoded. The same is true when fetching the URL with Python.
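A rough Python 3 sketch of acting on that header: read the charset from the Content-Type value and decode the raw bytes with it. The helper names are hypothetical, and a hard-coded header value and byte string stand in for a real response, so no network request is made here.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lowercase, or the default
    return msg.get_content_charset(default)


def decode_body(raw_bytes, content_type):
    """Decode response bytes using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(content_type))


# Header value as shown in the wget output; the body is a stand-in sample
content_type = "text/html;charset=UTF-8"
raw = "à la carte".encode("utf-8")

print(charset_from_content_type(content_type))  # utf-8
print(decode_body(raw, content_type))           # à la carte
```

Once the bytes are decoded to text this way, accented characters are ordinary Unicode characters and can be normalized or replaced as needed.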
