How to convert unicode accented characters to pure ascii without accents?

Anonymous (unverified), submitted 2019-12-03 02:05:01

Question:

I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t

The problem I'm having is that the original paragraph has all those squiggly lines, reversed letters, and such, so when I read the local files I end up with funny escape characters like \x85, \xa7, \x8d, etc.

My question is: is there any way I can convert all those escape characters into their respective UTF-8 characters? For example, if there is an 'à', how do I convert that into a standard 'a'?

Python calling code:

    import os
    word = 'apple'
    os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)

I'm using wget-1.11.4-1 on a Windows 7 system (don't kill me Linux people, it was a client requirement), and the wget exe is being fired off with a Python 2.6 script file.

Answer 1:

how do I convert all those escape characters into their respective characters, e.g. if there is a Unicode à, how do I convert that into a standard a?

Assume you have loaded your Unicode text into a variable called my_unicode. Normalizing à into a is this simple:

    import unicodedata
    output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')

Explicit example (Python 2):

    >>> import unicodedata
    >>> myfoo = u'àà'
    >>> myfoo
    u'\xe0\xe0'
    >>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
    'aa'

How it works
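unicodedata.normalize('NFD', my_unicode) performs a canonical decomposition (NFD) of the Unicode text, splitting each accented character into its base letter followed by one or more combining marks (à becomes a plus U+0300, the combining grave accent). Calling .encode('ascii', 'ignore') on the result then keeps the ASCII base letters and silently drops everything that cannot be encoded as ASCII, including the combining marks.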
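To make the two steps visible, here is a minimal sketch written for Python 3 (the answer above targets Python 2, where the u'' prefix is required). The helper name strip_accents is made up for illustration and is not part of the original answer.

```python
import unicodedata

def strip_accents(text):
    """Return an ASCII-only approximation of text (hypothetical helper)."""
    # Step 1: NFD splits each accented letter into base letter + combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    # Step 2: encoding to ASCII with errors='ignore' silently drops every
    # non-ASCII code point, which includes the combining marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')

# U+00E0 (a with grave) decomposes into U+0061 (a) + U+0300 (combining grave).
print([hex(ord(ch)) for ch in unicodedata.normalize('NFD', '\u00e0')])
print(strip_accents('\u00e0 la carte'))
```

One consequence of the 'ignore' handler is worth keeping in mind: characters that have no canonical decomposition, such as ø or ß, are not reduced to a base letter but dropped entirely.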



Answer 2:

I needed something like this, but to remove only the accents and leave other special characters alone, so I wrote this small function:

I like that function because you can customize it if you need to ignore other characters.
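The function itself did not survive in this copy of the answer. As a stand-in, and not a reconstruction of the author's code, here is one possible sketch of the idea in Python 3: strip only combining marks, leave every other character untouched, and allow a caller-supplied set of characters to be exempted. The name remove_accents and the keep parameter are assumptions made for this example.

```python
import unicodedata

def remove_accents(text, keep=''):
    """Strip accents but leave other non-ASCII characters alone.

    Characters listed in `keep` are passed through unchanged.
    (Hypothetical helper, not the original answer's function.)
    """
    # Compose first so that entries in `keep` match precomposed characters.
    text = unicodedata.normalize('NFC', text)
    result = []
    for ch in text:
        if ch in keep:
            result.append(ch)
            continue
        # Decompose this one character and drop nonspacing marks (category Mn).
        decomposed = unicodedata.normalize('NFD', ch)
        result.append(''.join(c for c in decomposed
                              if unicodedata.category(c) != 'Mn'))
    return ''.join(result)

print(remove_accents('\u00e0 \u20ac \u00f1'))                 # accents removed, euro sign kept
print(remove_accents('\u00e0 \u00f1', keep='\u00f1'))         # n with tilde preserved
```

Unlike the encode('ascii', 'ignore') approach, this keeps symbols such as the euro sign, because it filters by Unicode category instead of by encodability.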



Answer 3:

The given URL returns UTF-8, as the HTTP response clearly indicates:

    wget -S http://dictionary.reference.com/browse/apple?s=t
    --2013-01-02 08:43:40--  http://dictionary.reference.com/browse/apple?s=t
    Resolving dictionary.reference.com (dictionary.reference.com)... 23.14.94.26, 23.14.94.11
    Connecting to dictionary.reference.com (dictionary.reference.com)|23.14.94.26|:80... connected.
    HTTP request sent, awaiting response...
      HTTP/1.1 200 OK
      Server: Apache
      Cache-Control: private
      Content-Type: text/html;charset=UTF-8
      Date: Wed, 02 Jan 2013 07:43:40 GMT
      Transfer-Encoding:  chunked
      Connection: keep-alive
      Connection: Transfer-Encoding
      Set-Cookie: sid=UOPlLC7t-zl20-k7; Domain=reference.com; Expires=Wed, 02-Jan-2013 08:13:40 GMT; Path=/
      Set-Cookie: cu.wz=0; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
      Set-Cookie: recsrch=apple; Domain=reference.com; Expires=Tue, 02-Apr-2013 07:43:40 GMT; Path=/
      Set-Cookie: dcc=*~*~*~*~*~*~*~*~; Domain=reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
      Set-Cookie: iv_dic=1-0; Domain=reference.com; Expires=Thu, 03-Jan-2013 07:43:40 GMT; Path=/
      Set-Cookie: accepting=1; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
      Set-Cookie: bid=UOPlLC7t-zlrHXne; Domain=reference.com; Expires=Fri, 02-Jan-2015 07:43:40 GMT; Path=/
    Length: unspecified [text/html]

Investigating the saved file with vim also shows that the data is correctly UTF-8 encoded. The same is true when fetching the URL with Python.
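If the saved file really is UTF-8, escapes such as \xa7 or \x8d are most likely just the raw bytes of multi-byte sequences, seen before decoding. A minimal Python 3 sketch of decoding first and normalizing afterwards follows; the byte string is illustrative data, not taken from the actual page.

```python
import unicodedata

# Raw bytes as they would come from a file opened in binary mode.
# Illustrative sample: the UTF-8 encoding of a phrase with two accented letters.
raw = b'd\xc3\xa9j\xc3\xa0 vu'

# Decode the bytes to text first; only then do accents exist as characters.
text = raw.decode('utf-8')

# Then apply the NFD + ASCII step described in Answer 1.
ascii_text = (unicodedata.normalize('NFD', text)
              .encode('ascii', 'ignore')
              .decode('ascii'))
print(ascii_text)
```

For a file on disk, the decoding step can be handled at open time: open(path, encoding='utf-8') in Python 3, or io.open(path, encoding='utf-8') in Python 2.6 and later.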


