Python convert html ascii encoded text to utf8

Question

I have an XML file that I need to convert to UTF-8. Unfortunately, it contains HTML numeric character references like this:

&#047;mytext&#044;

I'm using the codecs module to convert files to UTF-8, but it doesn't decode HTML entities.

Is there an easy way to get rid of the HTML encoding?

Thanks

Answer 1:

You can pass the text of the file through an unescape function before passing it to the XML parser.
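One way such an unescape step could look, as a minimal sketch rather than a definitive implementation: the helper name `decode_numeric_refs` is made up for illustration, and it assumes Python 3 and that only numeric references (decimal or hexadecimal) need decoding. References to markup characters such as `<` and `&` are left alone so the result can still be handed to an XML parser.

```python
import re

# Matches decimal (&#047;) and hexadecimal (&#x2F;) character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")

# Characters that must stay escaped so the XML remains well-formed.
_MARKUP_CHARS = "<>&\"'"


def decode_numeric_refs(text):
    """Hypothetical helper: replace numeric character references with the
    characters they stand for, keeping markup-significant ones escaped."""
    def _replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point > 0x10FFFF:
            return match.group(0)  # not a valid code point, leave untouched
        char = chr(code_point)
        if char in _MARKUP_CHARS:
            return match.group(0)  # would break the XML if decoded
        return char

    return _NUMERIC_REF.sub(_replace, text)


print(decode_numeric_refs("&#047;mytext&#044;"))  # prints /mytext,
```

The decoded string can then be written out with `open(path, "w", encoding="utf-8")` or passed to the parser.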

Alternatively, if you're only parsing HTML, lxml's HTML parser does this for you:

>>> import lxml.html
>>> html = lxml.html.fromstring("<html><body><p>&#047;mytext&#044;</p></body></html>")
>>> lxml.html.tostring(html)
'<html><body><p>/mytext,</p></body></html>'

Answer 2:

Recently posted the below in response to a similar question:

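# Python 2. In Python 3 the module is html.parser, and HTMLParser.unescape()
# was removed in 3.9 in favor of html.unescape().
import HTMLParser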
h = HTMLParser.HTMLParser()
h.unescape('&#047;mytext&#044;')

Technically this method is "internal" and undocumented, but it's been in the API quite a while and isn't marked with a leading underscore.
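For Python 3.4 and later, the standard library has a documented counterpart, `html.unescape`. A short sketch, with the variable names chosen only for illustration:

```python
import html

escaped = "&#047;mytext&#044;"

# html.unescape handles both numeric references and named entities.
decoded = html.unescape(escaped)
print(decoded)  # prints /mytext,

# The result is an ordinary str, so it can be encoded as UTF-8 when needed.
utf8_bytes = decoded.encode("utf-8")
```

Unlike a numeric-only approach, this also turns `&lt;` and `&amp;` into `<` and `&`, so it is better applied to text content than to a whole XML document that still has to be parsed.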

Found it here; other approaches are also mentioned, of which BeautifulSoup is probably the best if you don't mind its "heaviness."

Source: https://stackoverflow.com/questions/9487133/python-convert-html-ascii-encoded-text-to-utf8

Tags

python

encoding

utf-8

ascii

html-entities
