Question
I have an XML file that I need to convert to UTF-8. Unfortunately, the text contains HTML entities like this:
&#047;mytext&#044;
I'm using the codecs module to convert files to UTF-8, but it doesn't decode HTML entities.
Is there an easy way to get rid of the HTML encoding?
Thanks
Answer 1:
You can pass the text of the file through an unescape function before passing it to the XML parser.
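As a rough sketch of that idea in Python 3: the standard library's html.unescape and the html.entities.html5 table can decode the references. One caveat is that decoding everything would also turn &amp; and &lt; into raw characters and could break the XML, so this sketch leaves those alone. The helper name unescape_for_xml and the sample document are made up for illustration.

```python
import html
import re
import xml.etree.ElementTree as ET
from html.entities import html5

# Matches numeric (&#047; or &#x2F;) and named (&eacute;) references.
_ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")
# Characters that must stay escaped for the XML to remain well-formed.
_XML_SPECIAL = set("<>&\"'")


def unescape_for_xml(text):
    """Decode HTML references, keeping XML-significant ones escaped (hypothetical helper)."""
    def repl(match):
        ref, body = match.group(0), match.group(1)
        if body.startswith("#"):
            decoded = html.unescape(ref)          # numeric reference
        else:
            decoded = html5.get(body + ";", ref)  # named reference; unknown names stay as-is
        if _XML_SPECIAL & set(decoded):
            return ref                            # e.g. &amp; or &#60; stays escaped
        return decoded
    return _ENTITY.sub(repl, text)


# Illustrative input, not taken from the original question's file.
raw = "<doc><p>&#047;mytext&#044; caf&eacute; &amp; more</p></doc>"
cleaned = unescape_for_xml(raw)
root = ET.fromstring(cleaned)
utf8_bytes = ET.tostring(root, encoding="utf-8")
```

After this, cleaned can be handed to any XML parser, and utf8_bytes holds the UTF-8 encoded document.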
Alternatively, if you're only parsing HTML, lxml's HTML parser does this for you:
>>> import lxml.html
>>> html = lxml.html.fromstring("<html><body><p>&#047;mytext&#044;</p></body></html>")
>>> lxml.html.tostring(html)
'<html><body><p>/mytext,</p></body></html>'
Answer 2:
Recently posted the below in response to a similar question:
import HTMLParser # html.parser in Python 3
h = HTMLParser.HTMLParser()
h.unescape('&#047;mytext&#044;')
Technically this method is "internal" and undocumented, but it's been in the API quite a while and isn't marked with a leading underscore.
Found it here; other approaches are also mentioned, of which BeautifulSoup is probably the best if you don't mind its "heaviness."
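For readers on current Python 3: HTMLParser.unescape was deprecated in 3.4 and removed in 3.9, and the documented replacement is html.unescape. A minimal sketch, with the variable names chosen only for illustration:

```python
import html

# html.unescape decodes both numeric and named character references.
decoded = html.unescape("&#047;mytext&#044;")

# Encode the resulting str to UTF-8 bytes for writing to a file.
utf8_bytes = decoded.encode("utf-8")
print(decoded)
```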
Source: https://stackoverflow.com/questions/9487133/python-convert-html-ascii-encoded-text-to-utf8