Question
I'm trying to use the lxml module to parse HTML files, but I'm struggling to get it to work with some UTF-8 encoded data. I'm using Python 2.7 on Windows. For example, consider a UTF-8 encoded file without a byte order mark that contains nothing but the text string Québec. If I read the contents of the file with a regular file handle and decode the resulting string object, I get a length-6 unicode string that looks good when written back to a file. But if I parse the file with lxml, I get a length-7 unicode string that looks odd when written back to a file. Can someone explain what lxml is doing differently and how to get the original, correct string?
For example:
import lxml.html as html
from lxml import etree
f = open("output.txt", "w")
text = open("input.txt").read().decode("utf-8")
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))
root = html.parse("input.txt")
text = root.xpath(".//p")[0].text.strip()
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))
This produces the following output in output.txt:
String of type '<type 'unicode'>' with length 6: Québec
String of type '<type 'unicode'>' with length 7: QuÃ©bec
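The extra character in the second line is consistent with the two UTF-8 bytes of "é" each being decoded as a separate character. The sketch below reproduces that effect without lxml. It assumes the wrong codec was Latin-1 (ISO-8859-1), which the thread does not confirm, and the variable names are illustrative only.

```python
# Sketch only: assumes the parser treated the UTF-8 bytes as Latin-1.
raw = b"Qu\xc3\xa9bec"        # the UTF-8 bytes for the text in input.txt

good = raw.decode("utf-8")    # decoded with the correct codec
bad = raw.decode("latin-1")   # decoded with the wrong codec: one character per byte

print(len(good))  # 6
print(len(bad))   # 7

# If the damage is exactly this mix-up, it can be undone by reversing the
# wrong decode and then applying the right one.
repaired = bad.encode("latin-1").decode("utf-8")
```

Repairing the string after the fact only works when the wrong codec is known, so telling the parser the encoding up front is usually the safer route.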
EDIT
A partial workaround here seems to be to parse the file using:
etree.parse("input.txt", etree.HTMLParser(encoding="utf-8"))
or
html.parse("input.txt", etree.HTMLParser(encoding="utf-8"))
However, as far as I know the base etree library lacks some convenience classes for things like selectors, so a solution that allows me to use lxml.html without etree.HTMLParser() would still be useful.
Answer 1:
The function lxml.html.parse already uses an instance of lxml.html.HTMLParser, so you shouldn't really be averse to using
html.parse("input.txt", html.HTMLParser(encoding="utf-8"))
to handle the UTF-8 data.
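As a minimal sketch of this approach, the snippet below parses UTF-8 bytes with an explicitly configured lxml.html.HTMLParser and checks that the result is still an lxml.html element. The in-memory `data` value is a hypothetical stand-in for input.txt, with a `<p>` tag added so the XPath lookup does not depend on how the parser wraps bare text.

```python
import io

import lxml.html as html  # third-party package: lxml

# Hypothetical stand-in for input.txt: UTF-8 bytes, no BOM, no charset declaration.
data = b"<p>Qu\xc3\xa9bec</p>"

# An explicit encoding means the parser does not have to guess.
parser = html.HTMLParser(encoding="utf-8")
tree = html.parse(io.BytesIO(data), parser)

# The parser fills in the missing <html> and <body> elements around the <p>.
p = tree.getroot().xpath(".//p")[0]
text = p.text.strip()

print(type(p).__name__, len(text))
```

Because the parser comes from lxml.html rather than lxml.etree, the returned elements should keep the lxml.html convenience methods, such as text_content().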
Source: https://stackoverflow.com/questions/9284480/cannot-properly-display-unicode-string-after-parsing-a-file-with-lxml-works-fin