Cannot properly display unicode string after parsing a file with lxml, works fine with simple file read

Question

I'm trying to use the lxml module to parse HTML files, but I'm struggling to get it to work with some UTF-8 encoded data. I'm using Python 2.7 on Windows. For example, consider a UTF-8 encoded file without a byte order mark that contains nothing but the text string Québec. If I read the contents of the file with a regular file handle and decode the resulting string object, I get a length 6 unicode string that looks good when written back to a file. But if I parse the file with lxml, I get a length 7 unicode string that looks odd when written back to a file. Can someone explain what lxml is doing differently and how to get the original, pretty string?

For example:

import lxml.html as html
from lxml import etree

f = open("output.txt", "w")

# Plain read: decode the raw bytes explicitly as UTF-8.
text = open("input.txt").read().decode("utf-8")
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))

# lxml parse: no encoding is passed to the parser here.
root = html.parse("input.txt")
text = root.xpath(".//p")[0].text.strip()
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))

f.close()

This produces the following in output.txt:

String of type '<type 'unicode'>' with length 6: Québec
String of type '<type 'unicode'>' with length 7: QuÃ©bec

EDIT

A partial workaround seems to be to pass an explicit parser when parsing the file, using either of the following:

etree.parse("input.txt", etree.HTMLParser(encoding="utf-8"))

html.parse("input.txt", etree.HTMLParser(encoding="utf-8"))

However, as far as I know the base etree library lacks some convenience classes for things like selectors, so a solution that allows me to use lxml.html without etree.HTMLParser() would still be useful.

Answer 1:

The function lxml.html.parse already uses an instance of lxml.html.HTMLParser, so you shouldn't really be averse to using

html.parse("input.txt", html.HTMLParser(encoding="utf-8"))

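to handle the UTF-8 data. lxml.html.HTMLParser is a subclass of etree.HTMLParser, so passing it explicitly keeps the lxml.html element classes and their convenience methods.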
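As for what is happening: the length 7 string in the question is consistent with the file's UTF-8 bytes being decoded as Latin-1 (ISO-8859-1), so that the two bytes of é each become a separate character. That the parser falls back to Latin-1 when the file declares no encoding is an assumption here, not something shown in the question. The symptom itself can be reproduced without lxml; a minimal sketch using only byte/unicode conversions (the variable names are purely illustrative):

```python
# Sketch only: reproduces the symptom with plain codecs, no lxml involved.

original = u"Qu\u00e9bec"         # the 6-character string from the question
raw = original.encode("utf-8")    # 7 bytes: the e-acute takes two bytes in UTF-8

# Assumed failure mode: the bytes are decoded as Latin-1 instead of UTF-8,
# so every byte turns into exactly one character.
garbled = raw.decode("latin-1")

# Decoding with the correct codec gives the expected text.
correct = raw.decode("utf-8")

# A string that is already garbled this way can be repaired by undoing
# the wrong decode and redoing it with the right codec.
repaired = garbled.encode("latin-1").decode("utf-8")

print(len(correct), len(garbled), len(repaired))
```

If that assumption holds, telling the parser the encoding up front, as above, avoids the wrong decode in the first place; the encode/decode round trip is only useful for strings that have already been mangled.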

Source: https://stackoverflow.com/questions/9284480/cannot-properly-display-unicode-string-after-parsing-a-file-with-lxml-works-fin

Tags

python

lxml