Cannot properly display unicode string after parsing a file with lxml, works fine with simple file read

Posted by 大城市里の小女人 on 2019-12-25 08:20:04

Question


I'm trying to use the lxml module to parse HTML files, but I'm struggling to get it to work with some UTF-8 encoded data. I'm using Python 2.7 on Windows. For example, consider a UTF-8 encoded file, without a byte order mark, that contains nothing but the text string Québec. If I read the contents of the file with a regular file handle and decode the resulting string object, I get a length 6 unicode string that looks good when written back to a file. But if I parse the file with lxml, I get a length 7 unicode string that looks odd when written back to a file. Can someone explain what lxml is doing differently, and how to get the original, pretty string?

For example:

import lxml.html as html
from lxml import etree

f = open("output.txt", "w")

# Plain read: decode the raw bytes as UTF-8 explicitly.
text = open("input.txt").read().decode("utf-8")
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))

# lxml parse: no encoding is given here, so the parser has to guess one.
root = html.parse("input.txt")
text = root.xpath(".//p")[0].text.strip()
f.write("String of type '%s' with length %d: %s\n" % (type(text), len(text), text.encode("utf-8")))

f.close()

Produces output in output.txt of:

String of type '<type 'unicode'>' with length 6: Québec
String of type '<type 'unicode'>' with length 7: QuÃ©bec
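
The length difference is consistent with the file's UTF-8 bytes being decoded as Latin-1 (ISO-8859-1) instead of UTF-8: the two bytes that encode é then become two separate characters. The following is a minimal sketch of that effect using only built-in string methods. It does not call lxml, and the variable names are illustrative. It also shows that, under this assumption, an already mangled string can be repaired by reversing the wrong decode.

```python
from __future__ import print_function

# "Québec" encoded as UTF-8: the é takes two bytes (0xC3 0xA9).
raw = u"Qu\u00e9bec".encode("utf-8")

good = raw.decode("utf-8")       # correct decode
mangled = raw.decode("latin-1")  # wrong decode: each byte becomes one character

print(len(raw), len(good), len(mangled))  # 7 6 7

# Assuming the damage really was a Latin-1 decode, it can be undone by
# re-encoding as Latin-1 and decoding as UTF-8.
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired == good)  # True
```

This round trip is only a repair for strings that were mangled in exactly this way. Telling the parser the correct encoding up front avoids the problem altogether.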

EDIT

A partial workaround here seems to be to parse the file using:

etree.parse("input.txt", etree.HTMLParser(encoding="utf-8"))

or

html.parse("input.txt", etree.HTMLParser(encoding="utf-8"))

However, as far as I know the base etree library lacks some convenience classes for things like selectors, so a solution that allows me to use lxml.html without etree.HTMLParser() would still be useful.


Answer 1:


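The function lxml.html.parse already uses an instance of lxml.html.HTMLParser by default, so there is no reason to avoid passing one explicitly. Use

html.parse("input.txt", html.HTMLParser(encoding="utf-8"))

to handle the UTF-8 data. Because this is the lxml.html parser rather than the plain etree one, the parsed elements keep the lxml.html convenience methods.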
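
As a rough illustration of this approach, the sketch below parses an in-memory byte string rather than input.txt so that it is self-contained. The <p> wrapper and the variable names are assumptions made for the example. It uses lxml.html.document_fromstring with an explicit lxml.html.HTMLParser(encoding="utf-8") and requires lxml to be installed.

```python
from __future__ import print_function

import lxml.html as html

# UTF-8 bytes standing in for the contents of input.txt (hypothetical sample).
data = u"<p>Qu\u00e9bec</p>".encode("utf-8")

# Tell the parser the encoding explicitly instead of letting it guess.
parser = html.HTMLParser(encoding="utf-8")
root = html.document_fromstring(data, parser=parser)

p = root.xpath(".//p")[0]
text = p.text.strip()

print(type(p).__name__, len(text))  # HtmlElement 6
print(text == u"Qu\u00e9bec")       # True
```

The element returned is an lxml.html HtmlElement, so helpers such as text_content() remain available after parsing this way.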



Source: https://stackoverflow.com/questions/9284480/cannot-properly-display-unicode-string-after-parsing-a-file-with-lxml-works-fin
