lxml truncates text that contains 'less than' character

Question

>>> s = '<div> < 20 </div>'
>>> import lxml.html
>>> tree = lxml.html.fromstring(s)
>>> lxml.etree.tostring(tree)
'<div> </div>'

Does anybody know any workaround for this?

Answer 1:

Your HTML input is broken; that < left angle bracket should have been encoded as &lt; instead. From the lxml documentation on parsing broken HTML:

The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.

In other words, you take what you can get from such documents; the way lxml handles broken HTML is not otherwise configurable.

One thing you could try is a different HTML parser. Try BeautifulSoup instead; its broken-HTML handling may give you a version of the document that contains what you want. BeautifulSoup can use different parser backends, including lxml and html5lib, so it gives you more flexibility.

The html5lib parser does give you the < character (converted to a &lt; escape):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<div> < 20 </div>", "html5lib")
<html><head></head><body><div> &lt; 20 </div></body></html>
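
To illustrate why switching parsers can change the outcome, here is a minimal sketch that uses only Python's standard-library html.parser, which is the parser BeautifulSoup wraps when asked for its "html.parser" backend. The TextCollector class and collect_text function are hypothetical names introduced for this example. This is not how lxml or html5lib work internally; it only shows that a lenient tokenizer can report a stray < as text instead of dropping it.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: gathers every text chunk the parser reports."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags, including a stray "<" that
        # does not start a valid tag.
        self.chunks.append(data)


def collect_text(markup):
    """Return all text content of `markup`, concatenated."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


text = collect_text("<div> < 20 </div>")
print(repr(text))
```

With this parser the collected text should still contain both the < and the 20, unlike the lxml result in the question.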

Answer 2:

Your < should actually be &lt;, since < is a reserved character in HTML. Then it should work.
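
If you control how the markup is generated, one way to apply this is to escape the text before embedding it in a tag. The following is a minimal sketch using the standard library's html.escape; the raw_text and markup variable names are just for illustration, and it assumes the text is inserted as element content rather than as raw markup.

```python
from html import escape, unescape

raw_text = " < 20 "  # text that should appear inside the element

# escape() replaces &, < and > (and, by default, quotes) with entities.
markup = "<div>{}</div>".format(escape(raw_text))
print(markup)  # <div> &lt; 20 </div>

# Unescaping the entity recovers the original text.
recovered = unescape(escape(raw_text))
print(repr(recovered))
```

The resulting string is well-formed HTML, so passing it to lxml.html.fromstring as in the question should no longer lose the text.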

Source: https://stackoverflow.com/questions/14171035/lxml-truncates-text-that-contains-less-than-character

Tags

python

html-parsing

lxml