lxml truncates text that contains 'less than' character

Submitted by 混江龙づ霸主 on 2019-12-18 06:21:10

Question


>>> s = '<div> < 20 </div>'
>>> import lxml.html
>>> tree = lxml.html.fromstring(s)
>>> lxml.etree.tostring(tree)
'<div> </div>'

Does anybody know any workaround for this?


Answer 1:


Your HTML input is broken; that < left angle bracket should have been encoded to &lt; instead. From the lxml documentation on parsing broken HTML:

The support for parsing broken HTML depends entirely on libxml2's recovery algorithm. It is not the fault of lxml if you find documents that are so heavily broken that the parser cannot handle them. There is also no guarantee that the resulting tree will contain all data from the original document. The parser may have to drop seriously broken parts when struggling to keep parsing. Especially misplaced meta tags can suffer from this, which may lead to encoding problems.

In other words, you take what you can get from such documents; the way lxml handles broken HTML is not otherwise configurable.

One thing you could try is a different HTML parser. Try BeautifulSoup instead; its broken-HTML handling may give you a version of the document that contains what you want. BeautifulSoup can use different parser backends, including lxml and html5lib, so it gives you more flexibility.

The html5lib parser does give you the < character (converted to a &lt; escape):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<div> < 20 </div>", "html5lib")
<html><head></head><body><div> &lt; 20 </div></body></html>
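
If changing the parser is not an option, a rough alternative is to escape stray < characters before handing the markup to lxml. The following is only a sketch under stated assumptions: the helper name escape_stray_lt is hypothetical, and the regular expression is a heuristic that treats any < not followed by a letter, "/", "!" or "?" as text. It will misjudge input such as "a <b", where the text after < happens to look like a tag.

```python
import re

# Heuristic (an assumption, not part of lxml): a "<" that is not followed by
# a letter, "/", "!" or "?" cannot start a tag, a closing tag, a
# comment/doctype or a processing instruction, so treat it as text.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(markup: str) -> str:
    """Replace stray "<" characters with the &lt; entity (hypothetical helper)."""
    return _STRAY_LT.sub("&lt;", markup)


fixed = escape_stray_lt("<div> < 20 </div>")
print(fixed)  # <div> &lt; 20 </div>
```

The repaired string can then be passed to lxml.html.fromstring as in the question, and the text should no longer be dropped.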



Answer 2:


Your < should actually be &lt;, since < is sort of a 'reserved character' in HTML. Then it should work.
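
If you build the HTML yourself, the encoding can be done with the standard library's html.escape before the text goes into the markup. A minimal sketch, assuming the text arrives as a plain string (the variable names value and markup are made up for illustration):

```python
from html import escape

# Text that should appear inside the element, including a literal "<".
value = " < 20 "

# escape() turns "<" into "&lt;" (and also handles "&", ">" and quotes),
# so the result is well-formed markup.
markup = "<div>" + escape(value) + "</div>"
print(markup)  # <div> &lt; 20 </div>
```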



Source: https://stackoverflow.com/questions/14171035/lxml-truncates-text-that-contains-less-than-character
