Why can't lxml.html parse all the div elements in target.html?

不羁的心 submitted on 2020-01-06 02:46:13

Question


Please download the file from Dropbox and save it as /tmp/target.html.

target.html

Open it in Firefox with Firebug to inspect the HTML structure.

It is clear that there are at least 10 div elements in target.html. Now parse all the div elements in target.html with lxml.html.

python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> divs=doc.xpath("//div")
>>> len(divs)
4

The result is 4. Why are so many divs missed by the code above?
There are at least 10 divs in target.html. The same thing happens when parsing the tables in target.html.
There are at least 9 tables in target.html; please check with Firebug.

python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> tables=doc.xpath("//table")
>>> len(tables)
3
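
One way to cross-check the counts without relying on Firebug or on lxml's tree is to count start tags in the raw markup. The following is a minimal sketch using only the standard library's html.parser; the names TagCounter, count_start_tags and count_start_tags_in_file are hypothetical helpers, not part of the original question, and the UTF-8 encoding with errors="replace" is an assumption about the file. It counts tags as written in the source, so it will not include elements that scripts add in the browser.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags as they appear in the markup, without building a tree."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.counts[tag] += 1


def count_start_tags(markup):
    """Return a Counter mapping tag name to number of start tags in the string."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


def count_start_tags_in_file(path):
    """Hypothetical convenience wrapper; the encoding handling is an assumption."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return count_start_tags(f.read())
```

Calling count_start_tags_in_file("/tmp/target.html")["div"] gives a baseline to compare against len(divs). If that number is well above 4, the markup does contain the elements and the gap likely comes from how the parser recovers from malformed HTML.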

Answer 1:


Thanks to sideshowbarker.

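First, install html5lib with pip:

sudo pip3 install html5lib

html5lib implements the HTML5 parsing algorithm that browsers follow, so on malformed markup its tree tends to be closer to what Firebug shows than the one built by lxml.html's libxml2-based parser. Then parse the file with html5lib, using its lxml tree builder so that xpath still works: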

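import html5lib

# Build an lxml tree with html5lib; namespaceHTMLElements=False keeps
# plain tag names so '//div' matches without a namespace prefix.
with open('/tmp/target.html', 'rb') as f:
    doc = html5lib.parse(f, treebuilder='lxml', namespaceHTMLElements=False)

divs = doc.xpath('//div')
tables = doc.xpath('//table')
print(len(divs))
print(len(tables))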


Source: https://stackoverflow.com/questions/51586527/why-cant-parse-all-div-elements-in-the-target-html-with-lxml-html
