Question
Please download the file target.html from Dropbox and save it as /tmp/target.html.
Open it in Firefox with Firebug to inspect the HTML structure.
It is clear that there are at least 10 div elements in target.html.
Now parse all div elements in target.html with lxml.html:
python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> divs=doc.xpath("//div")
>>> len(divs)
4
The result is 4. Why can't the remaining divs be found with the code above, when there are at least 10 divs in target.html?
The same thing happens when parsing the tables in target.html.
There are at least 9 tables in target.html; please check this with Firebug.
python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> tables=doc.xpath("//table")
>>> len(tables)
3
Answer 1:
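Thanks to sideshowbarker.
html5lib implements the HTML parsing algorithm that browsers follow, so the tree it builds should be closer to what Firefox shows than the tree from lxml's own HTML parser.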
First install html5lib with pip:
sudo pip3 install html5lib
Then parse the file with html5lib, using its lxml tree builder so that xpath still works:
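import html5lib

# Build an lxml tree with html5lib; namespaceHTMLElements=False keeps plain
# tag names, so '//div' matches without a namespace prefix.
with open('/tmp/target.html', 'rb') as f:
    doc = html5lib.parse(f, treebuilder='lxml', namespaceHTMLElements=False)
divs = doc.xpath('//div')
tables = doc.xpath('//table')
print(len(divs))
print(len(tables))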
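To cross-check the counts without a browser, one option is to count the start tags directly in the raw source. The sketch below uses only the standard library's html.parser, which tokenizes the markup without building a tree, so it is not affected by how a tree builder recovers from malformed HTML. The names TagCounter and count_tags are hypothetical helpers made up for this illustration, and the sample string stands in for the real file. Note that Firebug shows the live DOM, which can include elements added by scripts, so its numbers may still differ from a count of the source.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags as they appear in the source (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser passes the tag name already converted to lower case.
        self.counts[tag] += 1


def count_tags(markup):
    """Return a Counter mapping each tag name to its number of start tags."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


# Small stand-in document; the real input would be the text of target.html.
sample = "<div><table><tr><td><div>a</div></td></tr></table></div><DIV>b</DIV>"
counts = count_tags(sample)
print(counts["div"], counts["table"])
```

For the actual file, something like count_tags(open('/tmp/target.html', encoding='utf-8', errors='replace').read()) should work, assuming the file is UTF-8 encoded. If this count is close to what Firebug reports but well above the lxml.html result, that points at the tree builder rather than at the file.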
Source: https://stackoverflow.com/questions/51586527/why-cant-parse-all-div-elements-in-the-target-html-with-lxml-html