Question
Please download the file from Dropbox and save it as /tmp/target.html.
target.html
Open it in Firefox with Firebug to inspect the HTML structure.
It is clear that there are at least 10 div elements in target.html.
Now parse all the div elements in target.html with lxml.html:
python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> divs=doc.xpath("//div")
>>> len(divs)
4
The result is 4. Why can't the code above find the rest of the divs?
There are at least 10 divs in target.html.
The same thing happens when parsing the tables in target.html.
There are at least 9 tables in target.html; please check with Firebug.
python3
>>> import lxml.html
>>> doc=lxml.html.parse("/tmp/target.html")
>>> tables=doc.xpath("//table")
>>> len(tables)
3
Answer 1:
Thanks to sideshowbarker.
First, install html5lib with pip:
sudo pip3 install html5lib
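import html5lib

# html5lib implements the HTML5 parsing algorithm that browsers use;
# treebuilder='lxml' returns an lxml tree, so xpath() still works.
with open('/tmp/target.html', 'rb') as f:
    doc = html5lib.parse(f, treebuilder='lxml', namespaceHTMLElements=False)
divs = doc.xpath('//div')
tables = doc.xpath('//table')
print(len(divs))
print(len(tables))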
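As a cross-check that does not depend on how any parser builds its tree, the start tags can be counted directly in the raw markup with the standard library's html.parser. This is only a rough sketch: `TagCounter`, `count_tags`, and the `sample` string are hypothetical names made up for illustration, and the count reflects tags as written in the source, which may differ from what a browser shows after error recovery.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags as they appear in the markup; no tree is built."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        self.counts[tag] += 1


def count_tags(markup):
    """Return a Counter mapping tag name to number of start tags."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.counts


# Hypothetical sample markup, standing in for the contents of target.html.
sample = "<div><table><tr><td>a</td></tr></table></div><div>b</div>"
counts = count_tags(sample)
print(counts["div"], counts["table"])
```

To apply this to the real file, pass the text of /tmp/target.html to `count_tags` and compare the numbers with `len(divs)` and `len(tables)` from each parser.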
Source: https://stackoverflow.com/questions/51586527/why-cant-parse-all-div-elements-in-the-target-html-with-lxml-html