Parsing very large HTML file with Python (ElementTree?)

Question

I asked about using BeautifulSoup to parse a very large (270 MB) HTML file, which raised a memory error, and was pointed toward ElementTree as a solution.

I was trying to use ElementTree's event-driven parsing, documented here. Testing it with the smaller settings file worked fine:

>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
    print("%5s, %4s, %s" % (event, element.tag, element.text))

This successfully prints out the elements. However, running the same code on 'messages.htm' instead of 'settings.htm', just to see whether it works before starting the actual coding, gives this result:

Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6

I'm wondering if this is because ET is just better suited to parsing XML documents? If this is the case, and there's no workaround, then I'm back to square one. Any suggestions on how to parse this file, along with how to debug along the way would be greatly appreciated!

Answer 1:

HTML is not necessarily well-formed XML. That is why, in some cases, you have to use HTMLParser instead of ElementTree to parse an HTML file.

Best regards, Emmanuel
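
For a file this large, one way to apply the HTMLParser suggestion is to feed the parser in chunks so the whole document never sits in memory at once. This is only a minimal sketch using the standard library's html.parser module: the TagCounter class, the count_tags helper, the 1 MB chunk size, and the UTF-8 encoding are all assumptions made for illustration, and counting tags stands in for whatever extraction is actually needed.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts start tags; a stand-in for real extraction logic (hypothetical)."""

    def __init__(self):
        super().__init__()
        self.tag_counts = {}

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; tag names arrive lowercased.
        self.tag_counts[tag] = self.tag_counts.get(tag, 0) + 1


def count_tags(path, chunk_size=1024 * 1024, encoding="utf-8"):
    """Feed the file to the parser chunk by chunk instead of reading it whole."""
    parser = TagCounter()
    with open(path, "r", encoding=encoding, errors="replace") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # feed() buffers any tag that is split across two chunks.
            parser.feed(chunk)
    parser.close()
    return parser.tag_counts
```

Calling count_tags('messages.htm') would then return a dict of tag names and counts. Since HTMLParser does not build a tree, memory use should depend mainly on the chunk size and on what the handler methods choose to keep.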

Answer 2:

A good option for parsing HTML or XML is lxml together with XPath.

To use xpath:

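from lxml import etree

# Read the file and parse it with lxml's lenient HTML parser
with open('result.html', 'r') as f:
    data = f.read()
doc = etree.HTML(data)

# Print the cell texts of every table row with class "trmenu1"
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print(tr.xpath('./td/text()'))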
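
The snippet above reads the whole file into memory, which may be a problem for a 270 MB document. A possible variation, sketched here under assumptions, is lxml's iterparse with html=True, which reports elements as they are completed so that processed rows can be discarded. The iter_row_cells name and the row_class parameter are hypothetical, and unlike the XPath above this version matches any tr with the given class, not only direct children of a table.

```python
from lxml import etree


def iter_row_cells(path, row_class="trmenu1"):
    """Yield the cell texts of each matching <tr>, discarding rows as it goes."""
    # html=True selects lxml's lenient HTML parser; tag="tr" limits events to rows.
    for _event, tr in etree.iterparse(path, events=("end",), tag="tr", html=True):
        if tr.get("class") == row_class:
            # Copy the texts out as plain strings before the row is cleared.
            yield [str(text) for text in tr.xpath("./td/text()")]
        # Drop the row's content and any earlier siblings already processed.
        tr.clear()
        while tr.getprevious() is not None:
            del tr.getparent()[0]
```

It could be used as: for cells in iter_row_cells('result.html'): print(cells). How much memory this actually saves depends on the document's structure, so it is worth trying on a smaller file first.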

Source: https://stackoverflow.com/questions/31225193/parsing-very-large-html-file-with-python-elementtree

Tags

python

html

parsing

html-parsing

elementtree