Parsing very large HTML file with Python (ElementTree?)

*爱你&永不变心* submitted on 2019-12-08 00:10:41

Question


I asked about using BeautifulSoup to parse a very large (270MB) HTML file and getting a memory error, and was pointed toward ElementTree as a solution.

I was trying to use their event-driven parsing, documented here. Testing it with the smaller settings file worked fine:

>>> import xml.etree.ElementTree as ET
>>> settings = open('S:\\Documents\\FacebookData\\html\\settings.htm')
>>> for event, element in ET.iterparse(settings, events=("start", "end")):
    print("%5s, %4s, %s" % (event, element.tag, element.text))

This successfully prints out the elements. However, running the same code on 'messages.htm' instead of 'settings.htm', just to check that it works before starting the actual coding, gives this result:

Traceback (most recent call last):
  File "<pyshell#16>", line 1, in <module>
    for event, element in ET.iterparse(source, events=("start", "end")):
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1294, in __next__
    for event in self._parser.read_events():
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1277, in read_events
    raise event
  File "C:\Program Files (x86)\Python\lib\xml\etree\ElementTree.py", line 1235, in feed
    self._parser.feed(data)
  File "<string>", line None
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 1, column 6

I'm wondering if this is because ET is simply better suited to parsing XML documents. If that is the case and there's no workaround, then I'm back to square one. Any suggestions on how to parse this file, and on how to debug along the way, would be greatly appreciated!


Answer 1:


HTML is not well-formed XML. That is why, in some cases, you have to use HTMLParser instead of ElementTree to parse an HTML file.
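
As a rough sketch of that approach, the standard library's html.parser.HTMLParser can be fed the file in chunks, so the whole document never has to sit in memory at once. The class name TextOfTag, the helper feed_in_chunks, the choice of the td tag, and the chunk size are all assumptions made for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser


class TextOfTag(HTMLParser):
    """Hypothetical helper: passes the text of each <tag>...</tag> to a callback."""

    def __init__(self, tag, on_text):
        super().__init__(convert_charrefs=True)
        self._tag = tag
        self._on_text = on_text
        self._depth = 0    # how many matching tags are currently open
        self._parts = []   # text fragments seen inside the open tag

    def handle_starttag(self, tag, attrs):
        if tag == self._tag:
            self._depth += 1

    def handle_data(self, data):
        # Text may arrive in several pieces, so collect and join later.
        if self._depth:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self._on_text("".join(self._parts))
                self._parts = []


def feed_in_chunks(fileobj, parser, chunk_size=64 * 1024):
    """Feed a text-mode file object to the parser one chunk at a time."""
    for chunk in iter(lambda: fileobj.read(chunk_size), ""):
        parser.feed(chunk)
    parser.close()
```

With a real file this might be called as `feed_in_chunks(open(path, encoding="utf-8"), TextOfTag("td", print))`, where the path and the encoding are guesses that would need to match the actual file. HTMLParser tolerates unclosed and mismatched tags, which is what ElementTree rejects.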

Best regards, Emmanuel




Answer 2:


A good solution for parsing HTML or XML is lxml with XPath.

To use XPath:

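from lxml import etree

# Read the whole document into memory and build an HTML tree.
with open('result.html', 'r') as f:
    doc = etree.HTML(f.read())

# Print the cell texts of every table row with class "trmenu1".
for tr in doc.xpath('//table/tr[@class="trmenu1"]'):
    print(tr.xpath('./td/text()'))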
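
For a file as large as the one in the question, reading everything with read() may itself run out of memory. One possible alternative, sketched here on the assumption that the rows of interest are tr elements as in the snippet above, is lxml's iterparse with html=True, which hands back each element once it is complete so that it can be discarded afterwards. The function name iter_row_texts and its row_class parameter are hypothetical.

```python
from lxml import etree


def iter_row_texts(path, row_class="trmenu1"):
    """Yield the <td> texts of each matching <tr> without keeping the full tree."""
    # html=True selects the lenient HTML parser; tag="tr" limits events to rows.
    for _event, tr in etree.iterparse(path, events=("end",), tag="tr", html=True):
        if tr.get("class") == row_class:
            # Copy to plain str so the results hold no reference to the tree.
            yield [str(text) for text in tr.xpath("./td/text()")]
        # Free the row just processed and any earlier siblings still attached.
        tr.clear()
        while tr.getprevious() is not None:
            del tr.getparent()[0]
```

Iterating with `for cells in iter_row_texts('result.html'): print(cells)` would then print one list per matching row while memory use stays roughly flat.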


Source: https://stackoverflow.com/questions/31225193/parsing-very-large-html-file-with-python-elementtree
