Using BeautifulSoup on very large HTML file - memory error?

Posted by 丶灬走出姿态 on 2019-12-24 03:15:53

Question


I'm learning Python by working on a project - a Facebook message analyzer. I downloaded my data, which includes a messages.htm file of all my messages. I'm trying to write a program to parse this file and output data (# of messages, most common words, etc.)

However, my messages.htm file is 270 MB. Creating a BeautifulSoup object in the shell works fine for every other file I've tested (all under 1 MB), but it fails for messages.htm. Here's the error:

>>> mf = open('messages.htm', encoding="utf8")
>>> ms = bs4.BeautifulSoup(mf)
Traceback (most recent call last):
  File "<pyshell#73>", line 1, in <module>
    ms = bs4.BeautifulSoup(mf)
  File "C:\Program Files (x86)\Python\lib\site-packages\bs4\__init__.py", line 161, in __init__
    markup = markup.read()
  File "C:\Program Files (x86)\Python\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError

So I can't even begin working with this file. This is my first time tackling something like this and I'm only just learning Python so any suggestions would be much appreciated!


Answer 1:


Since this is a learning exercise, I won't give too much code. You may be better off with ElementTree's iterparse, which lets you process elements as they are parsed instead of loading the whole document into memory. As far as I am aware, BeautifulSoup has no equivalent.

To get you started:

import xml.etree.ElementTree as ET

# open in binary mode so the parser handles the encoding itself
with open('messages.htm', 'rb') as source:

    # get an iterable
    context = ET.iterparse(source, events=("start", "end"))

    # turn it into an iterator
    context = iter(context)

    # get the root element (Python 3: next(context), not context.next())
    event, root = next(context)

    for event, elem in context:
        if event == "end":
            # do something with elem

            # get rid of the elements after processing
            root.clear()
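
As a rough illustration of the loop above, here is a minimal sketch that counts messages and tallies words as each element finishes parsing. The tag and class names (`div`, `message`, `p`), the `analyze` helper, and the sample markup are assumptions made up for this example, not the real layout of messages.htm. It clears each finished message element rather than the root, because messages nested inside a still-open parent would otherwise stay referenced. Note that iterparse is an XML parser, so it raises a ParseError on HTML that is not well-formed.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample standing in for messages.htm; the real layout may differ.
SAMPLE = b"""<html><body>
<div class="thread">
<div class="message"><p>hello there</p></div>
<div class="message"><p>hello again</p></div>
</div>
</body></html>"""


def analyze(source):
    """Stream-parse `source` and return (message_count, word_counts)."""
    message_count = 0
    words = Counter()

    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the "start" of the root element; consume it.
    event, root = next(context)

    for event, elem in context:
        if event != "end":
            continue
        # Assumed markup: each message is a <div class="message"> element.
        if elem.tag == "div" and elem.get("class") == "message":
            message_count += 1
            text = " ".join(elem.itertext())
            words.update(re.findall(r"\w+", text.lower()))
            # Free this message's text and children; an empty element
            # remains in its parent, which is cheap by comparison.
            elem.clear()

    return message_count, words


count, words = analyze(io.BytesIO(SAMPLE))
print(count, words.most_common(1))
```

For the real file, pass `open('messages.htm', 'rb')` instead of the `io.BytesIO` sample and adjust the tag test to whatever actually wraps a message.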

If you're set on using BeautifulSoup, you could look into splitting the source HTML into manageable chunks, but you'd need to be careful to keep the thread-message structure and ensure you keep valid HTML.
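
One way to sketch the chunking idea, assuming every thread begins with a recognizable opening tag and threads are not nested. The marker `<div class="thread">` and the helper name `iter_thread_chunks` are hypothetical; inspect the real file to see what actually delimits a thread. The generator reads the file in fixed-size blocks and yields one thread's HTML at a time, so only about one thread is held in memory.

```python
import io

# Hypothetical delimiter; inspect messages.htm for the real one.
THREAD_MARKER = '<div class="thread">'


def iter_thread_chunks(stream, marker=THREAD_MARKER, block_size=64 * 1024):
    """Yield strings that each start at `marker` and hold one thread."""
    buffer = ""
    seen_marker = False
    while True:
        block = stream.read(block_size)
        if not block:
            break
        buffer += block
        while True:
            # Once inside a thread, skip the marker the buffer starts with.
            start = len(marker) if seen_marker else 0
            pos = buffer.find(marker, start)
            if pos == -1:
                if not seen_marker:
                    # Still in the page header: keep only a short tail
                    # in case the marker is split across two blocks.
                    buffer = buffer[-len(marker):]
                break
            if seen_marker:
                yield buffer[:pos]
            # Drop the finished thread (or the page header) from the buffer.
            buffer = buffer[pos:]
            seen_marker = True
    if seen_marker and buffer:
        # Last thread, including any trailing page footer.
        yield buffer


sample = io.StringIO(
    "<html><body>"
    '<div class="thread">A<p>hi</p></div>'
    '<div class="thread">B<p>yo</p></div>'
    "</body></html>"
)
# A tiny block size forces the marker to straddle block boundaries.
chunks = list(iter_thread_chunks(sample, block_size=16))
print(len(chunks))
```

Each chunk could then be handed to `bs4.BeautifulSoup(chunk, "html.parser")` on its own. The last chunk carries the page's closing tags, and the sketch breaks if the marker text ever appears inside a message body, so treat it as a starting point rather than a finished splitter.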



Source: https://stackoverflow.com/questions/31201434/using-beautifulsoup-on-very-large-html-file-memory-error
