Using BeautifulSoup on very large HTML file - memory error?

问题

I'm learning Python by working on a project - a Facebook message analyzer. I downloaded my data, which includes a messages.htm file of all my messages. I'm trying to write a program to parse this file and output data (# of messages, most common words, etc.)

However, my messages.htm file is 270MB. When creating a BeautifulSoup object in the shell for testing, any other file (all < 1MB) works just fine. But I can't create a bs object of messages.htm. Here's the error:

>>> mf = open('messages.htm', encoding="utf8")
>>> ms = bs4.BeautifulSoup(mf)
Traceback (most recent call last):
  File "<pyshell#73>", line 1, in <module>
    ms = bs4.BeautifulSoup(mf)
  File "C:\Program Files (x86)\Python\lib\site-packages\bs4\__init__.py", line 161, in __init__
markup = markup.read()
  File "C:\Program Files (x86)\Python\lib\codecs.py", line 319, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
MemoryError

So I can't even begin working with this file. This is my first time tackling something like this and I'm only just learning Python so any suggestions would be much appreciated!

回答1:

As you're using this as a learning exercise, I won't give too much code. You may be better off with ElementTree's iterparse to allow you to process as you parse. BeautifulSoup doesn't have this functionality as far as I am aware.

To get you started:

import xml.etree.cElementTree as ET

with open('messages.htm') as source:

    # get an iterable
    context = ET.iterparse(source, events=("start", "end"))

    # turn it into an iterator
    context = iter(context)

    # get the root element
    event, root = context.next()

    for event, elem in context:
        # do something with elem

        # get rid of the elements after processing
        root.clear()

If you're set on using BeautifulSoup, you could look into splitting the source HTML into manageable chunks, but you'd need to be careful to keep the thread-message structure and ensure you keep valid HTML.

来源：https://stackoverflow.com/questions/31201434/using-beautifulsoup-on-very-large-html-file-memory-error

标签

python

html

parsing

beautifulsoup

html-parsing