Reading 1000s of XML documents with BeautifulSoup

Submitted by 流过昼夜 on 2019-12-23 05:45:10

Question


I'm trying to read a bunch of XML files and do stuff to them. The first thing I want to do is rename them based on a number that's inside each file.

You can see a sample of the data here (warning: this will initiate a download of a 108 MB zip file!). That's a huge XML file with thousands of smaller XML files inside it. I've broken those out into individual files. I want to rename the files based on a number inside each one (part of preprocessing). I have the following code:

from __future__ import print_function
from bs4 import BeautifulSoup # To get everything
import os

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]

    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print (full_filename)
        f = open(full_filename, "r")
        xml = f.read()
        soup = BeautifulSoup(xml)
        del xml
        del soup
        f.close()

If I comment out the "soup =" and "del" lines, it works perfectly. If I add the "soup = ..." line back in, it works for a moment and then eventually craps out - it just crashes the Python kernel. I'm using Enthought Canopy, but I've tried running it from the command line and it craps out there, too.

I thought, perhaps, it was not deallocating the space for the variable "soup", so I tried adding the "del" statements. Same problem.

Any thoughts on how to circumvent this? I'm not wedded to BeautifulSoup. If there's a better way of doing this, I would love it, but I need a little sample code.


Answer 1:


Try using cElementTree.parse() from Python's standard xml library instead of BeautifulSoup. 'Soup is great for parsing normal web pages, but cElementTree is blazing fast.

Like this:

import xml.etree.cElementTree as cET

# ...

def rename_xml_files(directory):
    xml_files = [xml_file for xml_file in os.listdir(directory)]

    for filename in xml_files:
        filename = filename.strip()
        full_filename = directory + "/" + filename
        print(full_filename)
        parsed = cET.parse(full_filename)
        del parsed

If your XML is formatted correctly, this should parse it. If your machine still cannot handle all that data in memory, you should look into streaming the XML, as in the sketch below.
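
A minimal sketch of that streaming approach using iterparse from the same module; the tag name "record" is a placeholder here, since the actual element names in your files aren't shown:

import xml.etree.cElementTree as cET

def stream_xml_file(full_filename, record_tag="record"):
    # iterparse yields each element as its closing tag is read, so the
    # whole document never has to be held in memory at once.
    for _event, elem in cET.iterparse(full_filename):
        if elem.tag == record_tag:
            # ... per-record work goes here; elem's children are fully parsed ...
            elem.clear()  # free the record's subtree once it has been handled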




Answer 2:


I would not separate that file out into many small files and then process them further; I would process them all in one go.

I would just use a streaming XML parser API on the master file, get the name, and write out each sub-file once with the correct name.

There is no need for BeautifulSoup, which is primarily designed to handle HTML and uses a document model rather than a streaming parser.

What you are doing does not require building an entire DOM just to get at a single element.
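
A hedged sketch of that one-pass split with iterparse from the standard library; the record tag "document" and the child element "number" are assumptions about the master file's structure, so swap in the real element names:

import os
import xml.etree.cElementTree as cET

def split_master_file(master_path, out_dir, record_tag="document"):
    # Stream through the master file and write each sub-document out
    # under the number found inside it, without building the full DOM.
    for _event, elem in cET.iterparse(master_path):
        if elem.tag == record_tag:
            number = elem.findtext("number")  # assumed location of the number
            out_path = os.path.join(out_dir, "{0}.xml".format(number))
            cET.ElementTree(elem).write(out_path)
            elem.clear()  # release the sub-document's memory as we go

Clearing each sub-document after it is written keeps memory use roughly flat no matter how large the master file is.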



Source: https://stackoverflow.com/questions/30267755/reading-1000s-of-xml-documents-with-beautifulsoup
