Python XML Parsing Algorithm Speed

Question

I'm currently parsing a large XML file of the following form in a Python Flask webapp on Heroku:

<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1">
       <li n="1">li 1 content</li>
       <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
       <li n="1">li 1 content</li>
       <li n="2">li 2 content</li>
    </chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1">
       <li n="1">li 1 content</li>
       <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
       <li n="1">li 1 content</li>
       <li n="2">li 2 content</li>
    </chapter>
  </volume>
</book>

The code that I use to parse it, analyze it, and display it through Flask is as follows:

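from flask import Flask, render_template, session
from lxml import etree

# Assumed setup; the original post does not show how app is created.
app = Flask(__name__)

parser = etree.XMLParser(recover=True)  # recover=True tolerates malformed markup
with open("books/filename.xml", "rb") as file:
    tree = etree.parse(file, parser)
root = tree.getroot()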

def getChapter(volume, chapter):
    i = 0
    data = []
    while True:
        try:
            data.append(root[volumeList().index(volume)][chapter-1][i].text)
        except IndexError:
            break
        i += 1
    if data == []:
        data = None
    return data

def volumeList():
    data = tree.xpath('//volume/@name')
    return data

def chapterCount(volume):
    currentChapter = 1
    count = 0
    while True:
        data = getChapter(volume, currentChapter)
        if data == None:
            break
        else:
            count += 1
            currentChapter += 1
    return count

def volumeNumerate():
    list = volumeList()
    i = 1
    dict = {}
    for element in list:
        dict[i] = element
        i += 1
    return dict

def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'],session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")

The issue I am having is that whenever Flask renders a volume with many chapters (whenever chapterCount(session['volume']) is greater than about 50), loading and processing the page takes a very long time. In comparison, if the app loads a volume with fewer than 10 to 15 chapters, loading is almost instantaneous, even as a live webapp. Is there a good way to optimize this and improve the speed and performance? Thanks a lot!

PS: For reference, this is my old getChapter function. I stopped using it because I don't want to refer to an individual 'li' element in the code and want the code to work with any generic XML file. It was considerably faster than the current getChapter function, though:

def OLDgetChapter(volume, chapter):
    data = tree.xpath('//volume[@name="%s"]/chapter[@n=%d]/li/text()'%(volume,chapter))
    if data == []:
        data = None
    return data

Thanks a lot!

Answer 1:

Have you heard about BeautifulSoup?

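BeautifulSoup does the tedious work of parsing XML for you. BeautifulSoup itself is written in Python, but with the "xml" feature it delegates the actual parsing to lxml, which is implemented in C.

I expect this to be much faster than the loop-based code (and much more readable):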

from bs4 import BeautifulSoup

filename = "test.xml"
# The "xml" feature requires lxml to be installed.
with open(filename) as f:
    soup = BeautifulSoup(f, "xml")

def chapterCount(volume_name):
    volume = soup.find("volume", attrs={"name": volume_name})
    # recursive=False restricts the search to direct children of the volume
    chapter_count = len(volume.find_all("chapter", recursive=False))
    return chapter_count

def getChapter(volume_name, chapter_number):
    volume = soup.find("volume", {"name": volume_name})
    # Attribute values are strings, so compare against str(chapter_number)
    chapter = volume.find("chapter", {"n": str(chapter_number)})
    # Skip the whitespace-only strings between child elements
    items = [ content for content in chapter.contents if content != "\n" ]
    return "\n".join([ item.contents[0] for item in items ])


# from now on, it's the same as your original code

def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'],session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")

Note that not only will the getChapter function be faster; the main point is that chapterCount no longer has to call it once per chapter to count the chapters in a specific volume. The two functions are now completely independent of each other.
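To make that cost concrete, here is a rough sketch that mirrors the question's loop-based getChapter and chapterCount and counts how often the volume names are looked up. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption made so the snippet runs without extra packages), and the names volume_list, get_chapter, chapter_count and lookups are illustrative only.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml (assumption)

XML = """<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li><li n="2">li 2 content</li></chapter>
    <chapter n="2"><li n="1">li 1 content</li><li n="2">li 2 content</li></chapter>
  </volume>
</book>"""
root = ET.fromstring(XML)
lookups = 0  # number of whole-document scans for volume names


def volume_list():
    """Scan the whole document for volume names, counting each scan."""
    global lookups
    lookups += 1
    return [v.get("name") for v in root.iter("volume")]


def get_chapter(volume, chapter):
    """Loop-based lookup, structured like the question's getChapter."""
    data, i = [], 0
    while True:
        try:
            # volume_list() is re-evaluated on every pass through the loop
            data.append(root[volume_list().index(volume)][chapter - 1][i].text)
        except IndexError:
            break
        i += 1
    return data or None


def chapter_count(volume):
    """Count chapters by fetching each one until a lookup comes back empty."""
    count = 0
    while get_chapter(volume, count + 1) is not None:
        count += 1
    return count


print(chapter_count("volume1name"), lookups)  # prints: 2 7
```

Even for two chapters of two items each, counting the chapters triggers seven scans: one per item, one more per chapter to hit the end of the items, and a final one to discover that chapter 3 does not exist. With 50 or more chapters, and each scan walking the whole document, that adds up quickly.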

Results from both functions:

>>> print(chapterCount("volume1name"))
2

>>> print(getChapter("volume1name", 2))
li 1 content
li 2 content

EDIT:

I just asked a question to see if there could be a faster way to count the chapters. Stay tuned :) - Update: the answer is that you can pass recursive=False so that find_all only searches the direct children of the volume instead of its entire subtree. Or use lxml directly.
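For the "use lxml directly" option, one possible sketch is to let XPath do the counting with its count() function. The sample document and the chapter_count name here are illustrative, and the volume name is passed as an XPath variable ($name) rather than formatted into the expression.

```python
from lxml import etree

XML = """<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
    <chapter n="2"><li n="1">li 1 content</li></chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
  </volume>
</book>"""
root = etree.fromstring(XML)


def chapter_count(volume_name):
    """Count the chapter elements that are direct children of one volume."""
    # count() is evaluated inside the XPath engine and comes back as a float
    expr = "count(//volume[@name=$name]/chapter)"
    return int(root.xpath(expr, name=volume_name))


print(chapter_count("volume1name"))  # prints: 2
print(chapter_count("volume2name"))  # prints: 1
```

An unknown volume name simply yields 0 here, since the node set being counted is empty.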

EDIT:

I just noticed that you call render_default_values in your view. You shouldn't do that, or at least you should call this function a different way. Because "render default values" means... well, render default values.

Allowing this function to render something else based on a global variable (session) is considered not very Pythonic and can lead to spaghetti code (unknown bugs, etc).
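One way to avoid the dependency on session, sketched under assumptions: gather the template values in a plain function that takes the volume and chapter as explicit arguments. BOOK, get_chapter, chapter_count and build_chapter_context are hypothetical names used only for this illustration; in the real app the two lookup functions would query the parsed XML instead of a dictionary.

```python
# Hypothetical in-memory stand-in for the parsed book (illustration only).
BOOK = {
    "volume1name": [
        ["li 1 content", "li 2 content"],
        ["li 1 content", "li 2 content"],
    ],
}


def get_chapter(volume, chapter):
    """Return the items of a 1-based chapter, or None if it does not exist."""
    chapters = BOOK.get(volume, [])
    if 1 <= chapter <= len(chapters):
        return chapters[chapter - 1]
    return None


def chapter_count(volume):
    """Return the number of chapters in a volume (0 if unknown)."""
    return len(BOOK.get(volume, []))


def build_chapter_context(volume, chapter):
    """Collect template values from explicit arguments, not from session."""
    return {
        "chapter": get_chapter(volume, chapter),
        "count": chapter_count(volume),
    }


print(build_chapter_context("volume1name", 2))
```

The view could then call render_template("index.html", **build_chapter_context(volume, chapter)), which keeps the inputs visible at the call site and makes the helper easy to test without a request.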

Answer 2:

If you are concerned about speed, instead of iterating over all volumes and chapters to find appropriate values of name and n attributes, get it in a single go with a single xpath expression (just noticed that this is exactly your old approach). But, instead of asking for li, ask for any element with *:

//volume[@name="%s"]/chapter[@n="%s"]/*/text()

where %s are placeholders for volume and chapter values passed in.

def getChapter(volume, chapter):
    return root.xpath('//volume[@name="%s"]/chapter[@n="%s"]/*/text()' % (volume, chapter))

Demo:

>>> from lxml import etree
>>> 
>>> parser = etree.XMLParser(recover=True)
>>> tree = etree.parse(open("test.xml"), parser)
>>> root = tree.getroot()
>>> 
>>> volume = 'volume1name'
>>> chapter = 2
>>> 
>>> xpath = '//volume[@name="%s"]/chapter[@n="%s"]/*/text()' % (volume, chapter)
>>> root.xpath(xpath)
['li 1 content', 'li 2 content']
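Since the volume name comes from the URL, formatting it into the expression with % will break (or change the query) as soon as the name contains a double quote. A hedged alternative is to use lxml's XPath variables, which are passed as keyword arguments to xpath(). The sample document and the get_chapter name below are illustrative.

```python
from lxml import etree

XML = """<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li><li n="2">li 2 content</li></chapter>
    <chapter n="2"><p>first</p><p>second</p></chapter>
  </volume>
  <volume n="2" name='say "hi"'>
    <chapter n="1"><li n="1">quoted</li></chapter>
  </volume>
</book>"""
root = etree.fromstring(XML)


def get_chapter(volume, chapter):
    """Return the text of every child element of one chapter, or None."""
    # $name and $n are XPath variables filled from the keyword arguments,
    # so the values are never spliced into the expression string.
    expr = "//volume[@name=$name]/chapter[@n=$n]/*/text()"
    return root.xpath(expr, name=volume, n=str(chapter)) or None


print(get_chapter("volume1name", 2))  # prints: ['first', 'second']
print(get_chapter('say "hi"', 1))     # prints: ['quoted']
```

The `or None` at the end only mirrors the question's convention of returning None for a missing chapter; drop it if an empty list is more convenient.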

Source: https://stackoverflow.com/questions/27673349/python-xml-parsing-algorithm-speed

Tags

python

xml

optimization

Flask

lxml