Question
I'm currently parsing a large XML file of the following form in a Python Flask web app on Heroku:
<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
</book>
The code that I use to parse it, analyze it, and display it through Flask is as follows:
from lxml import etree

file = open("books/filename.xml")
parser = etree.XMLParser(recover=True)
tree = etree.parse(file, parser)
root = tree.getroot()

def getChapter(volume, chapter):
    i = 0
    data = []
    while True:
        try:
            data.append(root[volumeList().index(volume)][chapter-1][i].text)
        except IndexError:
            break
        i += 1
    if data == []:
        data = None
    return data

def volumeList():
    data = tree.xpath('//volume/@name')
    return data

def chapterCount(volume):
    currentChapter = 1
    count = 0
    while True:
        data = getChapter(volume, currentChapter)
        if data == None:
            break
        else:
            count += 1
            currentChapter += 1
    return count

def volumeNumerate():
    list = volumeList()
    i = 1
    dict = {}
    for element in list:
        dict[i] = element
        i += 1
    return dict

def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'],session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")
The issue I am having is that whenever Flask renders a volume with many chapters (whenever chapterCount(session['volume']) is greater than about 50), loading and processing the page takes a very long time. In comparison, if the app loads a volume with fewer than 10 to 15 chapters, loading is almost instantaneous, even as a live web app. Is there a good way to optimize this and improve the speed and performance? Thanks a lot!
PS: For reference, this is my old getChapter function, which I stopped using because I don't want to refer to an individual li element in the code and want the code to work with any generic XML file. It was considerably faster than the current getChapter function, though:
def OLDgetChapter(volume, chapter):
    data = tree.xpath('//volume[@name="%s"]/chapter[@n=%d]/li/text()'%(volume,chapter))
    if data == []:
        data = None
    return data
Thanks a lot!
Answer 1:
Have you heard of BeautifulSoup?
BeautifulSoup does the tedious work of navigating the XML for you, and with the "xml" option it hands the actual parsing to lxml, which is implemented in C.
I expect this to be faster than your current loop, and much more readable:
from bs4 import BeautifulSoup

filename = "test.xml"
soup = BeautifulSoup(open(filename), "xml")

def chapterCount(volume_name):
    volume = soup.find("volume", attrs={"name": volume_name})
    chapter_count = len(volume.find_all("chapter", recursive=False))
    return chapter_count

def getChapter(volume_name, chapter_number):
    volume = soup.find("volume", {"name": volume_name})
    chapter = volume.find("chapter", {"n": chapter_number})
    items = [content for content in chapter.contents if content != "\n"]
    return "\n".join([item.contents[0] for item in items])

# from now on, it's the same as your original code
def render_default_values(template, **kwargs):
    chapter = getChapter(session['volume'],session['chapter'])
    count = chapterCount(session['volume'])
    return render_template(template, chapter=chapter, count=count, **kwargs)

@app.route('/<volume>/<int:chapter>')
def goto(volume, chapter):
    session['volume'] = volume
    session['chapter'] = chapter
    return render_default_values("index.html")
Note that not only will the getChapter function be faster; the main point is that you no longer have to call it once per chapter when you want to count the chapters in a specific volume through chapterCount. The two functions are now completely independent of each other.
Results from both functions:
>>> print(chapterCount("volume1name"))
2
>>> print(getChapter("volume1name", 2))
li 1 content
li 2 content
EDIT:
I asked a separate question to see whether there is a faster way to count the chapters. Update: the answer is that you can pass recursive=False so that find_all only looks at the direct children of the volume instead of searching its entire subtree. Or use lxml directly.
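As a rough illustration of the "use lxml directly" option, the sketch below counts a volume's chapters by looking only at its direct children, without reading any chapter content. The names SAMPLE, find_volume and chapter_count are made up for this example, and the fallback to the standard library's ElementTree is an assumption added only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Assumption for this sketch: the stdlib module offers the same calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical, shortened sample in the same shape as the question's file.
SAMPLE = """
<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
    <chapter n="2"><li n="1">li 1 content</li></chapter>
  </volume>
  <volume n="2" name="volume2name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
  </volume>
</book>
"""

root = etree.fromstring(SAMPLE)

def find_volume(name):
    """Return the <volume> element whose name attribute equals `name`, or None."""
    for volume in root.iter("volume"):
        if volume.get("name") == name:
            return volume
    return None

def chapter_count(name):
    """Count the direct <chapter> children of a volume; 0 if the volume is unknown."""
    volume = find_volume(name)
    return 0 if volume is None else len(volume.findall("chapter"))
```

Because the count comes from the volume element itself, its cost no longer depends on how many li elements each chapter contains.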
EDIT:
I just noticed that you call render_default_values in your view. You shouldn't do that, or at least you should give the function a different name, because "render default values" means, well, rendering default values.
Letting this function render something else depending on global state (session) is not very Pythonic and can lead to spaghetti code and hard-to-trace bugs.
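One way to read that advice is to pass the volume and chapter explicitly instead of reading them from session. The sketch below is only an illustration: BOOK is a hypothetical in-memory stand-in for the parsed XML, and get_chapter, chapter_count and chapter_context are names invented here, not part of the original code.

```python
# Hypothetical stand-in for the parsed book: volume name -> list of chapters,
# where each chapter is a list of item texts.
BOOK = {
    "volume1name": [
        ["li 1 content", "li 2 content"],
        ["li 1 content", "li 2 content"],
    ],
}

def get_chapter(volume, chapter):
    """Return the item texts of a 1-based chapter, or None if it does not exist."""
    chapters = BOOK.get(volume, [])
    if 1 <= chapter <= len(chapters):
        return chapters[chapter - 1]
    return None

def chapter_count(volume):
    """Return how many chapters the volume has (0 for an unknown volume)."""
    return len(BOOK.get(volume, []))

def chapter_context(volume, chapter):
    """Build the template variables from explicit arguments, not global state."""
    return {
        "chapter": get_chapter(volume, chapter),
        "count": chapter_count(volume),
    }

# In the Flask view this could be used as (not executed here):
#     return render_template("index.html", **chapter_context(volume, chapter))
```

With the inputs passed in as arguments, the helper can be tested without a request context, and the view remains the only place that touches session.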
Answer 2:
If you are concerned about speed, don't iterate over all volumes and chapters to find the matching name and n attributes; get the result in one go with a single XPath expression (I just noticed that this is exactly your old approach). But instead of asking for li, ask for any element with *:
//volume[@name="%s"]/chapter[@n="%s"]/*/text()
where each %s is a placeholder for the volume and chapter values passed in.
def getChapter(volume, chapter):
    return root.xpath('//volume[@name="%s"]/chapter[@n="%s"]/*/text()' % (volume, chapter))
Demo:
>>> from lxml import etree
>>>
>>> parser = etree.XMLParser(recover=True)
>>> tree = etree.parse(open("test.xml"), parser)
>>> root = tree.getroot()
>>>
>>> volume = 'volume1name'
>>> chapter = 2
>>>
>>> xpath = '//volume[@name="%s"]/chapter[@n="%s"]/*/text()' % (volume, chapter)
>>> root.xpath(xpath)
['li 1 content', 'li 2 content']
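A possible refinement, not part of the answer above: lxml can bind XPath variables ($name, $n) from keyword arguments, which avoids building the expression with % formatting and keeps working when a volume name contains quotes. The sketch below assumes lxml is installed; SAMPLE, CHAPTER_TEXT and get_chapter are names made up for this example.

```python
from lxml import etree

# Hypothetical sample; the second volume name deliberately contains double quotes.
SAMPLE = """
<book name="bookname">
  <volume n="1" name="volume1name">
    <chapter n="1"><li n="1">li 1 content</li></chapter>
    <chapter n="2">
      <li n="1">li 1 content</li>
      <li n="2">li 2 content</li>
    </chapter>
  </volume>
  <volume n="2" name='say "hi"'>
    <chapter n="1"><li n="1">quoted</li></chapter>
  </volume>
</book>
"""

root = etree.fromstring(SAMPLE)

# Compile the expression once; $name and $n are filled in on every call.
CHAPTER_TEXT = etree.XPath('//volume[@name=$name]/chapter[@n=$n]/*/text()')

def get_chapter(volume, chapter):
    """Return the text of every child element of the requested chapter."""
    return CHAPTER_TEXT(root, name=volume, n=str(chapter))
```

Compiling the expression once with etree.XPath also means it is not re-parsed on every request, although whether that matters in practice depends on the size of the document.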
Source: https://stackoverflow.com/questions/27673349/python-xml-parsing-algorithm-speed