Parsing large XML using iterparse() consumes too much memory. Any alternative?


Question


I am using Python 2.7 with the latest lxml library. I am parsing a large XML file with a very homogeneous structure and millions of elements. I thought lxml's iterparse would not build an internal tree as it parses, but apparently it does, since memory usage grows until the process crashes (at around 1 GB). Is there a way to parse a large XML file with lxml without using a lot of memory?

I saw the target parser interface as one possibility, but I'm not sure if that will work any better.
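For reference, here is a minimal sketch of what the target parser interface looks like in lxml; the TagCounter class, the 'item' tag, and the 'big.xml' filename are assumptions made up for illustration:

import lxml.etree as ET

class TagCounter(object):
    # A parser target never builds a tree: lxml simply calls these
    # methods as it encounters markup, so memory use stays flat.
    def __init__(self):
        self.count = 0
    def start(self, tag, attrib):
        if tag == 'item':  # hypothetical tag name
            self.count += 1
    def end(self, tag):
        pass
    def data(self, data):
        pass
    def close(self):
        # Whatever close() returns becomes the result of ET.parse().
        return self.count

parser = ET.XMLParser(target=TagCounter())
count = ET.parse('big.xml', parser)  # hypothetical filename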


Answer 1:


Try using Liza Daly's fast_iter:

def fast_iter(context, func, args=None, kwargs=None):
    # http://www.ibm.com/developerworks/xml/library/x-hiperfparse/
    # Author: Liza Daly
    args = args or []
    kwargs = kwargs or {}
    for event, elem in context:
        func(elem, *args, **kwargs)
        # Free the processed element's own contents.
        elem.clear()
        # Also delete earlier siblings that are no longer needed.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context

fast_iter removes each element from the tree after it has been processed, along with earlier sibling elements (possibly of other tags) that are no longer needed.

It could be used like this:

import lxml.etree as ET

def process_element(elem):
    ...

context = ET.iterparse(filename, events=('end',), tag=...)
fast_iter(context, process_element)
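For instance, a hypothetical run that counts every <item> element, using the fast_iter defined above (the tag name 'item' and the filename 'big.xml' are made up for this sketch):

import lxml.etree as ET

counts = {'items': 0}

def count_item(elem, counts):
    counts['items'] += 1

context = ET.iterparse('big.xml', events=('end',), tag='item')
fast_iter(context, count_item, args=[counts])
print(counts['items'])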



Answer 2:


I had this problem and solved it with a hint from http://effbot.org/zone/element-iterparse.htm#incremental-parsing:

import lxml.etree as ET

elems = ET.Element('MyElements')
for event, elem in ET.iterparse(filename):
    if is_needed(elem):  # implement this condition however you like
        elems.append(elem)
    else:
        # Discard elements we don't want to keep.
        elem.clear()

This gives you a tree containing only the elements you need, without using unnecessary memory during parsing.
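As an illustration only, is_needed could be something like this (the <record> tag and the status attribute are invented for the example):

def is_needed(elem):
    # Hypothetical filter: keep <record> elements marked active.
    return elem.tag == 'record' and elem.get('status') == 'active'

print(ET.tostring(elems, pretty_print=True))

Afterwards, elems behaves like any other element: you can iterate it, run XPath over it, or serialize it as shown.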



Source: https://stackoverflow.com/questions/7972823/parsing-large-xml-using-iterparse-consumes-too-much-memory-any-alternative
