Question
I have some very big XML files (around ~100-150 MB each).
One element in my XML is M
(for member), which is a child of HH
(household) -
i.e. - each household contains one or more members.
What I need to do is take all the members that satisfy some conditions and then run them through a rather complicated function. The conditions can change, and can apply both to the household and to the member - e.g. only members from households with high income (a constraint on the household) whose age is between 18 and 49 (a constraint on the member).
This is what I'm doing:
import lxml.etree as ET

all_members = []
tree = ET.parse(whole_path)
root = tree.getroot()
HH = tree.xpath('//HH')  # get all the households
for H in HH:
    # check if the household satisfies the condition
    if is_valid_hh(H):
        M = H.xpath('.//M')  # the members of this household
        for m in M:
            if is_valid_member(m):
                all_members.append(m)
for member in all_members:
    # do something complicated
    ...
The problem with this is that it takes all my memory (and I have 32 GB)! How can I iterate over XML elements more efficiently?
Any help will be appreciated...
Answer 1:
etree is going to consume a lot of memory (yes, even with iterparse()), and sax is really clunky. However, pulldom to the rescue!
from xml.dom import pulldom

doc = pulldom.parse('large.xml')
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == 'HH':
        # the node is 'empty' here - its children have not been read yet
        doc.expandNode(node)
        # now we've got the whole <HH> subtree
        if is_valid_hh(node):
            ...  # do things
It's one of those libraries that nobody seems to know about unless they've had to use it. Docs at e.g. https://docs.python.org/3.7/library/xml.dom.pulldom.html
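To make the pattern concrete, here is a minimal self-contained sketch of the same idea on a small in-memory document. The HH/M element names match the question; the income and age attributes, and the two is_valid_* predicates, are hypothetical stand-ins for the real conditions:

from xml.dom import pulldom

# Tiny document standing in for the 100-150 MB file
XML = """<root>
  <HH income="high"><M age="30"/><M age="12"/></HH>
  <HH income="low"><M age="25"/></HH>
</root>"""

def is_valid_hh(hh):
    # placeholder household condition
    return hh.getAttribute('income') == 'high'

def is_valid_member(m):
    # placeholder member condition
    return 18 <= int(m.getAttribute('age')) <= 49

doc = pulldom.parseString(XML)  # use pulldom.parse(path) for a file
selected = []
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == 'HH':
        doc.expandNode(node)  # pull just this <HH> subtree into memory
        if is_valid_hh(node):
            for m in node.getElementsByTagName('M'):
                if is_valid_member(m):
                    selected.append(m.getAttribute('age'))

print(selected)  # only members aged 18-49 from high-income households

Because only one expanded <HH> subtree is held in memory at a time (the minidom nodes for skipped households are never expanded), peak memory stays proportional to the largest household, not to the whole file.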
Source: https://stackoverflow.com/questions/47963080/python-xml-iterating-over-elements-takes-a-lot-of-memory