Question
I have some very big XML files (around ~100-150 MB each).
One element in my XML is M
(for member), which is a child of HH
(household) -
i.e. - each household contains one or more members.
What I need to do is take all the members that satisfy some conditions and then run them through a rather complicated function. The conditions can change, and can apply both to the household and to the member - e.g. only members from households with high income (a constraint on the household) whose age is between 18 and 49 (a constraint on the member).
This is what I'm doing:
import lxml.etree as ET

all_members = []
tree = ET.parse(whole_path)
root = tree.getroot()
HH = tree.xpath('//HH')  # get all the households
for H in HH:
    # check if the household satisfies the condition
    if is_valid_hh(H):
        M = H.xpath('.//M')  # the members of this household
        for m in M:
            if is_valid_member(m):
                all_members.append(m)
for member in all_members:
    # do something complicated
    ...
The problem with this is that it takes all my memory (and I have 32 GB)! How can I iterate over XML elements more efficiently?
Any help will be appreciated...
Answer 1:
etree is going to consume a lot of memory (yes, even with iterparse()), and sax is really clunky. However, pulldom to the rescue!
from xml.dom import pulldom

doc = pulldom.parse('large.xml')
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == 'HH':
        # the node is 'empty' here - its children have not been read yet
        doc.expandNode(node)
        # now we've got the whole <HH> subtree
        if is_valid_hh(node):
            ...  # do things
It's one of those libraries that nobody seems to know about unless they've had to use it. Docs at e.g. https://docs.python.org/3.7/library/xml.dom.pulldom.html
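To make the pattern concrete, here is a minimal self-contained sketch of the same idea on a small in-memory document. The HH/M element names match the question; the income and age attributes, and the two is_valid_* predicates, are hypothetical stand-ins for the real conditions:

from xml.dom import pulldom

# Tiny document standing in for the 100-150 MB file
XML = """<root>
  <HH income="high"><M age="30"/><M age="12"/></HH>
  <HH income="low"><M age="25"/></HH>
</root>"""

def is_valid_hh(hh):
    # placeholder household condition
    return hh.getAttribute('income') == 'high'

def is_valid_member(m):
    # placeholder member condition
    return 18 <= int(m.getAttribute('age')) <= 49

doc = pulldom.parseString(XML)  # use pulldom.parse(path) for a file
selected = []
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == 'HH':
        doc.expandNode(node)  # pull just this <HH> subtree into memory
        if is_valid_hh(node):
            for m in node.getElementsByTagName('M'):
                if is_valid_member(m):
                    selected.append(m.getAttribute('age'))

print(selected)  # only members aged 18-49 from high-income households

Because only one expanded <HH> subtree is held in memory at a time (the minidom nodes for skipped households are never expanded), peak memory stays proportional to the largest household, not to the whole file.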
Source: https://stackoverflow.com/questions/47963080/python-xml-iterating-over-elements-takes-a-lot-of-memory