问题
I need to parse a very large (~40GB) XML file, remove certain elements from it, and write the result to a new xml file. I've been trying to use iterparse from python's ElementTree, but I'm confused about how to modify the tree and then write the resulting tree into a new XML file. I've read the documentation on itertree but it hasn't cleared things up. Are there any simple ways to do this?
Thank you!
EDIT: Here's what I have so far.
import xml.etree.ElementTree as ET
import re
date_pages = []
f=open('dates_texts.xml', 'w+')
tree = ET.iterparse("sample.xml")
for i, element in tree:
if element.tag == 'page':
for page_element in element:
if page_element.tag == 'revision':
for revision_element in page_element:
if revision_element.tag == '{text':
if len(re.findall('20\d\d', revision_element.text.encode('utf8'))) == 0:
element.clear()
回答1:
If you have a large xml that doesn't fit in memory then you could try to serialize it one element at a time. For example, assuming <root><page/><page/><page/>...</root>
document structure and ignoring possible namespace issues:
import xml.etree.cElementTree as etree
def getelements(filename_or_file, tag):
context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
_, root = next(context) # get root element
for event, elem in context:
if event == 'end' and elem.tag == tag:
yield elem
root.clear() # free memory
with open('output.xml', 'wb') as file:
# start root
file.write(b'<root>')
for page in getelements('sample.xml', 'page'):
if keep(page):
file.write(etree.tostring(page, encoding='utf-8'))
# close root
file.write(b'</root>')
where keep(page)
returns True
if page
should be kept e.g.:
import re
def keep(page):
# all <revision> elements must have 20xx in them
return all(re.search(r'20\d\d', rev.text)
for rev in page.iterfind('revision'))
For comparison, to modify a small xml file, you could:
# parse small xml
tree = etree.parse('sample.xml')
# remove some root/page elements from xml
root = tree.getroot()
for page in root.findall('page'):
if not keep(page):
root.remove(page) # modify inplace
# write to a file modified xml tree
tree.write('output.xml', encoding='utf-8')
回答2:
Perhaps the answer to my similar question can help you out.
As for how to write this back to an .xml file, I ended up doing this at the bottom of my script:
with open('File.xml', 'w') as t: # I'd suggest using a different file name here than your original
for line in ET.tostring(doc):
t.write(line)
t.close
print('File.xml Complete') # Console message that file wrote successfully, can be omitted
The variable doc
is from earlier on in my script, comparable to where you have tree = ET.iterparse("sample.xml")
I have this:
doc = ET.parse(filename)
I've been using lxml instead of ElementTree but I think the write out part should still work (I think it's mainly just xpath stuff that ElementTree can't handle.) I'm using lxml imported with this line:
from lxml import etree as ET
Hopefully this (along with my linked question for some additional code context if you need it) can help you out!
来源:https://stackoverflow.com/questions/15399904/using-python-elementtrees-itertree-function-and-writing-modified-tree-to-output