Question
EDIT: For anyone coming to this in the future: the solution I used was to switch to cElementTree. It not only runs with less memory, it is also significantly faster.
This works on files up to about 600 MB in size; anything larger and I run out of memory (I have a 16 GB machine). What can I do to read the file in pieces, or read in a certain percentage of the XML at a time? Or is there a less memory-intensive approach?
import csv
import xml.etree.ElementTree as ET
import time
import sys

def main(argv):
    start_time = time.time()
    file_name = argv
    # builds the entire tree in memory -- this is what blows up on large files
    root = ET.ElementTree(file=file_name).getroot()
    csv_file_name = '.'.join(file_name.split('.')[:-1]) + ".txt"
    print '\n'
    print 'Output file:'
    print csv_file_name
    with open(csv_file_name, 'w') as file_:
        writer = csv.writer(file_, delimiter="\t")
        header = [ <the names of the tags here> ]
        writer.writerow(header)
        tags = [
            <bunch of xml tags here>
        ]
        # write the values
        for index in range(3, len(root)):
            row = []
            for tag in tags:
                searchQuery = "tags" + tag
                node = root[index].find(searchQuery)
                if node is None or node.text is None:
                    row.append("")
                else:
                    row.append(node.text)
            writer.writerow(row)
    print '\nNumber of elements is: %s' % len(root)
    print '\nTotal run time: %s seconds' % (time.time() - start_time)

if __name__ == "__main__":
    main(sys.argv[1])
Answer 1:
Use ElementTree.iterparse to parse your XML data. See the documentation for help.
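A minimal sketch of that approach, assuming (as in the question's code) that each record is a direct child of the root; the tag name 'record' is hypothetical, substitute your actual row tag:

import xml.etree.ElementTree as ET

# Stream the file instead of loading the whole tree into memory.
for event, elem in ET.iterparse('large.xml', events=('end',)):
    if elem.tag == 'record':
        row = [child.text or '' for child in elem]
        # ... write the row out here ...
        elem.clear()  # free the element's contents so memory stays bounded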
Answer 2:
A few hints:
- use lxml, it is very performant
- use iterparse, which can process your document piece by piece
However, iterparse can surprise you and you might end up with high memory consumption. To overcome this, you have to clear references to already-processed items, as described in my favourite article about effective lxml usage.
Sample script fastiterparse.py using optimized iterparse.
Install docopt and lxml:
$ pip install lxml docopt
Write the script:
"""For all elements with given tag prints value of selected attribute
Usage:
fastiterparse.py <xmlfile> <tag> <attname>
fastiterparse.py -h
"""
from lxml import etree
from functools import partial
def fast_iter(context, func):
for event, elem in context:
func(elem)
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
del context
def printattname(elem, attname):
print elem.attrib[attname]
def main(fname, tag, attname):
fun = partial(printattname, attname=attname)
with open(fname) as f:
context = etree.iterparse(f, events=("end",), tag=tag)
fast_iter(context, fun)
if __name__ == "__main__":
from docopt import docopt
args = docopt(__doc__)
main(args["<xmlfile>"], args["<tag>"], args["<attname>"])
Try calling it:
$ python fastiterparse.py
Usage:
    fastiterparse.py <xmlfile> <tag> <attname>
    fastiterparse.py -h
Use it (on your file):
$ python fastiterparse.py large.xml ElaboratedRecord id
rec26872
rec25887
rec26873
rec26874
Conclusion (use the fast_iter approach)
The main takeaway is the fast_iter function (or at least remembering to clear unused elements, delete them, and finally delete the context).
Measurement can show that in some cases the script runs a bit slower than it would without clear and del, but the difference is not significant. The advantage comes the moment memory becomes the limitation: once an unoptimized version starts swapping, the optimized one becomes faster, and if one runs out of memory entirely, there are not many other options.
Answer 3:
Use cElementTree instead of ElementTree.
Replace your ET import statement with: import xml.etree.cElementTree as ET
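A minimal sketch of the swap, assuming a hypothetical input file 'large.xml'; cElementTree exposes the same API as ElementTree, so the rest of the question's code runs unchanged:

import xml.etree.cElementTree as ET  # C implementation: faster, less memory

file_name = 'large.xml'  # hypothetical input file
root = ET.ElementTree(file=file_name).getroot()
print('Number of elements is: %s' % len(root))

Note that on Python 3.3+ this module is deprecated: plain xml.etree.ElementTree uses the C implementation automatically, so this swap only matters on Python 2.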
Source: https://stackoverflow.com/questions/24126299/running-out-of-memory-using-python-elementtree