Question
EDIT: For anyone coming to this in the future: the solution I used was to switch to cElementTree. It not only runs with less memory, it is also significantly faster.
This works on files up to about 600 MB in size; anything larger and I run out of memory (I have a 16 GB machine). What can I do to read the file in pieces, or read in a certain percentage of the XML at a time? Or is there a less memory-intensive approach?
import csv
import xml.etree.ElementTree as ET
import time
import sys

def main(argv):
    start_time = time.time()
    file_name = argv
    # builds the entire tree in memory -- this is what blows up on large files
    root = ET.ElementTree(file=file_name).getroot()
    csv_file_name = '.'.join(file_name.split('.')[:-1]) + ".txt"
    print '\n'
    print 'Output file:'
    print csv_file_name
    with open(csv_file_name, 'w') as file_:
        writer = csv.writer(file_, delimiter="\t")
        header = [ <the names of the tags here> ]
        writer.writerow(header)
        tags = [
            <bunch of xml tags here>
        ]
        # write the values
        for index in range(3, len(root)):
            row = []
            for tag in tags:
                searchQuery = "tags" + tag
                node = root[index].find(searchQuery)
                if node is None or node.text is None:
                    row.append("")
                else:
                    row.append(node.text)
            writer.writerow(row)
    print '\nNumber of elements is: %s' % len(root)
    print '\nTotal run time: %s seconds' % (time.time() - start_time)

if __name__ == "__main__":
    main(sys.argv[1])
Answer 1:
Use ElementTree.iterparse to parse your XML data. See the documentation for help.
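A minimal sketch of that approach, assuming (as in the question's code) that each record is a direct child of the root; the tag name 'record' is hypothetical, substitute your actual row tag:

import xml.etree.ElementTree as ET

# Stream the file instead of loading the whole tree into memory.
for event, elem in ET.iterparse('large.xml', events=('end',)):
    if elem.tag == 'record':
        row = [child.text or '' for child in elem]
        # ... write the row out here ...
        elem.clear()  # free the element's contents so memory stays bounded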
Answer 2:
A few hints:
- use lxml, it is very performant
- use iterparse, which can process your document piece by piece
However, iterparse can surprise you and you might end up with high memory consumption. To overcome this, you have to clear references to already-processed items, as described in my favourite article about effective lxml usage.
Sample script fastiterparse.py using optimized iterparse.
Install docopt and lxml:
$ pip install lxml docopt
Write the script:
"""For all elements with given tag prints value of selected attribute
Usage:
fastiterparse.py <xmlfile> <tag> <attname>
fastiterparse.py -h
"""
from lxml import etree
from functools import partial
def fast_iter(context, func):
for event, elem in context:
func(elem)
elem.clear()
while elem.getprevious() is not None:
del elem.getparent()[0]
del context
def printattname(elem, attname):
print elem.attrib[attname]
def main(fname, tag, attname):
fun = partial(printattname, attname=attname)
with open(fname) as f:
context = etree.iterparse(f, events=("end",), tag=tag)
fast_iter(context, fun)
if __name__ == "__main__":
from docopt import docopt
args = docopt(__doc__)
main(args["<xmlfile>"], args["<tag>"], args["<attname>"])
Try calling it:
$ python fastiterparse.py
Usage:
    fastiterparse.py <xmlfile> <tag> <attname>
    fastiterparse.py -h
Use it (on your file):
$ python fastiterparse.py large.xml ElaboratedRecord id
rec26872
rec25887
rec26873
rec26874
Conclusion (use the fast_iter approach)
The main takeaway is the fast_iter function (or at least remembering to clear unused elements, delete them, and finally delete the context).
Measurement can show that in some cases the script runs a bit slower than it would without clear and del, but the difference is not significant. The advantage comes the moment memory becomes the limitation: once an unoptimized version starts swapping, the optimized one becomes faster, and if one runs out of memory entirely, there are not many other options.
Answer 3:
Use cElementTree instead of ElementTree.
Replace your ET import statement with: import xml.etree.cElementTree as ET
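A minimal sketch of the swap, assuming a hypothetical input file 'large.xml'; cElementTree exposes the same API as ElementTree, so the rest of the question's code runs unchanged:

import xml.etree.cElementTree as ET  # C implementation: faster, less memory

file_name = 'large.xml'  # hypothetical input file
root = ET.ElementTree(file=file_name).getroot()
print('Number of elements is: %s' % len(root))

Note that on Python 3.3+ this module is deprecated: plain xml.etree.ElementTree uses the C implementation automatically, so this swap only matters on Python 2.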
Source: https://stackoverflow.com/questions/24126299/running-out-of-memory-using-python-elementtree