Efficient way of XML parsing in ElementTree(1.3.0) Python

Posted by 淺唱寂寞╮ on 2019-11-30 14:47:10

Here's a script that parses one million <instrumentConfiguration/> elements (a 967MB file) in 40 seconds (on my machine) without consuming a large amount of memory.

The throughput is 24MB/s. The cElementTree page (2005) reports 47MB/s.

#!/usr/bin/env python
from itertools import imap, islice, izip
from operator  import itemgetter
from xml.etree import cElementTree as etree

def parsexml(filename):
    it = imap(itemgetter(1),
              iter(etree.iterparse(filename, events=('start',))))
    root = next(it) # get root element
    for elem in it:
        if elem.tag == '{http://psi.hupo.org/ms/mzml}instrumentConfiguration':
            values = [('Id', elem.get('id')),
                      ('Parameter1', next(it).get('name'))] # cvParam
            componentList_count = int(next(it).get('count'))
            for parent, child in islice(izip(it, it), componentList_count):
                key = parent.tag.partition('}')[2]
                value = child.get('name')
                assert child.tag.endswith('cvParam')
                values.append((key, value))
            yield values
            root.clear() # preserve memory

def print_values(it):
    for line in (': '.join(val) for conf in it for val in conf):
        print(line)

if __name__ == '__main__':
    import sys
    print_values(parsexml(sys.argv[1])) # path to the mzML file

Output

$ /usr/bin/time python parse_mxml.py
Id: QTOF
Parameter1: Q-Tof ultima
source: nanoelectrospray
analyzer: quadrupole
analyzer: time-of-flight
detector: microchannel plate detector
38.51user 1.16system 0:40.09elapsed 98%CPU (0avgtext+0avgdata 23360maxresident)k
1984784inputs+0outputs (2major+1634minor)pagefaults 0swaps

Note: The code is fragile: it assumes that the first two children of <instrumentConfiguration/> are <cvParam/> and <componentList/>, and that all values are available as tag names or attributes.
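
One way to reduce that fragility, as a rough sketch only: wait for the 'end' event of each <instrumentConfiguration/>, when its whole subtree has been parsed, and look the children up by tag name instead of by position. This sketch assumes Python 3 and xml.etree.ElementTree rather than the cElementTree import used above. SAMPLE is a made-up fragment shaped after the output shown above, not real mzML data, and parse_configs is a hypothetical name.

```python
import io
import xml.etree.ElementTree as etree

NS = '{http://psi.hupo.org/ms/mzml}'

# Hypothetical fragment, shaped after the output shown above.
SAMPLE = b"""<mzML xmlns="http://psi.hupo.org/ms/mzml">
  <instrumentConfigurationList count="1">
    <instrumentConfiguration id="QTOF">
      <cvParam name="Q-Tof ultima"/>
      <componentList count="4">
        <source><cvParam name="nanoelectrospray"/></source>
        <analyzer><cvParam name="quadrupole"/></analyzer>
        <analyzer><cvParam name="time-of-flight"/></analyzer>
        <detector><cvParam name="microchannel plate detector"/></detector>
      </componentList>
    </instrumentConfiguration>
  </instrumentConfigurationList>
</mzML>"""

def parse_configs(source):
    """Yield a list of (key, value) pairs per instrumentConfiguration."""
    for _, elem in etree.iterparse(source, events=('end',)):
        if elem.tag != NS + 'instrumentConfiguration':
            continue
        values = [('Id', elem.get('id'))]
        param = elem.find(NS + 'cvParam')  # matches direct children only
        if param is not None:
            values.append(('Parameter1', param.get('name')))
        components = elem.find(NS + 'componentList')
        if components is not None:
            for component in components:
                key = component.tag.partition('}')[2]  # strip the namespace
                for cv in component.findall(NS + 'cvParam'):
                    values.append((key, cv.get('name')))
        yield values
        elem.clear()  # drop the subtree that was just processed

configs = list(parse_configs(io.BytesIO(SAMPLE)))
for conf in configs:
    for key, value in conf:
        print('%s: %s' % (key, value))
```

On this sample it prints the same six lines as the output above. A missing <cvParam/> or <componentList/> is skipped rather than raising an error, at the cost of holding one full <instrumentConfiguration/> subtree in memory at a time.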

On performance

ElementTree 1.3 is ~6 times slower than cElementTree 1.0.6 in this case.

If you replace root.clear() by elem.clear() then the code is ~10% faster but uses ~10 times more memory. lxml.etree works with the elem.clear() variant; its performance is the same as cElementTree's, but it consumes 500MB, which is 20 times as much memory as cElementTree with root.clear() and 2 times as much as cElementTree with elem.clear().

If this is still a current issue, you might try pymzML, a Python interface to mzML files. Website: http://pymzml.github.com/

In this case I would use findall to find all the instrumentList elements. Then access the data in those results as if instrumentList and instrument were arrays: you get all the elements at once and don't have to search for each of them.
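
A minimal sketch of that idea, assuming Python 3 and a small in-memory document. The XML below is invented for illustration: only the instrumentList and instrument names come from the answer, and the ids are placeholders.

```python
import xml.etree.ElementTree as etree

# Hypothetical document; the structure and ids are for illustration only.
SAMPLE = """<run>
  <instrumentList>
    <instrument id="instrument-1"/>
    <instrument id="instrument-2"/>
  </instrumentList>
</run>"""

root = etree.fromstring(SAMPLE)

# findall returns a plain list of the matching child elements.
instrument_lists = root.findall('instrumentList')

# An element behaves like a sequence of its children, so indexing works.
first_instrument = instrument_lists[0][0]
print(first_instrument.get('id'))

# Iterating over an element walks its direct children.
ids = [instrument.get('id')
       for instrument_list in instrument_lists
       for instrument in instrument_list]
print(ids)
```

This loads the whole document into memory, so it suits files that fit comfortably in RAM.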

If your files are huge, have a look at the iterparse() function. Be sure to read this article by ElementTree's author, especially the part about "incremental parsing".
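
The incremental pattern can be sketched roughly as follows, assuming Python 3. iter_elements is a hypothetical helper name, and the <items>/<item> document is made up for the demonstration.

```python
import io
import xml.etree.ElementTree as etree

def iter_elements(source, tag):
    """Yield each complete <tag> element while keeping the tree small."""
    context = iter(etree.iterparse(source, events=('start', 'end')))
    _, root = next(context)  # the first event is the 'start' of the root
    for event, elem in context:
        if event == 'end' and elem.tag == tag:
            yield elem
            root.clear()  # detach everything parsed so far from the root

# Made-up input: five <item/> elements under one root.
SAMPLE = ('<items>'
          + ''.join('<item n="%d"/>' % i for i in range(5))
          + '</items>').encode('utf-8')

numbers = [int(item.get('n'))
           for item in iter_elements(io.BytesIO(SAMPLE), 'item')]
print(numbers)
```

The caller must finish with each element before asking for the next one, because root.clear() detaches it from the tree as soon as the generator resumes.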

I know that this is old, but I ran into this issue while parsing XML files that were really large.

J.F. Sebastian's answer is indeed correct, but the following issue came up.

What I noticed is that the values in elem.text (if your values are inside XML elements and not in attributes) are sometimes not read correctly (sometimes None is returned) if you iterate through the 'start' events. I had to iterate through the 'end' events like this:

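# The first 'start' event carries the root element; elem.text is only
# guaranteed to be complete once the element's 'end' event arrives.
events = iter(etree.iterparse(filename, events=('start', 'end')))
root = next(events)[1] # get root element
it = imap(itemgetter(1),
          (pair for pair in events if pair[0] == 'end'))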

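If you want the text inside an XML tag (and not an attribute), you should iterate through the 'end' events and not 'start'. At a 'start' event the parser has only seen the opening tag, so the attributes are available but the text may not have been read yet.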

However, if all the values are in attributes, then the code in J.F. Sebastian's answer is more correct.

XML example for my case:

<data>
<country>
    <name>Liechtenstein</name>
    <rank>1</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
</country>
<country>
    <name>Singapore</name>
    <rank>4</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
</country>
<country>
    <name>Panama</name>
    <rank>68</rank>
    <year>2011</year>
    <gdppc>13600</gdppc>
</country>
</data>
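
For XML of this shape, reading the text on 'end' events might look like the sketch below. It assumes Python 3 and uses a well-formed copy of the first two records; read_countries is a hypothetical name.

```python
import io
import xml.etree.ElementTree as etree

# Well-formed copy of the first two records from the example.
SAMPLE = b"""<data>
<country>
    <name>Liechtenstein</name>
    <rank>1</rank>
    <year>2008</year>
    <gdppc>141100</gdppc>
</country>
<country>
    <name>Singapore</name>
    <rank>4</rank>
    <year>2011</year>
    <gdppc>59900</gdppc>
</country>
</data>"""

def read_countries(source):
    """Yield one {tag: text} dict per <country> element."""
    for _, elem in etree.iterparse(source, events=('end',)):
        if elem.tag == 'country':
            # Every child has already ended, so its text is complete.
            yield {child.tag: child.text for child in elem}
            elem.clear()  # free the children that were just read

countries = list(read_countries(io.BytesIO(SAMPLE)))
for country in countries:
    print(country['name'], country['gdppc'])
```

All values come back as strings, so convert rank, year and gdppc with int() if you need numbers.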
