Iteratively parse a large XML file without using the DOM approach

Question

I have an XML file:

<temp>
  <email id="1" Body="abc"/>
  <email id="2" Body="fre"/>
  .
  .
  <email id="998349883487454359203" Body="hi"/>
</temp>

I want to read the XML file one email tag at a time: read email id=1 and extract its Body, then read email id=2 and extract its Body, and so on.

I tried to do this with DOM-style XML parsing, but since my file is 100 GB, that approach does not work. I then tried:

  from xml.etree import ElementTree as ET
  tree=ET.parse('myfile.xml')
  root=ET.parse('myfile.xml').getroot()
  for i in root.findall('email/'):
              print i.get('Body')

Once I have the root, I don't understand why my code is not able to parse the file.

When I use iterparse instead, the code throws the following error:

 "UnicodeEncodeError: 'ascii' codec can't encode character u'\u20ac' in position 437: ordinal not in range(128)"

Can somebody help?

Answer 1:

An example using iterparse:

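import cStringIO
from xml.etree.ElementTree import iterparse

# In-memory stand-in for the real file (Python 2 code, like the question).
fakefile = cStringIO.StringIO("""<temp>
  <email id="1" Body="abc"/>
  <email id="2" Body="fre"/>
  <email id="998349883487454359203" Body="hi"/>
</temp>
""")
# By default iterparse reports only "end" events, one per closed element.
for _, elem in iterparse(fakefile):
    if elem.tag == 'email':
        print elem.attrib['id'], elem.attrib['Body']
    # Free the element's attributes and children once it has been handled.
    elem.clear()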

Just replace fakefile with your real file (iterparse accepts either a filename or an open file object). Also read this for further details.
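For a file as large as 100 GB, two details are worth sketching on top of the example above. First, elem.clear() empties each element but leaves the emptied node attached to the root, so the root's child list still grows; a common pattern is to grab the root from the first "start" event and clear it as you go. Second, the UnicodeEncodeError in the question most likely comes from printing a unicode string (here containing the euro sign, u'\u20ac') to an ASCII output stream in Python 2, so writing to a file opened with an explicit UTF-8 encoding should avoid it. The following is a minimal Python 3 sketch under those assumptions; the function names iter_email_bodies and write_bodies are hypothetical and not part of any library.

```python
import io
from xml.etree.ElementTree import iterparse


def iter_email_bodies(source):
    """Yield (id, Body) for each <email> element, keeping memory use bounded.

    `source` is a filename or a binary file object.
    """
    context = iterparse(source, events=("start", "end"))
    # The first event is the "start" of the root element (<temp> here).
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == "email":
            yield elem.get("id"), elem.get("Body")
            # Drop everything parsed so far from the root; clearing only
            # `elem` would leave empty nodes attached to the root.
            root.clear()


def write_bodies(source, out_path):
    """Write one Body per line as UTF-8 text and return how many were written.

    Assumes every <email> element carries a Body attribute.
    """
    count = 0
    with io.open(out_path, "w", encoding="utf-8") as out:
        for _, body in iter_email_bodies(source):
            out.write(body + "\n")
            count += 1
    return count
```

With this, something like write_bodies('myfile.xml', 'bodies.txt') would stream through the file without building the whole tree. If you stay on Python 2 and keep using print, encoding each value explicitly, for example body.encode('utf-8'), is the usual workaround.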

Source: https://stackoverflow.com/questions/10040444/iteratively-parse-a-large-xml-file-without-using-the-dom-approach

Tags

python

xml

xml-parsing

lxml