Python has several ways to parse XML...
I understand the very basics of parsing with SAX. It functions as a stream parser, with an event-driven API
Link.
Python supplies a full, W3C-standard implementation of XML DOM (xml.dom) and a minimal one, xml.dom.minidom. This latter one is simpler and smaller than the full implementation. However, from a "parsing perspective", it has all the pros and cons of the standard DOM - i.e. it loads everything in memory.
Considering a basic XML file:
A1
T1
A2
T2
A possible Python parser using minidom is:
import os
from xml.dom import minidom
from xml.parsers.expat import ExpatError
#-------- Select the XML file: --------#
#Current file name and directory:
curpath = os.path.dirname( os.path.realpath(__file__) )
filename = os.path.join(curpath, "sample.xml")
#print "Filename: %s" % (filename)
#-------- Parse the XML file: --------#
try:
#Parse the given XML file:
xmldoc = minidom.parse(filepath)
except ExpatError as e:
print "[XML] Error (line %d): %d" % (e.lineno, e.code)
print "[XML] Offset: %d" % (e.offset)
raise e
except IOError as e:
print "[IO] I/O Error %d: %s" % (e.errno, e.strerror)
raise e
else:
catalog = xmldoc.documentElement
books = catalog.getElementsByTagName("book")
for book in books:
print book.getAttribute('isdn')
print book.getElementsByTagName('author')[0].firstChild.data
print book.getElementsByTagName('title')[0].firstChild.data
Note that xml.parsers.expat is a Python interface to the Expat non-validating XML parser (docs.python.org/2/library/pyexpat.html).
The xml.dom package supplies also the exception class DOMException, but it is not supperted in minidom!
Link.
ElementTree is much easier to use and it requires less memory than XML DOM. Furthermore, a C implementation is available (xml.etree.cElementTree).
A possible Python parser using ElementTree is:
import os
from xml.etree import cElementTree # C implementation of xml.etree.ElementTree
from xml.parsers.expat import ExpatError # XML formatting errors
#-------- Select the XML file: --------#
#Current file name and directory:
curpath = os.path.dirname( os.path.realpath(__file__) )
filename = os.path.join(curpath, "sample.xml")
#print "Filename: %s" % (filename)
#-------- Parse the XML file: --------#
try:
#Parse the given XML file:
tree = cElementTree.parse(filename)
except ExpatError as e:
print "[XML] Error (line %d): %d" % (e.lineno, e.code)
print "[XML] Offset: %d" % (e.offset)
raise e
except IOError as e:
print "[XML] I/O Error %d: %s" % (e.errno, e.strerror)
raise e
else:
catalogue = tree.getroot()
for book in catalogue:
print book.attrib.get("isdn")
print book.find('author').text
print book.find('title').text