My input file is actually multiple XML files appended into one file (it's from Google Patents). It has the structure below:
I'd opt for parsing each chunk of XML separately.
You seem to already be doing that in your sample code. Here's my take on your code:
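    from xml.dom import minidom

    def parse_xml_buffer(buffer):
        dom = minidom.parseString("".join(buffer))  # join list into string of XML
        # .... parse dom ...

    buffer = [file.readline()]  # initialise with the first line
    for line in file:
        if line.startswith("<?xml"):  # assumes each chunk begins with an XML declaration
            parse_xml_buffer(buffer)
            buffer = []  # reset for the next chunk
        buffer.append(line)
    parse_xml_buffer(buffer)  # parse the final chunk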
Once you've broken the file down to individual XML blocks, how you actually do the parsing depends on your requirements and, to some extent, your preference. Options are lxml, minidom, elementtree, expat, BeautifulSoup, etc.
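To illustrate that the choice of parser is mostly a matter of taste, here is a minimal sketch that reads the same chunk with two standard-library parsers. The sample chunk, the `doc-number` element name, and both helper functions are assumptions made up for the illustration; the real patent files will have a much richer structure.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical chunk: one complete XML document held as a string.
chunk = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<patent><doc-number>D0629996</doc-number></patent>\n"
)


def first_text_minidom(xml_string, tag):
    """Return the text of the first <tag> element, using minidom."""
    dom = minidom.parseString(xml_string)
    node = dom.getElementsByTagName(tag)[0]
    return node.firstChild.data


def first_text_etree(xml_string, tag):
    """Return the text of the first <tag> element, using ElementTree."""
    root = ET.fromstring(xml_string)
    return root.find(".//" + tag).text


print(first_text_minidom(chunk, "doc-number"))
print(first_text_etree(chunk, "doc-number"))
```

Both helpers take the whole chunk as a string, so either one can be dropped in once the file has been split.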
Starting from scratch, here's how I would do it (using BeautifulSoup):
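    #!/usr/bin/env python
    from BeautifulSoup import BeautifulSoup

    def separated_xml(infile):
        file = open(infile, "r")
        buffer = [file.readline()]
        for line in file:
            if line.startswith("<?xml"):  # assumes each chunk begins with an XML declaration
                yield "".join(buffer)  # hand back the completed chunk
                buffer = []
            buffer.append(line)
        yield "".join(buffer)  # final chunk
        file.close()
    # each yielded string can then be passed to BeautifulSoup(...)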
This returns:
D0629996
29316765
D471343
D475175
6715152
D498899
D558952
D571528
D577177
D584027
.... (lots more)...
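For anyone on Python 3 who would rather avoid a third-party dependency, the same split-then-parse idea can be sketched with the standard library alone. This is an illustration under assumptions, not the code that produced the listing above: the `doc-number` element name, the `doc_numbers` helper, and the two-document sample input are hypothetical, and the splitter assumes every document starts with an XML declaration at the beginning of a line.

```python
import io
import xml.etree.ElementTree as ET


def separated_xml(lines):
    """Yield each concatenated XML document as a single string."""
    buffer = []
    for line in lines:
        # A new XML declaration marks the start of the next document.
        if line.startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)  # final document


def doc_numbers(lines, tag="doc-number"):
    """Yield the text of every <tag> element across all documents."""
    for xml_string in separated_xml(lines):
        root = ET.fromstring(xml_string)
        for element in root.iter(tag):
            yield element.text


# Hypothetical input: two tiny documents appended into one stream.
sample = io.StringIO(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<patent><doc-number>D0629996</doc-number></patent>\n"
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<patent><doc-number>29316765</doc-number></patent>\n"
)
numbers = list(doc_numbers(sample))
print(numbers)
```

Because `separated_xml` accepts any iterable of lines, an open file object can be passed in place of the `StringIO` sample, and only one document is held in memory at a time.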