Question
I am working with potentially huge XML files containing complex trace information from one of my projects.
I would like to build indexes for those XML files so that one can quickly find subsections of the XML document without having to load it all into memory.
If I had a "shelve" index containing information such as "books for author Joe are at offsets [22322, 35446, 54545]", I could open the XML file like a regular text file, seek to those offsets, and hand the text found there to one of the DOM parsers that take a file or a string.
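A minimal sketch of that lookup side, assuming Python 3 and a file opened in binary mode. The names `parse_element_at`, `lookup` and `demo` are made up for illustration, and the offsets in `demo` are found with `bytes.find` only as a stand-in for a real indexing pass. Since the end of the element is not known in advance, the sketch feeds chunks to `xml.etree.ElementTree.XMLPullParser` and stops as soon as the element that starts at the offset is closed.

```python
import io
import os
import shelve
import tempfile
import xml.etree.ElementTree as ET


def parse_element_at(f, offset, chunk_size=4096):
    """Parse the single element whose '<' sits at byte `offset` of binary file `f`."""
    f.seek(offset)
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            raise ValueError("file ended before the element was closed")
        parser.feed(chunk)
        # Events queued before any trailing-content error are delivered first,
        # so we can return as soon as the element opened at `offset` closes.
        for event, elem in parser.read_events():
            depth += 1 if event == "start" else -1
            if depth == 0:
                return elem


def lookup(index_path, xml_file, key):
    """Return the parsed elements whose offsets are stored under `key`."""
    with shelve.open(index_path) as index:
        offsets = index.get(key, [])
    return [parse_element_at(xml_file, off) for off in offsets]


def demo():
    xml_bytes = (b"<library>\n"
                 b"  <book author='Joe'><title>First</title></book>\n"
                 b"  <book author='Ann'><title>Second</title></book>\n"
                 b"  <book author='Joe'><title>Third</title></book>\n"
                 b"</library>\n")
    # Stand-in for a real indexing pass: locate Joe's books with bytes.find.
    needle = b"<book author='Joe'"
    offsets, pos = [], xml_bytes.find(needle)
    while pos != -1:
        offsets.append(pos)
        pos = xml_bytes.find(needle, pos + 1)

    with tempfile.TemporaryDirectory() as tmp:
        index_path = os.path.join(tmp, "books.idx")
        with shelve.open(index_path) as index:
            index["author:Joe"] = offsets
        books = lookup(index_path, io.BytesIO(xml_bytes), "author:Joe")
    return [book.findtext("title") for book in books]


if __name__ == "__main__":
    print(demo())
```

Only the bytes from each offset up to the end of the chunk that closes the element are read, so memory use depends on the size of the subsection rather than on the size of the file.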
The part that I have not figured out yet is how to quickly parse the XML and create such an index.
So what I need is a fast SAX parser that lets me find the start offset of tags in the file together with the start events. Then I can parse a subsection of the XML together with its starting point in the document, extract the key information, and store the key and offset in the shelve index.
Answer 1:
Since SAX locators return line and column numbers rather than an offset, you need a small wrapper that tracks line ends. A simplified example (it could have some off-by-one errors ;-):
# Python 2 code (cStringIO, print statement).
import cStringIO
import re
from xml import sax
from xml.sax import handler

relinend = re.compile(r'\n')

txt = '''<foo>
<tit>Bar</tit>
<baz>whatever</baz>
</foo>'''
stm = cStringIO.StringIO(txt)

class LocatingWrapper(object):
    def __init__(self, f):
        self.f = f
        self.linelocs = [0]  # offset at which each line starts; line 1 starts at 0
        self.curoffs = 0     # number of characters read so far

    def read(self, *a):
        data = self.f.read(*a)
        linends = (m.start() for m in relinend.finditer(data))
        # The line after each '\n' starts one character past it.
        self.linelocs.extend(x + 1 + self.curoffs for x in linends)
        self.curoffs += len(data)
        return data

    def where(self, loc):
        # Line numbers are 1-based and column numbers are 0-based.
        return self.linelocs[loc.getLineNumber() - 1] + loc.getColumnNumber()

locstm = LocatingWrapper(stm)

class Handler(handler.ContentHandler):
    def setDocumentLocator(self, loc):
        self.loc = loc

    def startElement(self, name, attrs):
        print '%s@%s:%s (%s)' % (name,
                                 self.loc.getLineNumber(),
                                 self.loc.getColumnNumber(),
                                 locstm.where(self.loc))

sax.parse(locstm, Handler())
Of course you don't need to keep all of the linelocs around -- to save memory, you can drop "old" ones (below the latest one queried) but then you need to make linelocs a dict, etc.
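A rough sketch of that memory-saving variant, written for Python 3 rather than the Python 2 of the example above. The names `PruningLocatingWrapper`, `OffsetHandler`, `linestarts` and `oldest` are illustrative, not part of any library. It assumes a binary stream, `\n` or `\r\n` line ends, and content where the column number reported by the locator equals a byte count (true for plain ASCII). A `close` method is included because Python 3's expat-based reader closes its input source when parsing ends.

```python
import io
from xml import sax
from xml.sax import handler


class PruningLocatingWrapper:
    """Binary file wrapper that maps SAX line/column positions to byte offsets."""

    def __init__(self, f):
        self.f = f
        self.linestarts = {1: 0}  # 1-based line number -> offset of its first byte
        self.nextline = 2         # line number that the next '\n' will open
        self.oldest = 1           # smallest line number still kept
        self.curoffs = 0          # bytes handed to the parser so far

    def read(self, *a):
        data = self.f.read(*a)
        pos = data.find(b"\n")
        while pos != -1:
            self.linestarts[self.nextline] = self.curoffs + pos + 1
            self.nextline += 1
            pos = data.find(b"\n", pos + 1)
        self.curoffs += len(data)
        return data

    def close(self):
        # Called by the SAX reader once the document has been parsed.
        self.f.close()

    def where(self, loc):
        line = loc.getLineNumber()
        # SAX events only move forward, so earlier lines can be forgotten.
        for old in range(self.oldest, line):
            self.linestarts.pop(old, None)
        self.oldest = max(self.oldest, line)
        return self.linestarts[line] + loc.getColumnNumber()


class OffsetHandler(handler.ContentHandler):
    """Collects (tag name, byte offset of its '<') for every start tag."""

    def __init__(self, locstm):
        super().__init__()
        self.locstm = locstm
        self.offsets = []

    def setDocumentLocator(self, loc):
        self.loc = loc

    def startElement(self, name, attrs):
        self.offsets.append((name, self.locstm.where(self.loc)))


if __name__ == "__main__":
    txt = b"<foo>\n  <tit>Bar</tit>\n<baz>whatever</baz>\n</foo>"
    locstm = PruningLocatingWrapper(io.BytesIO(txt))
    h = OffsetHandler(locstm)
    sax.parse(locstm, h)
    print(h.offsets)
```

The dict only ever holds the lines between the last queried position and the end of the data read so far, so its size is bounded by the parser's read buffer rather than by the length of the document.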
Source: https://stackoverflow.com/questions/3187964/is-there-a-fast-xml-parser-in-python-that-allows-me-to-get-start-of-tag-as-byte