Is there a fast XML parser in Python that allows me to get start of tag as byte offset in stream?

安稳与你 提交于 2019-12-21 18:28:22

问题


I am working with potentially huge XML files containing complex trace information from on of my projects.

I would like to build indexes for those XML files so that one can quickly find sub sections of the XML document without having to load it all into memory.

If I have created a "shelve" index that could contains information like "books for author Joe" are at offsets [22322, 35446, 54545] then I can just open the xml file like a regular text file and seek to those offsets and then had that to one of the DOM parser that takes a file or strings.

The part that I have not figured out yet is how to quickly parse the XML and create such an index.

So what I need as a fast SAX parser that allows me to find the start offset of tags in the file together with the start events. So I can parse a subsection of the XML together with the starting point into the document, extract the key information and store the key and offset in the shelve index.


回答1:


Since locators return line and column numbers in lieu of offset, you need a little wrapping to track line ends -- a simplified example (could have some offbyones;-)...:

import cStringIO
import re
from xml import sax
from xml.sax import handler

relinend = re.compile(r'\n')

txt = '''<foo>
            <tit>Bar</tit>
        <baz>whatever</baz>
     </foo>'''
stm = cStringIO.StringIO(txt)

class LocatingWrapper(object):
    def __init__(self, f):
        self.f = f
        self.linelocs = []
        self.curoffs = 0

    def read(self, *a):
        data = self.f.read(*a)
        linends = (m.start() for m in relinend.finditer(data))
        self.linelocs.extend(x + self.curoffs for x in linends)
        self.curoffs += len(data)
        return data

    def where(self, loc):
        return self.linelocs[loc.getLineNumber() - 1] + loc.getColumnNumber()

locstm = LocatingWrapper(stm)

class Handler(handler.ContentHandler):
    def setDocumentLocator(self, loc):
        self.loc = loc
    def startElement(self, name, attrs):
        print '%s@%s:%s (%s)' % (name, 
                                 self.loc.getLineNumber(),
                                 self.loc.getColumnNumber(),
                                 locstm.where(self.loc))

sax.parse(locstm, Handler())

Of course you don't need to keep all of the linelocs around -- to save memory, you can drop "old" ones (below the latest one queried) but then you need to make linelocs a dict, etc.



来源:https://stackoverflow.com/questions/3187964/is-there-a-fast-xml-parser-in-python-that-allows-me-to-get-start-of-tag-as-byte

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!