xml.sax parser and line numbers etc

Submitted by 心已入冬 on 2020-01-23 10:38:29

Question


The task is to parse a simple XML document, and analyze the contents by line number.

The right Python package seems to be xml.sax. But how do I use it?

After some digging in the documentation, I found:

  • The xmlreader.Locator interface has the information: getLineNumber().
  • The handler.ContentHandler interface has setDocumentLocator().

The first thought would be to create a Locator, pass this to the ContentHandler, and read the information off the Locator during calls to its characters() method, etc.

BUT, xmlreader.Locator is only a skeleton interface: its getLineNumber() and getColumnNumber() methods just return -1 (and its ID getters return None). So as a poor user, WHAT am I to do, short of writing a whole Parser and Locator of my own??
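The skeleton behaviour is easy to confirm. A minimal sketch (Python 3), instantiating the base class directly, which is not something a real program would normally do:

```python
from xml.sax.xmlreader import Locator

# The base class is only an interface; it carries no position information.
loc = Locator()

print(loc.getLineNumber(), loc.getColumnNumber())  # -1 -1
print(loc.getPublicId(), loc.getSystemId())        # None None
```

So a usable locator has to come from a concrete parser, not from this class.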

I'll answer my own question presently.


(Well I would have, except for the arbitrary, annoying rule that says I can't.)


I was unable to figure this out using the existing documentation (or by web searches), and was forced to read the source code for xml.sax (under /usr/lib/python2.7/xml/sax/ on my system).

The xml.sax function make_parser() by default creates a real Parser, but what kind of thing is that?
In the source code one finds that it is an ExpatParser, defined in expatreader.py. And...it has its own Locator, an ExpatLocator. But, there is no access to this thing. Much head-scratching came between this and a solution.

  1. Write your own ContentHandler, which knows about a Locator and uses it to determine line numbers.
  2. Create an ExpatParser with xml.sax.make_parser().
  3. Create an ExpatLocator, passing it the ExpatParser instance.
  4. Make the ContentHandler, giving it this ExpatLocator.
  5. Pass the ContentHandler to the parser's setContentHandler().
  6. Call parse() on the Parser.

For example:

from __future__ import print_function  # keeps print() working on Python 2.7 too

import sys
import xml.sax
import xml.sax.expatreader
import xml.sax.handler

class EltHandler( xml.sax.handler.ContentHandler ):
    def __init__( self, locator ):
        xml.sax.handler.ContentHandler.__init__( self )
        self.loc = locator
        self.setDocumentLocator( self.loc )

    def startElement( self, name, attrs ): pass

    def endElement( self, name ): pass

    def characters( self, data ):
        # Ask the locator where the parser currently is.
        lineNo = self.loc.getLineNumber()
        print( "LINE", lineNo, data )

def spit_lines( filepath ):
    try:
        parser = xml.sax.make_parser()
        # The ExpatLocator reads its position from the parser it wraps.
        locator = xml.sax.expatreader.ExpatLocator( parser )
        handler = EltHandler( locator )
        parser.setContentHandler( handler )
        parser.parse( filepath )
    except IOError as e:
        print( e, file=sys.stderr )

if len( sys.argv ) > 1:
    filepath = sys.argv[1]
    spit_lines( filepath )
else:
    print( "Try providing a path to an XML file.", file=sys.stderr )

Martijn Pieters points out below another approach with some advantages. If the superclass initializer of the ContentHandler is properly called, then it turns out a private-looking, undocumented member ._locator is set, which ought to contain a proper Locator.

Advantage: you don't have to create your own Locator (or find out how to create it). Disadvantage: it is not documented anywhere, and relying on an undocumented private variable is sloppy.

Thanks Martijn!


Answer 1:


The SAX parser itself is supposed to provide your content handler with a locator. The locator has to implement certain methods, but it can be any object as long as it has the right methods. The xml.sax.xmlreader.Locator class is the interface a locator is expected to implement; if the parser provides a locator object to your handler, you can count on its four methods (getColumnNumber(), getLineNumber(), getPublicId() and getSystemId()) being present.

The parser is only encouraged to set a locator; it is not required to do so. The expat XML parser does provide one.
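To see what the parser actually hands over, something like the following should do (Python 3). This is only a sketch: LocatorProbe is a made-up helper name, and the result depends on which parser xml.sax selects. On a stock CPython install that is expected to be the expat one.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class LocatorProbe(ContentHandler):
    """Hypothetical helper: remembers whatever locator the parser passes in."""

    def __init__(self):
        super().__init__()
        self.seen = None

    def setDocumentLocator(self, locator):
        # Called by the parser (if it supports locators) before startDocument().
        self.seen = locator

probe = LocatorProbe()
xml.sax.parseString(b"<root/>", probe)

# Show which concrete locator class the parser supplied, if any.
print(type(probe.seen).__name__)
```

If probe.seen is still None afterwards, the parser in use did not supply a locator and no line numbers are available from it.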

If you subclass xml.sax.handler.ContentHandler then it'll provide a standard setDocumentLocator() method for you, and by the time .startDocument() on the handler is called your content handler instance will have self._locator set:

from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        # initialize your handler

    def startElement(self, name, attrs):
        loc = self._locator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print('start of {} element at line {}, column {}'.format(name, line, col))



Answer 2:


This is an old question, but I think that there is a better answer to it than the one given, so I'm going to add another answer anyway.

While there may indeed be an undocumented private data member named _locator in the ContentHandler superclass, as described in the above answer by Martijn, accessing location information using this data member does not appear to me to be the intended use of the location facilities.

In my opinion, Steve White raises good questions about why this member is not documented. I think the answer to those questions is that it was probably not intended to be for public use. It appears to be a private implementation detail of the ContentHandler superclass. Since it is an undocumented private implementation detail, it could disappear without warning with any future release of the SAX library, so relying on it could be dangerous.

It appears to me, from reading the documentation for the ContentHandler class, and specifically the documentation for ContentHandler.setDocumentLocator, that the designers intended for users to instead override the ContentHandler.setDocumentLocator function so that when the parser calls it, the user's content handler subclass can save a reference to the passed-in locator object (which was created by the SAX parser), and can later use that saved object to get location information. For example:

from xml.sax.handler import ContentHandler

class MyContentHandler(ContentHandler):
    def __init__(self):
        super().__init__()
        self._mylocator = None
        # initialize your handler

    def setDocumentLocator(self, locator):
        # The parser calls this with its own locator before parsing starts.
        self._mylocator = locator

    def startElement(self, name, attrs):
        loc = self._mylocator
        if loc is not None:
            line, col = loc.getLineNumber(), loc.getColumnNumber()
        else:
            line, col = 'unknown', 'unknown'
        print('start of {} element at line {}, column {}'.format(name, line, col))

With this approach, there is no need to rely on undocumented fields.
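For the original task (analyzing contents by line number), the same pattern could be wired up roughly as follows. This is a sketch under assumptions, not a definitive implementation: LineRecorder, its events list and the SAMPLE document are invented for illustration, and it targets Python 3.

```python
import xml.sax
from xml.sax.handler import ContentHandler

class LineRecorder(ContentHandler):
    """Hypothetical handler that records (line, kind, payload) tuples."""

    def __init__(self):
        super().__init__()
        self._mylocator = None
        self.events = []

    def setDocumentLocator(self, locator):
        # Keep the parser-supplied locator for later position queries.
        self._mylocator = locator

    def _line(self):
        # Fall back to None when the parser gave us no locator.
        if self._mylocator is None:
            return None
        return self._mylocator.getLineNumber()

    def startElement(self, name, attrs):
        self.events.append((self._line(), "start", name))

    def characters(self, content):
        text = content.strip()
        if text:  # skip whitespace-only runs between elements
            self.events.append((self._line(), "text", text))

# Illustrative input only; any XML file or byte string would do.
SAMPLE = b"<root>\n  <item>first</item>\n  <item>second</item>\n</root>\n"

recorder = LineRecorder()
xml.sax.parseString(SAMPLE, recorder)
for event in recorder.events:
    print(event)
```

Collecting the events in a list rather than printing inside the callbacks keeps the handler easy to test and leaves the analysis step to the caller. Note that a parser may deliver one run of text in several characters() calls, so longer text may show up as more than one entry.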



Source: https://stackoverflow.com/questions/15477363/xml-sax-parser-and-line-numbers-etc
