Non-blocking method for parsing (streaming) XML in Python

Submitted by *爱你&永不变心* on 2019-11-29 04:23:56

Diving into the iterparse source provided the solution for me. Here's a simple example of building an XML tree on the fly and processing elements after their close tags:

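import xml.etree.ElementTree as etree

# Note: XMLTreeBuilder and the private _parser/_end attributes come from the
# old pure-Python ElementTree (Python 2); this is not expected to work on
# current Python 3 releases.
parser = etree.XMLTreeBuilder()

def end_tag_event(tag):
    # Let the tree builder close the element, then process the finished node.
    node = parser._end(tag)
    print(node)

# Hook the underlying expat parser so every closing tag calls end_tag_event.
parser._parser.EndElementHandler = end_tag_event

def data_received(data):
    # Feed each chunk as it arrives; the parser keeps its state between calls.
    parser.feed(data)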

In my case I ended up feeding it data from Twisted, but it should also work with a non-blocking socket.
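For readers on Python 3.4 or later, the same feed-as-data-arrives idea can probably be had without touching private attributes, using the standard library's XMLPullParser. This is a minimal sketch, not the original author's code: the chunk contents and the finish() helper are made up for illustration, and data_received is reused only to mirror the callback name above.

```python
import xml.etree.ElementTree as ET

# Ask only for "end" events, i.e. elements whose closing tag has been parsed.
parser = ET.XMLPullParser(events=("end",))

def data_received(data):
    """Feed one chunk and return the tags of elements completed by it."""
    parser.feed(data)
    return [elem.tag for _, elem in parser.read_events()]

def finish():
    """Hypothetical helper: signal end of input and collect leftover events."""
    parser.close()
    return [elem.tag for _, elem in parser.read_events()]

# Illustrative chunks; they may split the document anywhere, even mid-tag.
chunks = [b"<root><item>a</it", b"em><item>b</item>", b"</root>"]

completed = []
for chunk in chunks:
    completed.extend(data_received(chunk))
completed.extend(finish())
print(completed)
```

Neither feed() nor read_events() waits for more input, so data_received can be called from whatever callback delivers bytes. Exactly which chunk triggers which event can depend on the underlying expat version, so the sketch only relies on the order of the events.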

I think there are two components to this: the non-blocking network I/O, and a stream-oriented XML parser.

For the former, you'd have to pick a non-blocking network framework, or roll your own solution for this. Twisted certainly would work, but I personally find inversion of control frameworks difficult to wrap my brain around. You would likely have to keep track of a lot of state in your callbacks to feed the parser. For this reason I tend to find Eventlet a bit easier to program to, and I think it would fit well in this situation.

Essentially it allows you to write your code as if you were using a blocking socket call (using an ordinary loop or a generator or whatever you like), except that you can spawn it into a separate coroutine (a "greenlet") that will automatically perform a cooperative yield when I/O operations would block, thus allowing other coroutines to run.

This makes using any stream-oriented parser trivial again, because the code is structured like an ordinary blocking call. It also means that many libraries that don't directly deal with sockets or other I/O (like the parser for instance) don't have to be specially modified to be non-blocking: if they block, Eventlet yields the coroutine.

Admittedly Eventlet is slightly magic, but I find it has a much easier learning curve than Twisted, and results in more straightforward code because you don't have to turn your logic "inside out" to fit the framework.
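To make the "structured like an ordinary blocking call" point concrete, here is a minimal sketch that uses only the standard library: a plain loop over iterparse reading from a socket. The assumption, not exercised here, is that under Eventlet the same function would run inside a greenlet and be handed a green socket (for instance one from eventlet.green.socket), so that the blocking reads turn into cooperative yields. The read_elements name and the socket pair are illustrative only.

```python
import socket
import xml.etree.ElementTree as ET

def read_elements(sock):
    """Yield each element once the parser has seen its closing tag."""
    # makefile() gives iterparse the file-like object it expects.
    with sock.makefile("rb") as stream:
        for _, elem in ET.iterparse(stream, events=("end",)):
            yield elem

# A local socket pair stands in for a real network connection.
writer, reader = socket.socketpair()
writer.sendall(b"<root><item>a</item><item>b</item></root>")
writer.close()  # closing the writer signals end of stream to the reader

tags = [elem.tag for elem in read_elements(reader)]
reader.close()
print(tags)
```

Note that iterparse reads its input in blocks, so on a slow connection elements may arrive in batches rather than one by one.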

If you don't want to use threads, you can use an event loop and poll non-blocking sockets.

asyncore is the standard library module for this kind of thing. Twisted is the go-to async library for Python, but it is complex and probably a bit heavyweight for your needs.
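As a rough sketch of the event-loop approach: asyncore has since been removed from the standard library (in Python 3.12), so the example below swaps in the selectors module together with XMLPullParser, both of which need Python 3.4 or later. The pump function, the 4096-byte read size and the socket pair are assumptions made for illustration.

```python
import selectors
import socket
import xml.etree.ElementTree as ET

def pump(sock, parser, on_element):
    """Read one chunk if available; return False once the stream has ended."""
    try:
        data = sock.recv(4096)
    except BlockingIOError:
        return True  # nothing to read yet, wait for the next readiness event
    if data:
        parser.feed(data)
    else:
        parser.close()  # empty read means the peer closed the connection
    for _, elem in parser.read_events():
        on_element(elem)
    return bool(data)

# A local socket pair stands in for a real network connection.
writer, reader = socket.socketpair()
reader.setblocking(False)
writer.sendall(b"<root><item>a</item><item>b</item></root>")
writer.close()

seen = []
parser = ET.XMLPullParser(events=("end",))
sel = selectors.DefaultSelector()
sel.register(reader, selectors.EVENT_READ)

running = True
while running:
    # select() returns only the sockets that can be read without blocking.
    for key, _ in sel.select(timeout=1.0):
        running = pump(key.fileobj, parser, lambda elem: seen.append(elem.tag))

sel.unregister(reader)
sel.close()
reader.close()
print(seen)
```

In a real program the loop would watch several sockets at once and keep one parser per connection.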

Alternatively, multiprocessing is the process-based alternative to threads, but it was only added in Python 2.6, and I assume you aren't running that.

One way or the other, I think you're going to have to use threads, extra processes or weave some equally complex async magic.
