Parse large XML with lxml

蓝咒 submitted on 2019-11-28 13:03:07

You are parsing a namespaced document, so there is no plain 'page' tag in it; an unqualified name like 'page' only matches elements that have no namespace.

You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page' element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns' element.
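To see the qualified names for yourself, a minimal sketch along these lines may help. The `SAMPLE` document is a hypothetical two-field stand-in for the real dump, and the variable names are illustrative only.

```python
from lxml import etree

# Hypothetical minimal sample; a real MediaWiki export has many more elements.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>MediaWiki:Category</title>
    <ns>0</ns>
  </page>
</mediawiki>"""

root = etree.fromstring(SAMPLE)
page = root[0]  # first child element of the root

# lxml reports tags in Clark notation: {namespace-uri}localname
print(page.tag)

# etree.QName splits a qualified tag into namespace and local name
qname = etree.QName(page)
print(qname.namespace)
print(qname.localname)

# An unqualified lookup only matches elements without a namespace
unqualified = root.find('page')
print(unqualified)
```

The last lookup comes back empty because the default `xmlns` declaration puts every element of the document into the export namespace.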

Many lxml methods let you pass a namespace map to make matching easier, but iterparse() is unfortunately not one of them.
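For the methods that do accept a namespace map, a sketch like the following shows the idea. The `mw` prefix is an arbitrary choice made for this example, and `SAMPLE` is again a hypothetical stand-in for the real dump.

```python
from lxml import etree

# Prefix-to-URI map; the 'mw' prefix is arbitrary and only used in the paths below.
NS = {'mw': 'http://www.mediawiki.org/xml/export-0.8/'}

# Hypothetical minimal sample document.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>MediaWiki:Category</title>
    <ns>0</ns>
  </page>
</mediawiki>"""

root = etree.fromstring(SAMPLE)

# find(), findtext() and xpath() all take a namespaces= mapping
page = root.find('mw:page', namespaces=NS)
title = page.findtext('mw:title', namespaces=NS)
ns_values = page.xpath('mw:ns/text()', namespaces=NS)

print(title)
print(ns_values)
```

The prefix in the map does not need to match anything in the document itself; only the namespace URI is compared.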

The following .iterparse() call does select the right page elements:

context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page')

but you'll still need .find() with fully qualified names to get the ns and title elements of each page, or xpath() calls to get the text directly:

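def process_element(elem):
    # Match children by local name so the namespace can be ignored;
    # the comparison is true if any ns text node equals 0.
    if elem.xpath("./*[local-name()='ns']/text()=0"):
        print(elem.xpath("./*[local-name()='title']/text()")[0])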

which, for your input example, prints:

>>> fast_iter(context, process_element)
MediaWiki:Category
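Putting the pieces together, here is a self-contained sketch that uses .find()-style lookups with qualified names instead of xpath(). Note that fast_iter is not defined in this answer; the version below is the commonly used clear-as-you-go pattern and is only an assumption about what your helper looks like. The `SAMPLE` data, the `titles` list and the constant names are likewise hypothetical.

```python
from io import BytesIO
from lxml import etree

MW_NS = 'http://www.mediawiki.org/xml/export-0.8/'
PAGE_TAG = '{%s}page' % MW_NS

# Hypothetical two-page sample; only the first page is in namespace 0.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>MediaWiki:Category</title><ns>0</ns></page>
  <page><title>Talk:Example</title><ns>1</ns></page>
</mediawiki>"""


def fast_iter(context, func):
    """Assumed helper: call func on each element, then release its memory."""
    for event, elem in context:
        func(elem)
        elem.clear()
        # Drop already-processed siblings that the parent still references
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    del context


titles = []


def process_element(elem):
    # findtext() needs the fully qualified child names here
    if elem.findtext('{%s}ns' % MW_NS) == '0':
        titles.append(elem.findtext('{%s}title' % MW_NS))


# iterparse() accepts a filename or a file-like object
context = etree.iterparse(BytesIO(SAMPLE), events=('end',), tag=PAGE_TAG)
fast_iter(context, process_element)
print(titles)
```

Collecting the titles in a list rather than printing inside the callback is just a convenience for this sketch; for a multi-gigabyte dump you would more likely write each result out as you go.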