Parse large XML with lxml

滥情空心 2020-12-11 13:00

I am trying to get my script working. So far it doesn't manage to output anything.

This is my test.xml



        
1 Answer
  • 2020-12-11 14:04

    You are parsing a namespaced document, so no plain 'page' tag is present: an unqualified name like that only matches tags without a namespace.

    You are instead looking for the '{http://www.mediawiki.org/xml/export-0.8/}page' element, which contains a '{http://www.mediawiki.org/xml/export-0.8/}ns' element.
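One way to confirm this is to print the tag of an element and see the qualified name lxml actually uses. The snippet below is a minimal sketch; the inline document is a hypothetical stand-in for test.xml, which is not shown in the question.

```python
from lxml import etree

# Hypothetical sample; only the default namespace matches the real export format.
SAMPLE = b'<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/"><page/></mediawiki>'

root = etree.fromstring(SAMPLE)
page = root[0]

# lxml reports tags in Clark notation: {namespace-uri}localname
print(page.tag)

# QName splits a qualified tag into its two parts.
qname = etree.QName(page)
print(qname.namespace, qname.localname)

# An unqualified search finds nothing, because 'page' has no namespace here.
print(root.find("page"))
```

The last line prints None, which is the same reason a plain 'page' match never fires during parsing.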

    Many lxml methods do let you specify a namespace map to make matching easier, but the iterparse() method is not one of them, unfortunately.
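For comparison, here is a rough sketch of what a namespace map looks like with methods that do accept one, such as .find() and .findtext(). The prefix "mw" and the inline sample are assumptions made for illustration; any prefix works as long as the URI matches.

```python
from lxml import etree

# Hypothetical sample mirroring the structure described above.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page>
    <title>MediaWiki:Category</title>
    <ns>0</ns>
  </page>
</mediawiki>"""

# "mw" is an arbitrary local alias for the export namespace.
NS = {"mw": "http://www.mediawiki.org/xml/export-0.8/"}

root = etree.fromstring(SAMPLE)
page = root.find("mw:page", namespaces=NS)
title = page.findtext("mw:title", namespaces=NS)
ns_value = page.findtext("mw:ns", namespaces=NS)
print(title, ns_value)
```

This only helps once an element is in hand; the tag argument of iterparse() still needs the full '{uri}page' form.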

    The following .iterparse() call certainly processes the right page tags:

    from lxml import etree
    context = etree.iterparse('test.xml', events=('end',), tag='{http://www.mediawiki.org/xml/export-0.8/}page')
    

    but you'll need to use .find() to get the ns and title tags on the page element, or use xpath() calls to get the text directly:

    def process_element(elem):
        # The XPath comparison is true when any child named 'ns' has the value 0.
        if elem.xpath("./*[local-name()='ns']/text()=0"):
            print(elem.xpath("./*[local-name()='title']/text()")[0])
    

    which, for your input example, prints:

    >>> fast_iter(context, process_element)
    MediaWiki:Category
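
Putting the pieces together, the following is a self-contained sketch that uses .findtext() with qualified names instead of xpath(). Several things here are assumptions: the inline sample (including the second page) stands in for test.xml, fast_iter is a guess at the helper from the question based on the common clear-as-you-go pattern, and process_element returns the title rather than printing it so the result can be checked.

```python
import io
from lxml import etree

MW_NS = "http://www.mediawiki.org/xml/export-0.8/"

# Hypothetical stand-in for test.xml, which is not shown in the question.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.8/">
  <page><title>MediaWiki:Category</title><ns>0</ns></page>
  <page><title>Talk:Example</title><ns>1</ns></page>
</mediawiki>"""


def process_element(elem):
    """Return the title of a page whose ns is 0, else None."""
    # Clark notation ({uri}localname) works with .findtext() directly.
    if elem.findtext("{%s}ns" % MW_NS) == "0":
        return elem.findtext("{%s}title" % MW_NS)
    return None


def fast_iter(context, func):
    """Assumed shape of the helper: apply func, then free each element."""
    results = []
    for _event, elem in context:
        value = func(elem)
        if value is not None:
            results.append(value)
        elem.clear()
        # Drop already-processed siblings so the tree does not keep growing.
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return results


context = etree.iterparse(
    io.BytesIO(SAMPLE), events=("end",), tag="{%s}page" % MW_NS
)
titles = fast_iter(context, process_element)
print(titles)
```

Clearing each page after it is handled is what keeps memory use flat on a large dump; for a real file, pass the path 'test.xml' to iterparse() in place of the BytesIO object.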
    