Empty list returned from ElementTree findall

我在风中等你 2020-12-01 11:15

I'm new to XML parsing and Python, so bear with me. I'm using lxml to parse a wiki dump, but all I want from each page is its title and text.

For now I've got this:

2 Answers
  • 2020-12-01 11:49

    First, you need to locate the parent element, page. I don't know how deeply it is nested, but once you find it, you can immediately obtain the title tag:

    >>> page_tag = ET.fromstring(xdata)
    >>> title_tag = page_tag.find('title')
    >>> title_tag.text
    'Aratrum'
    

    Given the additional information in the question, you can do this:

    from lxml import etree

    def parser(file_name):
        document = etree.parse(file_name)
        titles = []
        # Matches only un-namespaced <page> elements directly under the root.
        for page_tag in document.findall('page'):
            titles.append(page_tag.find('title').text)
        return titles
    

    Hope this helps!

  • 2020-12-01 11:58

    The problem is that you are not taking XML namespaces into account. The XML document (and all the elements in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make it work, you need to change

    titles = document.findall('.//title')
    

    to

    titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
    

    The namespace can also be provided via the namespaces parameter:

    NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
    titles = document.findall('.//mw:title', namespaces=NSMAP)
    

    This works in Python 2.7, but it is not explained in the Python 2.7 documentation (the Python 3.3 documentation is better).

    See also http://effbot.org/zone/element-namespaces.htm and this SO question with answer: Parsing XML with namespace in Python via 'ElementTree'.
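
    To make the namespace handling above concrete, here is a minimal sketch that pulls the title and text of every page. It uses the standard-library xml.etree.ElementTree so that it runs without extra installs; lxml.etree accepts the same findall/findtext calls. The SAMPLE document, the titles_and_texts helper, and the page/revision/text nesting are assumptions made for illustration and are not taken from the original dump.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature dump; a real MediaWiki export has many more elements.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page>
    <title>Aratrum</title>
    <revision>
      <text>Aratrum is the Latin word for plough.</text>
    </revision>
  </page>
  <page>
    <title>Second page</title>
    <revision>
      <text>Placeholder text.</text>
    </revision>
  </page>
</mediawiki>"""

# The prefix "mw" is arbitrary; only the URI has to match the document.
NSMAP = {"mw": "http://www.mediawiki.org/xml/export-0.7/"}


def titles_and_texts(root):
    """Return a list of (title, text) pairs, one per <page> child of root."""
    pairs = []
    for page in root.findall("mw:page", namespaces=NSMAP):
        title = page.findtext("mw:title", namespaces=NSMAP)
        # Assumption: the page text sits under <revision><text>.
        text = page.findtext("mw:revision/mw:text", namespaces=NSMAP)
        pairs.append((title, text))
    return pairs


root = ET.fromstring(SAMPLE)
unqualified = root.findall(".//title")  # no namespace given, so nothing matches
pages = titles_and_texts(root)
print(unqualified)
print(pages)
```

    The unqualified search comes back empty for the same reason as in the question, while the prefixed paths find both pages.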


    The trouble with iterparse() is caused by the fact that this function provides (event, element) tuples (not just elements). In order to get the tag name, change

    for e in etree.iterparse(file_name):
        print(e.tag)
    

    to this:

    for event, elem in etree.iterparse(file_name):
        print(elem.tag)
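
    For a large dump, the iterparse() approach can be combined with the namespace-qualified tag name. The following is only a sketch under stated assumptions: it uses the standard-library xml.etree.ElementTree (lxml.etree.iterparse also yields (event, element) tuples), and the SAMPLE bytes and the iter_titles helper are hypothetical names introduced for this example.

```python
import io
import xml.etree.ElementTree as ET

# Tags reported by the parser include the namespace in {uri}name form.
NS = "{http://www.mediawiki.org/xml/export-0.7/}"

# Hypothetical stand-in for a dump file on disk.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.7/">
  <page><title>Aratrum</title></page>
  <page><title>Second page</title></page>
</mediawiki>"""


def iter_titles(source):
    """Yield the title of each <page> once that page has been fully parsed."""
    # Each item is an (event, element) tuple, so unpack it in the loop header.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "page":
            yield elem.findtext(NS + "title")
            elem.clear()  # discard the page's children once it has been used


titles = list(iter_titles(io.BytesIO(SAMPLE)))
print(titles)
```

    Passing a file name instead of the BytesIO object should work the same way, since iterparse() accepts either.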
    