I'm new to XML parsing and Python, so bear with me. I'm using lxml to parse a wiki dump, but I just want, for each page, its title and its text.
For now I\'ve got this:
First, you need to locate the parent element, page. I don't know how deeply it is nested, but once you find it, you can immediately obtain the title tag:
>>> from lxml import etree as ET
>>> page_tag = ET.fromstring(xdata)
>>> title_tag = page_tag.find('title')
>>> title_tag.text
'Aratrum'
Once you have many pages to process, you can do this:
from lxml import etree

def parser(file_name):
    document = etree.parse(file_name)
    titles = []
    for page_tag in document.findall('page'):
        titles.append(page_tag.find('title').text)
    return titles
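Since you ultimately want the text as well as the title, here is a minimal sketch of the same pattern extended to collect both. It assumes the text of each page sits under a revision/text child, which is how MediaWiki exports are laid out, and it ignores XML namespaces (more on those below):

from lxml import etree

def parse_pages(file_name):
    # Sketch only: assumes each <page> has a <title> child and its text
    # under <revision>/<text>, and that no XML namespace is in play.
    document = etree.parse(file_name)
    pages = []
    for page_tag in document.findall('page'):
        title = page_tag.find('title').text
        text_tag = page_tag.find('revision/text')
        pages.append((title, text_tag.text if text_tag is not None else None))
    return pages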
Hope this helps!
The problem is that you are not taking XML namespaces into account. The XML document (and every element in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make it work, you need to change
titles = document.findall('.//title')
to
titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
The namespace can also be provided via the namespaces parameter:
NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)
This works in Python 2.7, but it is not explained in the Python 2.7 documentation (the Python 3.3 documentation is better).
See also http://effbot.org/zone/element-namespaces.htm and this SO question with answer: Parsing XML with namespace in Python via 'ElementTree'.
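As a quick sanity check, both spellings select the same elements. The sketch below assumes a dump file called dump.xml (a placeholder name) that declares the 0.7 export namespace:

from lxml import etree

NSMAP = {'mw': 'http://www.mediawiki.org/xml/export-0.7/'}

document = etree.parse('dump.xml')  # placeholder file name

# Clark notation: the namespace URI spelled out in braces.
clark = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')

# Prefix plus a namespaces mapping: shorter and easier to read.
prefixed = document.findall('.//mw:title', namespaces=NSMAP)

print([t.text for t in clark] == [t.text for t in prefixed])  # True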
The trouble with iterparse() is that it yields (event, element) tuples, not bare elements. To get the tag name, change
for e in etree.iterparse(file_name):
    print e.tag
to this:
for e in etree.iterparse(file_name):
    print e[1].tag
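Equivalently, you can unpack the tuple right in the for statement, which reads a bit more clearly; the file name below is just a placeholder:

from lxml import etree

file_name = 'dump.xml'  # placeholder path

# By default iterparse() only reports 'end' events, so each element
# is fully parsed by the time it is yielded here.
for event, element in etree.iterparse(file_name):
    print(element.tag)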