I'm new to XML parsing and Python, so bear with me. I'm using lxml to parse a wiki dump, but I just want, for each page, its title and its text.
For now I\'ve got this:
First, you need to locate the parent element, page. I don't know how deeply it is nested, but once you find it, you can immediately obtain the title tag:
>>> from lxml import etree as ET
>>> page_tag = ET.fromstring(xdata)
>>> title_tag = page_tag.find('title')
>>> title_tag.text
'Aratrum'
Once you have many pages to process, you can do this:
from lxml import etree

def parser(file_name):
    document = etree.parse(file_name)
    titles = []
    for page_tag in document.findall('page'):
        titles.append(page_tag.find('title').text)
    return titles
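Since you ultimately want the text as well as the title, here is a minimal sketch of the same pattern extended to collect both. It assumes the text of each page sits under a revision/text child, which is how MediaWiki exports are laid out, and it ignores XML namespaces (more on those below):

from lxml import etree

def parse_pages(file_name):
    # Sketch only: assumes each <page> has a <title> child and its text
    # under <revision>/<text>, and that no XML namespace is in play.
    document = etree.parse(file_name)
    pages = []
    for page_tag in document.findall('page'):
        title = page_tag.find('title').text
        text_tag = page_tag.find('revision/text')
        pages.append((title, text_tag.text if text_tag is not None else None))
    return pages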
Hope this helps!
The problem is that you are not taking XML namespaces into account. The XML document (and every element in it) is in the http://www.mediawiki.org/xml/export-0.7/ namespace. To make it work, you need to change
titles = document.findall('.//title')
to
titles = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')
The namespace can also be provided via the namespaces parameter:
NSMAP = {'mw':'http://www.mediawiki.org/xml/export-0.7/'}
titles = document.findall('.//mw:title', namespaces=NSMAP)
This works in Python 2.7, but it is not explained in the Python 2.7 documentation (the Python 3.3 documentation is better).
See also http://effbot.org/zone/element-namespaces.htm and this SO question with answer: Parsing XML with namespace in Python via 'ElementTree'.
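As a quick sanity check, both spellings select the same elements. The sketch below assumes a dump file called dump.xml (a placeholder name) that declares the 0.7 export namespace:

from lxml import etree

NSMAP = {'mw': 'http://www.mediawiki.org/xml/export-0.7/'}

document = etree.parse('dump.xml')  # placeholder file name

# Clark notation: the namespace URI spelled out in braces.
clark = document.findall('.//{http://www.mediawiki.org/xml/export-0.7/}title')

# Prefix plus a namespaces mapping: shorter and easier to read.
prefixed = document.findall('.//mw:title', namespaces=NSMAP)

print([t.text for t in clark] == [t.text for t in prefixed])  # True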
The trouble with iterparse() is that it yields (event, element) tuples, not bare elements. To get the tag name, change
for e in etree.iterparse(file_name):
    print e.tag
to this:
for e in etree.iterparse(file_name):
    print e[1].tag
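Equivalently, you can unpack the tuple right in the for statement, which reads a bit more clearly; the file name below is just a placeholder:

from lxml import etree

file_name = 'dump.xml'  # placeholder path

# By default iterparse() only reports 'end' events, so each element
# is fully parsed by the time it is yielded here.
for event, element in etree.iterparse(file_name):
    print(element.tag)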