Python lxml - using the xml:lang attribute to retrieve an element

Posted by 喜欢而已 on 2019-12-07 19:03:35

Question


I have some XML that contains multiple elements with the same name, each in a different language. For example:

<Title xml:lang="FR" type="main">Les Tudors</Title>
<Title xml:lang="DE" type="main">Die Tudors</Title>
<Title xml:lang="IT" type="main">The Tudors</Title>

Normally, I'd retrieve an element using its attributes as follows:

titlex = info.find('.//xmlns:Title[@someattribute=attributevalue]', namespaces=nsmap)

If I try this with [@xml:lang="FR"] (for example), I get the following traceback:

  File "D:/Python code/RBM CRID, Title, Genre/CRID, Title, Genre, Age rating, Episode Number, Descriptions V1.py", line 29, in <module>
    titlex = info.find('.//xmlns:Title[@xml:lang=PL]', namespaces=nsmap) 

  File "lxml.etree.pyx", line 1457, in lxml.etree._Element.find (src\lxml\lxml.etree.c:51435)

  File "C:\Python34\lib\site-packages\lxml\_elementpath.py", line 282, in find
    it = iterfind(elem, path, namespaces)

  File "C:\Python34\lib\site-packages\lxml\_elementpath.py", line 272, in iterfind
    selector = _build_path_iterator(path, namespaces)

  File "C:\Python34\lib\site-packages\lxml\_elementpath.py", line 256, in _build_path_iterator
    selector.append(ops[token[0]](_next, token))

  File "C:\Python34\lib\site-packages\lxml\_elementpath.py", line 134, in prepare_predicate
    token = next()

  File "C:\Python34\lib\site-packages\lxml\_elementpath.py", line 80, in xpath_tokenizer
    raise SyntaxError("prefix %r not found in prefix map" % prefix)
SyntaxError: prefix 'xml' not found in prefix map

I'm not surprised by this, but I'd like suggestions on how to get around the issue.

Thanks!

As requested, a cut-down but complete set of code (It works as expected if I remove the [bitsinsquarebrackets]):

import lxml
import codecs

file_name = (input('Enter the file name, excluding .xml extension: ') + '.xml')# User inputs file name
print('Parsing ' + file_name)


#----- Sets up import and namespace

from lxml import etree

parser = lxml.etree.XMLParser()


tree = lxml.etree.parse(file_name, parser)                                 # Name of file to test goes here
root = tree.getroot()

nsmap = {'xmlns': 'urn:tva:metadata:2012',
         'mpeg7': 'urn:tva:mpeg7:2008'}

#----- This code writes the output to a file

with codecs.open(file_name+'.log', mode='w', encoding='utf-8') as f:                        # Name the output file
    f.write(u'CRID|Title|Genre|Rating|Short Synopsis|Medium Synopsis|Long Synopsis\n')
    for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
       titlex = info.find('.//xmlns:Title[@xml:lang="PL"]', namespaces=nsmap)             # Retrieve the title
       title = titlex.text if titlex is not None else 'Missing'             # If there isn't a title, print an alternative word
       f.write(u'{}\n'.format(title))                     # Write all the retrieved values to the same line with bar separators and a new line

Answer 1:


The xml prefix in xml:lang does not need to be declared in an XML document, but if you want to use xml:lang in XPath lookups, you have to define a prefix mapping in the Python code.

The xml prefix is reserved (as opposed to "normal" namespace prefixes which are arbitrary) and defined to be bound to http://www.w3.org/XML/1998/namespace. See the Namespaces in XML 1.0 W3C recommendation.
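
As a minimal sketch of what this binding means in practice: after parsing, the xml:lang attribute is stored under its fully qualified name, {namespace URI}lang, even though the document never declares the prefix. The constant XML_NS and the variable names are illustrative, and the fallback to the standard library's xml.etree.ElementTree is an assumption added only so the snippet runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library when lxml is unavailable.
    import xml.etree.ElementTree as etree

# The namespace URI that the reserved xml prefix is bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The document itself never declares the xml prefix.
title = etree.fromstring('<Title xml:lang="FR" type="main">Les Tudors</Title>')

# The parser stores xml:lang under "{namespace-URI}lang".
lang_key = "{%s}lang" % XML_NS
print(sorted(title.attrib.keys()))
print(title.get(lang_key))
```

The second print should show FR, which is why a lookup only succeeds when the prefix used in the path resolves to exactly this URI.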

Example:

from lxml import etree

# Required mapping
nsmap = {"xml": "http://www.w3.org/XML/1998/namespace"}

XML = """
<root>
  <Title xml:lang="FR" type="main">Les Tudors</Title>
  <Title xml:lang="DE" type="main">Die Tudors</Title>
  <Title xml:lang="IT" type="main">The Tudors</Title>
</root>"""

doc = etree.fromstring(XML)

title_FR = doc.find('Title[@xml:lang="FR"]', namespaces=nsmap)
print(title_FR.text)

Output:

Les Tudors

If there is no mapping for the xml prefix, you get the "prefix 'xml' not found in prefix map" error. If the URI mapped to the xml prefix is not http://www.w3.org/XML/1998/namespace, the find method in the code snippet above returns None.
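
The three cases described here can be checked side by side. This is only a sketch: the URI urn:example:not-the-xml-namespace is a made-up placeholder for a wrong mapping, the variable names are illustrative, and the import fallback to xml.etree.ElementTree is an assumption for environments without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library when lxml is unavailable.
    import xml.etree.ElementTree as etree

XML_NS = "http://www.w3.org/XML/1998/namespace"

doc = etree.fromstring(
    '<root>'
    '<Title xml:lang="FR" type="main">Les Tudors</Title>'
    '<Title xml:lang="DE" type="main">Die Tudors</Title>'
    '</root>'
)
path = 'Title[@xml:lang="FR"]'

# Case 1: no mapping for the xml prefix, so the path cannot be compiled.
try:
    doc.find(path)
    error_message = None
except SyntaxError as exc:
    error_message = str(exc)

# Case 2: the prefix is mapped, but to the wrong (placeholder) URI.
wrong = doc.find(path, namespaces={"xml": "urn:example:not-the-xml-namespace"})

# Case 3: the prefix is mapped to the reserved XML namespace URI.
right = doc.find(path, namespaces={"xml": XML_NS})

print(error_message)
print(wrong)
print(right.text)
```

Only the third lookup finds the element. The second one fails silently, which makes a mistyped URI harder to spot than a missing mapping.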




Answer 2:


If you have control over the XML file, you should change the xml:lang attribute to lang.

Or, if you do not have that control, I would suggest adding xml to the nsmap, like this:

nsmap = {'xmlns': 'urn:tva:metadata:2012',
         'mpeg7': 'urn:tva:mpeg7:2008',
         'xml': '<namespace>'}
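
For illustration, here is a sketch of the question's loop with the xml entry filled in, assuming the value is the reserved URI http://www.w3.org/XML/1998/namespace. The SAMPLE document, its TVAMain root element, and the placeholder title text are hypothetical stand-ins for the asker's real file, and the import fallback to xml.etree.ElementTree is an assumption for environments without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library when lxml is unavailable.
    import xml.etree.ElementTree as etree

nsmap = {
    "xmlns": "urn:tva:metadata:2012",
    "mpeg7": "urn:tva:mpeg7:2008",
    # Assumed value for the placeholder: the reserved XML namespace URI.
    "xml": "http://www.w3.org/XML/1998/namespace",
}

# Hypothetical sample data, not the asker's actual file.
SAMPLE = """
<TVAMain xmlns="urn:tva:metadata:2012">
  <ProgramInformation>
    <Title xml:lang="PL" type="main">Polish title (placeholder)</Title>
    <Title xml:lang="FR" type="main">Les Tudors</Title>
  </ProgramInformation>
  <ProgramInformation>
    <Title xml:lang="FR" type="main">Les Tudors</Title>
  </ProgramInformation>
</TVAMain>"""

root = etree.fromstring(SAMPLE)

titles = []
for info in root.iterfind(".//xmlns:ProgramInformation", namespaces=nsmap):
    # Look up the Polish title; elements live in the tva namespace,
    # while the lang attribute lives in the reserved XML namespace.
    titlex = info.find('.//xmlns:Title[@xml:lang="PL"]', namespaces=nsmap)
    titles.append(titlex.text if titlex is not None else "Missing")

print(titles)
```

The second ProgramInformation has no Polish title, so it falls through to the 'Missing' default, just as in the question's code.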


Source: https://stackoverflow.com/questions/31250641/python-lxml-using-the-xmllang-attribute-to-retrieve-an-element
