lxml attributes require full namespace

Submitted by こ雲淡風輕ζ on 2020-01-03 16:49:01

Question


The code below reads a table from an Excel 2003 XML workbook using lxml (Python 3.3). The code works fine; however, to access the Type attribute of the Data element via the get() method I need to use the key '{urn:schemas-microsoft-com:office:spreadsheet}Type'. Why is this, given that I've mapped this namespace to the ss prefix?

All I can think of is that this namespace appears twice in the document, once with a namespace prefix and once without, i.e.

<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">

In the file, the element and attribute are declared as below: the Type attribute with the ss: prefix, and the Cell and Data elements with no prefix. However, the declarations say both belong to the same namespace, 'urn:schemas-microsoft-com:office:spreadsheet', so surely the parser should treat them equivalently?

<Cell><Data ss:Type="String">QB11128020</Data></Cell>

My code:

from lxml import etree

# Passing the path lets lxml open the file and detect the encoding itself.
doc = etree.parse(filename)

# Prefix-to-URI map; it is only consulted by the xpath() calls below.
namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
              'x': 'urn:schemas-microsoft-com:office:excel',
              'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

ws = doc.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if ws:
    tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    if tables:
        rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
        for row in rows:
            cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
            for cell in cells:
                print(cell.text)
                print(cell.keys())
                # get() needs the attribute name in '{uri}local' form.
                print(cell.get('{urn:schemas-microsoft-com:office:spreadsheet}Type'))

Answer 1:


According to the Namespaces section of the lxml.etree tutorial:

The ElementTree API avoids namespace prefixes wherever possible and deploys the real namespaces (the URI) instead:
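In other words, the prefix dictionary passed to xpath() is only used for that XPath evaluation; get() and keys() work with names in '{uri}local' form. Note also that, under the XML Namespaces rules, a default xmlns declaration applies to elements but not to unprefixed attributes, which is why the attribute carries an explicit ss: prefix in the file. The following is a minimal sketch of this behaviour, not the tutorial's own example; the inline document and the Plain attribute are made up for illustration.

```python
from lxml import etree

SS = 'urn:schemas-microsoft-com:office:spreadsheet'

# Hypothetical miniature document: the same URI is bound both as the
# default namespace and to the 'ss' prefix, as in the question's workbook.
xml = (
    '<Workbook xmlns="%s" xmlns:ss="%s">'
    '<Cell><Data ss:Type="String" Plain="x">QB11128020</Data></Cell>'
    '</Workbook>' % (SS, SS)
)
root = etree.fromstring(xml)
data = root[0][0]

# The unprefixed element picks up the default namespace.
print(data.tag)

# The prefixed attribute is stored under its '{uri}local' name, while the
# unprefixed 'Plain' attribute (made up here) is in no namespace at all.
print(sorted(data.keys()))

# Lookup only succeeds with the namespace-qualified key.
print(data.get('{%s}Type' % SS))
print(data.get('Type'))
```

The last call prints None because no attribute named plain 'Type' exists on the element.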


By the way, the following call

cell.get('{urn:schemas-microsoft-com:office:spreadsheet}Type')

can be written as:

cell.get('{%(ss)s}Type' % namespaces)

or:

cell.get('{{{0[ss]}}}Type'.format(namespaces))
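Two further options avoid string formatting altogether: building the key with etree.QName, or reading the attribute through XPath, where the ss prefix from the namespaces dictionary does apply. This is a rough sketch under the assumption of a single stand-alone Data element; ns_key is a hypothetical helper name, not part of lxml.

```python
from lxml import etree

namespaces = {'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}


def ns_key(prefix, local, nsmap=namespaces):
    """Hypothetical helper: expand prefix + local name into '{uri}local'."""
    return etree.QName(nsmap[prefix], local).text


# Stand-alone element mirroring the one in the question.
cell = etree.fromstring(
    '<Data xmlns="urn:schemas-microsoft-com:office:spreadsheet" '
    'xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" '
    'ss:Type="String">QB11128020</Data>'
)

# Option 1: let QName assemble the namespace-qualified key for get().
type_via_get = cell.get(ns_key('ss', 'Type'))

# Option 2: XPath resolves the 'ss' prefix through the namespaces mapping.
type_via_xpath = cell.xpath('string(@ss:Type)', namespaces=namespaces)

print(type_via_get, type_via_xpath)
```

The XPath form keeps the prefix notation used elsewhere in the question's code, at the cost of an XPath evaluation per cell.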


Source: https://stackoverflow.com/questions/20930059/lxml-attributes-require-full-namespace
