lxml attributes require full namespace

Question

The code below reads a table from an Excel 2003 XML workbook using lxml (Python 3.3). The code works fine, but to access the Type attribute of the Data element via the get() method I have to use the key '{urn:schemas-microsoft-com:office:spreadsheet}Type'. Why is this, given that I have mapped this namespace to the ss prefix?

All I can think of is that this namespace appears twice in the document, once with a namespace prefix and once without, i.e.:

<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:o="urn:schemas-microsoft-com:office:office"
 xmlns:x="urn:schemas-microsoft-com:office:excel"
 xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
 xmlns:html="http://www.w3.org/TR/REC-html40">

In the file, the element and attribute are written as below: the Type attribute with the ss: prefix, and the Cell and Data elements with no prefix. However, the declarations say both belong to the same namespace, 'urn:schemas-microsoft-com:office:spreadsheet', so surely the parser should treat them equivalently?

<Cell><Data ss:Type="String">QB11128020</Data></Cell>

My code:

from lxml import etree

# filename is the path to the Excel 2003 XML workbook.
# Binary mode lets the parser detect the file encoding itself.
with open(filename, 'rb') as f:
    doc = etree.parse(f)

# Prefix-to-URI map; it is only consulted by the xpath() calls below.
namespaces = {'o': 'urn:schemas-microsoft-com:office:office',
              'x': 'urn:schemas-microsoft-com:office:excel',
              'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

ws = doc.xpath('/ss:Workbook/ss:Worksheet', namespaces=namespaces)
if ws:
    tables = ws[0].xpath('./ss:Table', namespaces=namespaces)
    if tables:
        rows = tables[0].xpath('./ss:Row', namespaces=namespaces)
        for row in rows:
            cells = row.xpath('./ss:Cell/ss:Data', namespaces=namespaces)
            for cell in cells:
                print(cell.text)
                print(cell.keys())
                print(cell.get('{urn:schemas-microsoft-com:office:spreadsheet}Type'))

Answer 1:

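According to the Namespaces section of The lxml.etree Tutorial:

"The ElementTree API avoids namespace prefixes wherever possible and deploys the real namespaces (the URI) instead."

The parser does treat the two forms equivalently: the unprefixed Data element and the ss:Type attribute both resolve to the same namespace URI. What get() does not do is interpret prefixes. It expects the attribute name in '{URI}name' form, and the namespaces dictionary you built is only consulted by the xpath() calls it is passed to. Note also that a default namespace declaration (xmlns="...") applies to element names only, never to attributes, which is why the file has to write ss:Type for the attribute to be in that namespace at all.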
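As a rough illustration of this convention, the sketch below parses a trimmed-down workbook that binds the same URI both as the default namespace and to the ss prefix, as in the question. The Plain attribute is hypothetical and added only for contrast; it is not part of the Excel format.

```python
from lxml import etree

SS = 'urn:schemas-microsoft-com:office:spreadsheet'

# Trimmed-down workbook; Plain="x" is a made-up unprefixed attribute.
xml = (
    '<Workbook xmlns="%s" xmlns:ss="%s">'
    '<Cell><Data ss:Type="String" Plain="x">QB11128020</Data></Cell>'
    '</Workbook>' % (SS, SS)
)
root = etree.fromstring(xml)
data = root.find('{%s}Cell/{%s}Data' % (SS, SS))

# Unprefixed element names pick up the default namespace.
print(data.tag)

# Attribute names do not: only the prefixed attribute is namespace-qualified.
print(sorted(data.keys()))

print(data.get('{%s}Type' % SS))  # qualified name finds the attribute
print(data.get('Type'))           # bare name finds nothing (None)
```

The element tag and the Type attribute key both come back in '{URI}name' form, while Plain stays a bare name.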

By the way, the following call

cell.get('{urn:schemas-microsoft-com:office:spreadsheet}Type')

can be written as:

cell.get('{%(ss)s}Type' % namespaces)

or:

cell.get('{{{0[ss]}}}Type'.format(namespaces))
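Two further options may be worth considering, sketched here under the assumption that a prefix map like the one in the question is available: building the key with etree.QName, or addressing the attribute through XPath, where the ss prefix is honoured. The standalone Data element is a simplified stand-in for a cell taken from the workbook.

```python
from lxml import etree

namespaces = {'ss': 'urn:schemas-microsoft-com:office:spreadsheet'}

# Simplified stand-in for one Data element from the workbook.
cell = etree.fromstring(
    '<Data xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet" '
    'ss:Type="String">QB11128020</Data>'
)

# QName assembles the '{URI}Type' key without manual string formatting.
type_key = etree.QName(namespaces['ss'], 'Type').text
print(cell.get(type_key))

# XPath uses the prefix map, so the attribute can be written as @ss:Type.
# An attribute step returns a list of matching values.
print(cell.xpath('@ss:Type', namespaces=namespaces))

# Wrapping the step in string() returns a single string instead of a list.
print(cell.xpath('string(@ss:Type)', namespaces=namespaces))
```

The XPath form keeps the attribute lookup consistent with the ss:-prefixed element queries already used in the question.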

Source: https://stackoverflow.com/questions/20930059/lxml-attributes-require-full-namespace

Tags: python, xml, excel, lxml