parsing xml by python lxml tree.xpath

问题

I try to parse a huge file. The sample is below. I try to take <Name>, but I can't It works only without this string

<LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">

xml2 = '''<?xml version="1.0" encoding="UTF-8"?>
<PackageLevelLayout>
<LevelLayouts>
    <LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">
                <LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
                    <LevelLayoutSectionBase>
                        <LevelLayoutItemBase>
                            <Name>Tracking ID</Name>
                        </LevelLayoutItemBase>
                    </LevelLayoutSectionBase>
                </LevelLayout>
            </LevelLayout>
    </LevelLayouts>
</PackageLevelLayout>'''

from lxml import etree
tree = etree.XML(xml2)
nodes = tree.xpath('/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]/LevelLayout/LevelLayoutSectionBase/LevelLayoutItemBase/Name')
print nodes

回答1:

Your nested LevelLayout XML document uses a namespace. I'd use:

tree.xpath('.//LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]//*[local-name()="Name"]')

to match the Name element with a shorter XPath expression (ignoring the namespace altogether).

The alternative is to use a prefix-to-namespace mapping and use those on your tags:

nsmap = {'acd': 'http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain'}

tree.xpath('/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]/acd:LevelLayout/acd:LevelLayoutSectionBase/acd:LevelLayoutItemBase/acd:Name',
    namespaces=nsmap)

回答2:

lxml's xpath method has a namespaces parameter. You can pass it a dict mapping namespace prefixes to namespaces. Then you can refer build XPaths that use the namespace prefix:

xml2 = '''<?xml version="1.0" encoding="UTF-8"?>
<PackageLevelLayout>
<LevelLayouts>
    <LevelLayout levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432">
                <LevelLayout xmlns="http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain" xmlns:i="http://www.w3.org/2001/XMLSchema-instance">
                    <LevelLayoutSectionBase>
                        <LevelLayoutItemBase>
                            <Name>Tracking ID</Name>
                        </LevelLayoutItemBase>
                    </LevelLayoutSectionBase>
                </LevelLayout>
            </LevelLayout>
    </LevelLayouts>
</PackageLevelLayout>'''

namespaces={'ns': 'http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain',
            'i': 'http://www.w3.org/2001/XMLSchema-instance'}

import lxml.etree as ET
# This is an lxml.etree._Element, not a tree, so don't call it tree
root = ET.XML(xml2)

nodes = root.xpath(
    '''/PackageLevelLayout/LevelLayouts/LevelLayout[@levelGuid="4a54f032-325e-4988-8621-2cb7b49d8432"]
       /ns:LevelLayout/ns:LevelLayoutSectionBase/ns:LevelLayoutItemBase/ns:Name''', namespaces = namespaces)
print nodes

yields

[<Element {http://schemas.datacontract.org/2004/07/ArcherTech.Common.Domain}Name at 0xb74974dc>]

来源：https://stackoverflow.com/questions/15576706/parsing-xml-by-python-lxml-tree-xpath

标签

python

xml

xpath

lxml