Question:
The XML:
<?xml version="1.0"?>
<pages>
  <page>
    <url>http://example.com/Labs</url>
    <title>Labs</title>
    <subpages>
      <page>
        <url>http://example.com/Labs/Email</url>
        <title>Email</title>
        <subpages>
          <page>
            <url>http://example.com/Labs/Email/How_to</url>
            <title>How-To</title>
          </page>
        </subpages>
      </page>
      <page>
        <url>http://example.com/Labs/Social</url>
        <title>Social</title>
      </page>
    </subpages>
  </page>
  <page>
    <url>http://example.com/Tests</url>
    <title>Tests</title>
    <subpages>
      <page>
        <url>http://example.com/Tests/Email</url>
        <title>Email</title>
        <subpages>
          <page>
            <url>http://example.com/Tests/Email/How_to</url>
            <title>How-To</title>
          </page>
        </subpages>
      </page>
      <page>
        <url>http://example.com/Tests/Social</url>
        <title>Social</title>
      </page>
    </subpages>
  </page>
</pages>
The code:
# rexml is the XML string read from a URL
from xml.etree import ElementTree as ET

tree = ET.fromstring(rexml)
for node in tree.iter('page'):
    for url in node.iterfind('url'):
        print url.text
    for title in node.iterfind('title'):
        print title.text.encode("utf-8")
    print '-' * 30
The output:
http://example.com/article1
Article1
------------------------------
http://example.com/article1/subarticle1
SubArticle1
------------------------------
http://example.com/article2
Article2
------------------------------
http://example.com/article3
Article3
------------------------------
The XML represents a tree-like structure of a sitemap.
I have been up and down the docs and Google all day and can't figure out how to get the node depth of the entries.
I tried counting the children containers, but that only works for the first parent and then breaks, because I can't figure out how to reset the count. It was probably a hackish idea anyway.
The desired output:
0
http://example.com/article1
Article1
------------------------------
1
http://example.com/article1/subarticle1
SubArticle1
------------------------------
0
http://example.com/article2
Article2
------------------------------
0
http://example.com/article3
Article3
------------------------------
Answer 1:
Used lxml.html.
import lxml.html

rexml = ...

def depth(node):
    d = 0
    while node is not None:
        d += 1
        node = node.getparent()
    return d

tree = lxml.html.fromstring(rexml)
for node in tree.iter('page'):
    print depth(node)
    for url in node.iterfind('url'):
        print url.text
    for title in node.iterfind('title'):
        print title.text.encode("utf-8")
    print '-' * 30
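Since the input here is XML rather than HTML, the same parent-walking idea also works with lxml.etree. The sketch below is an addition of mine, not part of the original answer; it assumes rexml holds the question's XML string, and page_level is a hypothetical helper that counts <page> ancestors so the numbers match the desired output (0 for top-level pages, 1 for their subpages, and so on).
# Sketch: same technique with lxml.etree (assumes rexml is the XML string)
from lxml import etree

def page_level(node):
    # count how many <page> ancestors this <page> element has
    level = 0
    parent = node.getparent()
    while parent is not None:
        if parent.tag == 'page':
            level += 1
        parent = parent.getparent()
    return level

root = etree.fromstring(rexml)
for page in root.iter('page'):
    print page_level(page)
    for url in page.iterfind('url'):
        print url.text
    for title in page.iterfind('title'):
        print title.text.encode("utf-8")
    print '-' * 30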
Answer 2:
The Python ElementTree API provides iterators for depth-first traversal of an XML tree; unfortunately, those iterators don't expose any depth information to the caller.
But you can write a depth-first iterator that also returns the depth of each element:
import xml.etree.ElementTree as ET

def depth_iter(element, tag=None):
    stack = []
    stack.append(iter([element]))
    while stack:
        e = next(stack[-1], None)
        if e is None:
            stack.pop()
        else:
            stack.append(iter(e))
            if tag is None or e.tag == tag:
                yield (e, len(stack) - 1)
Note that this is more efficient than determining the depth by following the parent links (when using lxml), i.e. it is O(n) vs. O(n log n).
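As a usage sketch of my own (not part of the original answer): with the question's document, <pages> is yielded at depth 1, a top-level <page> at depth 2, and each further nesting adds two levels (<subpages> plus <page>), so the page nesting level comes out as (d - 2) // 2.
# Hypothetical usage, assuming rexml holds the XML string from the question.
tree = ET.fromstring(rexml)
for page, d in depth_iter(tree, 'page'):
    print (d - 2) // 2          # convert element depth to page nesting level
    url = page.find('url')
    if url is not None:
        print url.text
    title = page.find('title')
    if title is not None:
        print title.text.encode("utf-8")
    print '-' * 30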
Answer 3:
lxml is best for this, but if you have to use the standard library, don't use iter; walk the tree yourself so you always know where you are.
from xml.etree import ElementTree as ET

tree = ET.fromstring(rexml)

def sub(node, tag):
    return node.findall(tag) or []

def print_page(node, depth):
    print "%s" % depth
    url = node.find("url")
    if url is not None:
        print url.text
    title = node.find("title")
    if title is not None:
        print title.text
    print '-' * 30

def find_pages(node, depth=0):
    for page in sub(node, "page"):
        print_page(page, depth)
        subpage = page.find("subpages")
        if subpage is not None:
            find_pages(subpage, depth+1)

find_pages(tree)
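If recursion depth were ever a concern for a very deep sitemap, the same walk can be done with an explicit (node, depth) stack. This is a sketch of my own, not part of the original answer; it reuses print_page and tree from above, and find_pages_iter is a hypothetical name.
def find_pages_iter(root):
    # Walk <page> elements with an explicit stack instead of recursion;
    # reversed() keeps the original document order when popping.
    stack = [(page, 0) for page in reversed(root.findall("page"))]
    while stack:
        page, depth = stack.pop()
        print_page(page, depth)
        subpages = page.find("subpages")
        if subpages is not None:
            for child in reversed(subpages.findall("page")):
                stack.append((child, depth + 1))

find_pages_iter(tree)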
Answer 4:
This is another easy way to get the maximum depth without an XML library: read the indented document from standard input and track the largest indentation seen.
# Assumes the first input line gives the number of lines that follow,
# and that each line's space count reflects its indentation depth.
depth = 0
for i in range(int(input())):
    tab = input().count(' ')
    if tab > depth:
        depth = tab
print(depth)
Source: https://stackoverflow.com/questions/17275524/xml-etree-elementtree-get-node-depth