Most Pythonic way to find the sibling of an element in XML

Question

Problem: I have the following XML snippet:

...snip...
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
..snip...

I need to search the entire XML document, find the heading whose text is DEFINITION, and print the associated definitions. The format of the definitions is inconsistent (attributes and elements can vary), so the only reliable way to capture all of it is to read until the next element with class p_cat_heading.

Right now I am using the following code to find all of the headers:

for heading in root.findall(".//*[@class='p_cat_heading']"):
    if heading.text == "DEFINITION":
        pass  # found the correct header; take action here

Things I have tried:

Using lxml's getnext method. This gets the next sibling which has the attribute "p_cat_heading" which isn't what I want.
The following-sibling axis. lxml is supposed to support this, but it throws "following-sibling is not found in prefix-map".

My Solution:

I haven't finished it, but because my XML is short I was just going to get a list of all elements, iterate until the heading with the text DEFINITION, and then keep iterating until the next element with class p_cat_heading. This solution is horrible and ugly, but I can't seem to find a clean alternative.

What I'm looking for:

A more Pythonic way of printing the definition, which is "This, these." in our case. The solution may use XPath or some alternative. Python-native solutions are preferred, but anything will do.

Answer 1:

There are a couple of ways of doing this, but by relying on xpath to do most of the work, this expression

//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]

should work.

Using lxml:

from lxml import html

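# Sample input: the snippet from the question
data = """
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
"""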
exp = "//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]"

tree = html.fromstring(data) 
target = tree.xpath(exp)

for i in target:
    print(i.text_content())

Output:

This, these.
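
The expression above returns only the first element after the heading. The question says a definition may span several elements, so here is a minimal sketch that keeps reading siblings until the next p_cat_heading. It assumes the headings and their content are siblings under one parent, as in the snippet. The helper name section_texts and the extra p_other paragraph are made up for illustration. Note that the following-sibling axis and XPath functions are only available through .xpath(); find() and findall() use the more limited ElementPath syntax, which is probably where the prefix-map error in the question comes from.

```python
from lxml import html

# Sample input: the question's snippet plus a hypothetical second
# definition paragraph, to show that more than one element is collected.
data = """
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_other">A second definition paragraph.</p>
<p class="p_cat_heading">PRONUNCIATION </p>
<p class="p_numberedbullet">Not part of the definition.</p>
"""


def section_texts(tree, title, heading_class="p_cat_heading"):
    """Return the text of every sibling between a heading and the next heading."""
    # $title and $cls are XPath variables filled in from the keyword arguments.
    headings = tree.xpath(
        "//*[@class=$cls][normalize-space()=$title]", cls=heading_class, title=title
    )
    texts = []
    for heading in headings:
        # itersiblings() walks the following siblings in document order.
        for sibling in heading.itersiblings():
            if sibling.get("class") == heading_class:
                break  # reached the next heading, so the section is over
            texts.append(sibling.text_content().strip())
    return texts


tree = html.fromstring(data)
definitions = section_texts(tree, "DEFINITION")
for text in definitions:
    print(text)
```

Matching on normalize-space() rather than contains(text(), ...) is a design choice in this sketch: it tolerates the trailing space seen in "PRONUNCIATION " and avoids matching headings that merely contain the word.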

Answer 2:

You can use BeautifulSoup with CSS selectors for this task. The selector .p_cat_heading:contains("DEFINITION") ~ .p_cat_heading selects every element with class p_cat_heading that is preceded by a sibling with class p_cat_heading containing the string "DEFINITION". (Recent releases of soupsieve, the selector library behind BeautifulSoup, deprecate :contains in favor of :-soup-contains.)

data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

for heading in soup.select('.p_cat_heading:contains("DEFINITION") ~ .p_cat_heading'):
    print(heading.text)

Prints:

PRONUNCIATION

Further reading

CSS Selector guide

EDIT:

To select the direct sibling after the DEFINITION heading:

data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This is after DEFINITION</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
<p class="p_numberedbullet"><span class="calibre10">This is after PRONUNCIATION</span>, <span class="calibre10">these</span>. </p>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

s = soup.select_one('.p_cat_heading:contains("DEFINITION") + :not(.p_cat_heading)')
print(s.text)

Prints:

This is after DEFINITION, these.
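
Both selectors above pick a single element. Since the question prefers Python-native solutions and wants everything up to the next heading, here is a rough sketch using only the standard library's xml.etree.ElementTree. It assumes the input is well-formed XML with a single root (the body wrapper below is added for that reason) and that headings and their content share a parent. The helper name section_texts is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample input: the question's snippet wrapped in a single root element,
# because ElementTree only parses well-formed XML with one root.
data = """<body>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
<p class="p_numberedbullet">Not part of the definition.</p>
</body>"""


def section_texts(root, title, heading_class="p_cat_heading"):
    """Collect the text of elements between the matching heading and the next one."""
    texts = []
    # ElementTree elements have no sibling links, so walk each parent's
    # children in order and toggle a flag whenever a heading is seen.
    for parent in root.iter():
        collecting = False
        for child in parent:
            if child.get("class") == heading_class:
                collecting = (child.text or "").strip() == title
            elif collecting:
                texts.append("".join(child.itertext()).strip())
    return texts


root = ET.fromstring(data)
definitions = section_texts(root, "DEFINITION")
print(definitions)
```

This is essentially the "iterate until the next heading" idea from the question, kept short by letting each heading switch collection on or off.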

Source: https://stackoverflow.com/questions/56904167/most-pythonic-way-to-find-the-sibling-of-an-element-in-xml

Tags

python

xml

xpath