Question
Problem: I have the following XML snippet:
...snip...
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
..snip...
I need to search the whole XML document, find the heading whose text is DEFINITION, and print the associated definitions. The format of the definitions is not consistent and can change attributes/elements, so the only reliable way of capturing all of it is to read until the next element with the class p_cat_heading.
Right now I am using the following code to find all of the headers:
for heading in root.findall(".//*[@class='p_cat_heading']"):
    if heading.text == "DEFINITION":
        <WE FOUND THE CORRECT HEADER - TAKE ACTION HERE>
Things I have tried:
- Using lxml's getnext method. This gets the next sibling which has the attribute "p_cat_heading" which isn't what I want.
- following_sibling - lxml is supposed to support this but it throws "following-sibling is not found in prefix-map"
My Solution:
I haven't finished it, but because my XML is short I was just going to get a list of all elements, iterate until the one with the DEFINITION text, and then iterate until the next element with the p_cat_heading class. This solution is horrible and ugly, but I can't seem to find a clean alternative.
What I'm looking for:
A more Pythonic way of printing the definition which is "this, these" in our case. Solution may use either xpath or some alternative. Python-native solutions preferred, but anything will do.
Answer 1:
There are a couple of ways of doing this, but by relying on xpath to do most of the work, this expression
//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]
should work.
Using lxml:
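from lxml import html

# The snippet from the question
data = '''
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''
exp = "//*[@class='p_cat_heading'][contains(text(),'DEFINITION')]/following-sibling::*[1]"
tree = html.fromstring(data)
# following-sibling::*[1] selects only the element immediately after the heading
target = tree.xpath(exp)
for i in target:
    print(i.text_content())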
Output:
This, these.
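The expression above returns only the first element after the heading, while the question asks for everything up to the next heading. A minimal sketch of that, assuming the headings and the definition elements are siblings under one parent: walk the following siblings with lxml's itersiblings() and stop at the next p_cat_heading. The helper name section_after, the wrapping div, and the extra "p_other" paragraph are illustrative additions, not part of the original snippet.

```python
from lxml import html

# Sample markup modelled on the question's snippet. The wrapping <div> and the
# second definition paragraph are hypothetical, added to show multi-element capture.
DATA = """
<div>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_other">A second paragraph.</p>
<p class="p_cat_heading">PRONUNCIATION </p>
<p class="p_numberedbullet">Not part of the definition.</p>
</div>
"""


def section_after(tree, title, heading_class="p_cat_heading"):
    """Return the text of each sibling between the titled heading and the next heading."""
    results = []
    # XPath variables ($cls) avoid building the expression by string formatting.
    for heading in tree.xpath(".//*[@class=$cls]", cls=heading_class):
        if (heading.text or "").strip() != title:
            continue
        # itersiblings() yields the following siblings in document order.
        for sibling in heading.itersiblings():
            if sibling.get("class") == heading_class:
                break  # reached the next heading, so the section is over
            results.append(sibling.text_content().strip())
        break  # only the first matching heading is handled
    return results


tree = html.fromstring(DATA)
definition = section_after(tree, "DEFINITION")
print(definition)
```

Note that this uses .xpath() and itersiblings() rather than findall(): findall() implements only the limited ElementPath subset, which has no axes such as following-sibling, and that is most likely where the "prefix-map" error in the question comes from.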
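Answer 2:
You can use BeautifulSoup with CSS selectors for this task. The selector .p_cat_heading:contains("DEFINITION") ~ .p_cat_heading will select all elements with class p_cat_heading that are preceded by a sibling element with class p_cat_heading containing the string "DEFINITION". (Recent versions of soupsieve, the selector library behind BeautifulSoup, spell this pseudo-class :-soup-contains and deprecate the bare :contains form.)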
data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for heading in soup.select('.p_cat_heading:contains("DEFINITION") ~ .p_cat_heading'):
    print(heading.text)
Prints:
PRONUNCIATION
Further reading
CSS Selector guide
EDIT:
To select the direct sibling after the DEFINITION heading, provided it is not itself a heading:
data = '''
<p class="p_cat_heading">THIS YOU DONT WANT</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This is after DEFINITION</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
<p class="p_numberedbullet"><span class="calibre10">This is after PRONUNCIATION</span>, <span class="calibre10">these</span>. </p>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
s = soup.select_one('.p_cat_heading:contains("DEFINITION") + :not(.p_cat_heading)')
print(s.text)
Prints:
This is after DEFINITION, these.
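Since the question prefers a Python-native solution, here is a dependency-free sketch using only the standard library's xml.etree.ElementTree. It assumes the snippet is well-formed XML and that the heading and its definition elements share a parent; the root wrapper element and the function name collect_section are hypothetical, introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# The question's snippet wrapped in a hypothetical <root> element so that it
# parses as a well-formed XML document.
XML = """<root>
<p class="p_cat_heading">DEFINITION</p>
<p class="p_numberedbullet"><span class="calibre10">This</span>, <span class="calibre10">these</span>. </p>
<p class="p_cat_heading">PRONUNCIATION </p>
</root>"""


def collect_section(root, title, heading_class="p_cat_heading"):
    """Collect the text of elements between the titled heading and the next heading."""
    # ElementTree elements do not know their parent or siblings, so treat every
    # element as a potential parent and scan its children in document order.
    for parent in root.iter():
        collected = []
        inside = False
        for child in parent:
            if child.get("class") == heading_class:
                if inside:
                    return collected  # reached the next heading
                inside = (child.text or "").strip() == title
            elif inside:
                # itertext() joins the element's own text with that of its descendants.
                collected.append("".join(child.itertext()).strip())
        if inside:
            return collected  # the section ran to the end of the parent
    return []


root = ET.fromstring(XML)
section = collect_section(root, "DEFINITION")
print(section)
```

The trade-off against the lxml and BeautifulSoup answers is that the standard library offers no sibling navigation, so the loop tracks state by hand; in exchange nothing needs to be installed.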
Source: https://stackoverflow.com/questions/56904167/most-pythonic-way-to-find-the-sibling-of-an-element-in-xml