Question
I have the following HTML:
<div id = "big">
    <span>header 1</span>
    <ul id = "outer">
        <li id = "inner">aaa</li>
        <li id = "inner">bbb</li>
    </ul>
    <span>header 2</span>
    <ul id = "outer">
        <li id = "inner">ccc</li>
        <li id = "inner">ddd</li>
    </ul>
</div>
I want to loop over it in this order:
header 1
aaa
bbb
header 2
ccc
ddd
I have tried looping through each ul and then printing the header and the li values. However, I don't know how to get the span header that is associated with a ul.
sets = tree.xpath("//div[@id='big']//ul[@id='outer']")
for set in sets:
    # Print header. Not sure how to get it
    header = set.xpath(".//li/preceding-sibling::span")
    print(header)
    # Print texts. This works.
    values = set.xpath(".//li//text()")
    for v in values:
        print(v)
Just looping over all text nodes won't work, because I need to know whether each one is a header or an li value.
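Answer 1:
This worked. Here ingred_set is the ul element from the loop (apparently the variable called set in the question's code), and getprevious() returns the sibling element just before it, which is the span: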
header = ingred_set.getprevious().xpath(".//text()")[0]
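For completeness, here is a minimal end-to-end sketch of that approach applied to the markup from the question. The function name headers_and_items and the PAGE constant are made up for illustration, and the check on header.tag is an extra guard that the original one-liner does not have.

```python
from lxml import html

# The markup from the question, repeated ids included.
PAGE = """
<div id="big">
  <span>header 1</span>
  <ul id="outer">
    <li id="inner">aaa</li>
    <li id="inner">bbb</li>
  </ul>
  <span>header 2</span>
  <ul id="outer">
    <li id="inner">ccc</li>
    <li id="inner">ddd</li>
  </ul>
</div>
"""


def headers_and_items(page):
    """Return each header followed by its li values, in document order."""
    tree = html.fromstring(page)
    lines = []
    for ul in tree.xpath("//div[@id='big']//ul[@id='outer']"):
        # getprevious() returns the preceding sibling element, or None.
        header = ul.getprevious()
        if header is not None and header.tag == "span":
            lines.append(header.text_content().strip())
        # Direct li children of this ul only.
        for li in ul.xpath("./li"):
            lines.append(li.text_content().strip())
    return lines


for line in headers_and_items(PAGE):
    print(line)
```

The same header can also be reached with XPath alone, using ul.xpath("preceding-sibling::span[1]") with the ul as the context node. The attempt in the question returned nothing because it looked for span siblings of the li elements, and the spans are siblings of the ul.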
Answer 2:
For HTML use BeautifulSoup. It gives you access to previous and next siblings directly:
sibling_soup.b.next_sibling
# <c>text2</c>
sibling_soup.c.previous_sibling
# <b>text1</b>
Also, you can tell BeautifulSoup to use the lxml parser in the constructor. From practice I can tell that lxml performs better than the default html.parser on ill-formatted input.
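As a rough sketch of how this could look for the markup in the question (the function name headers_and_items and the PAGE constant are hypothetical): the code uses find_previous_sibling("span") rather than .previous_sibling, because .previous_sibling can return the whitespace text node between tags instead of the span. The parser defaults to html.parser so the sketch runs without lxml installed; passing "lxml" as the second argument switches parsers.

```python
from bs4 import BeautifulSoup

# The markup from the question, repeated ids included.
PAGE = """
<div id="big">
  <span>header 1</span>
  <ul id="outer">
    <li id="inner">aaa</li>
    <li id="inner">bbb</li>
  </ul>
  <span>header 2</span>
  <ul id="outer">
    <li id="inner">ccc</li>
    <li id="inner">ddd</li>
  </ul>
</div>
"""


def headers_and_items(page, parser="html.parser"):
    """Return each header followed by its li values, in document order."""
    soup = BeautifulSoup(page, parser)
    lines = []
    for ul in soup.find("div", id="big").find_all("ul"):
        # Nearest preceding sibling that is a span tag; text nodes are skipped.
        header = ul.find_previous_sibling("span")
        if header is not None:
            lines.append(header.get_text(strip=True))
        for li in ul.find_all("li"):
            lines.append(li.get_text(strip=True))
    return lines


for line in headers_and_items(PAGE):
    print(line)
```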
Source: https://stackoverflow.com/questions/16262532/lxml-python-get-previous-sibling