Question
Given the following (simplified from a larger document):
<tr class="row-class">
<td>Age</td>
<td>16</td>
</tr>
<tr class="row-class">
<td>Height</td>
<td>5.6</td>
</tr>
<tr class="row-class">
<td>Weight</td>
<td>103.4</td>
</tr>
I have tried to return the 16 from the appropriate row using bs4 and lxml. The issue seems to be that there is a NavigableString between the two td tags, so that
page.find_all("tr", {"class":"row-class"})
yields a result set with
result[0] = {Tag} <tr class="row-class"> <td>Age</td> <td>16</td> </tr>
result[1] = {Tag} <tr class="row-class"> <td>Height</td> <td>5.6</td> </tr>
result[2] = {Tag} <tr class="row-class"> <td>Weight</td> <td>103.4</td> </tr>
which is great, but I can't get the string in the second td. The contents of each of these rows are similar to
[' ', <td>Age</td>, ' ', <td>16</td>, ' ']
with the td being a Tag and the ' ' being a NavigableString. This difference is preventing me from using the next_element or next_sibling convenience methods to access the correct text.
If I use:
find("td", text=re.compile(r'Age')).get_text()
I get Age. But if I try to access the next element via
find("td", text=re.compile(r'Age')).next_element()
I get
'NavigableString' object is not callable
Because of the wrapping NavigableStrings in the result, moving backwards with previous_element has the same problem.
How do I move from the found Tag to the next Tag, skipping the next_element in between? Is there a way to remove these ' ' from the result?
I should point out that I've already tried to be pragmatic with something like:
for r in sp.find_all("tr", {"class": "row-class"}):
    age = r.find("td", text=re.compile(r"\d\d")).get_text()
It works ... until I parse a document that has a different order, with a cell matching \d\d before Age.
I also know that I can use
find("td", text=re.compile(r'Age')).next_sibling.next_sibling
but that hard-codes the structure.
So I need to be specific in the search and find the td that has the target string, then find the value in the next td. I know I could build some logic that tests each row, but it seems like I'm missing something obvious and more elegant...
Answer 1:
If you get a list of elements, you can use [index] to get an element from the list.
data = """<tr class="row-class">
<td>Age</td>
<td>16</td>
</tr>
<tr class="row-class">
<td>Height</td>
<td>5.6</td>
</tr>
<tr class="row-class">
<td>Weight</td>
<td>103.4</td>
</tr>"""
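from bs4 import BeautifulSoup
# Name the parser explicitly; without it bs4 guesses one and emits a warning
soup = BeautifulSoup(data, "html.parser")
trs = soup.find_all("tr", {"class": "row-class"})
for tr in trs:
    tds = tr.find_all("td")  # find_all returns a list containing only the <td> tags
    print('text:', tds[0].get_text())   # element [0] of the list: the label cell
    print('value:', tds[1].get_text())  # element [1] of the list: the value cell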
Result:
text: Age
value: 16
text: Height
value: 5.6
text: Weight
value: 103.4
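If the goal is to look up one value by its label rather than print every row, find_next_sibling("td") may be closer to what the question asks for: it walks the following siblings and returns the first one that is a td Tag, so the whitespace NavigableString in between is skipped. The following is only a minimal sketch. The helper name value_for is made up for illustration, and it assumes a bs4 version that accepts the string= argument (older versions use text= as in the question).

```python
import re
from bs4 import BeautifulSoup

data = """<tr class="row-class">
<td>Age</td>
<td>16</td>
</tr>
<tr class="row-class">
<td>Height</td>
<td>5.6</td>
</tr>
<tr class="row-class">
<td>Weight</td>
<td>103.4</td>
</tr>"""

soup = BeautifulSoup(data, "html.parser")


def value_for(soup, label):
    """Hypothetical helper: text of the <td> that follows the <td> containing `label`."""
    # Find the label cell; re.escape treats the label as literal text
    label_td = soup.find("td", string=re.compile(re.escape(label)))
    if label_td is None:
        return None
    # find_next_sibling("td") skips the whitespace NavigableString between cells
    value_td = label_td.find_next_sibling("td")
    if value_td is None:
        return None
    return value_td.get_text(strip=True)


print(value_for(soup, "Age"))  # prints 16
```

This does not depend on how many whitespace strings sit between the two cells, which is the part that next_sibling.next_sibling hard-codes. It still assumes the value cell comes after the label cell within the same row.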
Source: https://stackoverflow.com/questions/35050496/beautifulsoup-find-the-next-specific-tag-following-a-found-tag