问题
I have HTML like:
<tr>
<td>Title:</td>
<td>Title value</td>
</tr>
I have to specify after which <td>
with text i want to grab text of second <td>
. Something like: Grab text of first next <td>
after <td>
which contain text Title:
. Result should be: Title value
I have some basic understanding of Python and BeutifulSoupno and i have no idea how can I do this when there is no class
to specify.
I have tried this:
row = soup.find_all('td', string='Title:')
text = str(row.nextSibling)
print(text)
and I receive error: AttributeError: 'ResultSet' object has no attribute 'nextSibling'
回答1:
First of all, soup.find_all()
returns a ResultSet
which contains all the elements with tag td
and string as Title:
.
For each such element in the result set , you will need to get the nextSibling separately (also, you should loop through until you find the nextSibling of tag td
, since you can get other elements in between (like a NavigableString)).
Example -
>>> from bs4 import BeautifulSoup
>>> s="""<tr>
... <td>Title:</td>
... <td>Title value</td>
... </tr>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> row = soup.find_all('td', string='Title:')
>>> for r in row:
... nextSib = r.nextSibling
... while nextSib.name != 'td' and nextSib is not None:
... nextSib = nextSib.nextSibling
... print(nextSib.text)
...
Title value
Or you can use another library that has support for XPATH , and with Xpath you can do this easily. Other libraries like - lxml
or xml.etree
.
回答2:
What you're intending to do is relatively easier with lxml using xpath. You can try something like this,
from lxml import etree
tree = etree.parse(<your file>)
path_list = tree.xpath('//<xpath to td>')
for i in range(0, len(path_list)) :
if path_list[i].text == '<What you want>' and i != len(path_list) :
your_text = path_list[i+1].text
来源:https://stackoverflow.com/questions/31638311/beautifulsoup-how-to-extract-text-after-specified-string