BeautifulSoup - How to extract text after specified string

Question

I have HTML like:

<tr>
    <td>Title:</td>
    <td>Title value</td>
</tr>

I need to identify a <td> by its text and then grab the text of the <td> that follows it. In other words: get the text of the first <td> after the <td> that contains the text Title:. The result should be: Title value

I have a basic understanding of Python and BeautifulSoup, but I have no idea how to do this when there is no class to select on.

I have tried this:

row =  soup.find_all('td', string='Title:')
text = str(row.nextSibling)
print(text)

and I receive the error: AttributeError: 'ResultSet' object has no attribute 'nextSibling'

Answer 1:

First of all, soup.find_all() returns a ResultSet containing every td element whose string is Title:, not a single element, which is why it has no nextSibling attribute.

For each element in the result set, you need to get the nextSibling separately (in current BeautifulSoup the documented spelling is next_sibling). You should also keep stepping until you reach a sibling whose tag is td, because other nodes (such as a NavigableString holding whitespace) can sit in between.

Example -

>>> from bs4 import BeautifulSoup
>>> s="""<tr>
...     <td>Title:</td>
...     <td>Title value</td>
... </tr>"""
>>> soup = BeautifulSoup(s,'html.parser')
>>> row =  soup.find_all('td', string='Title:')
>>> for r in row:
...     nextSib = r.nextSibling
...     # Check for None first, then skip text nodes until the next <td>
...     while nextSib is not None and nextSib.name != 'td':
...         nextSib = nextSib.nextSibling
...     if nextSib is not None:
...         print(nextSib.text)
...
Title value
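
The manual sibling loop above can also be written with BeautifulSoup's find_next_sibling(), which skips the text nodes in between for you. The following is only a minimal sketch under the same sample HTML; the helper name text_of_next_cell is hypothetical and not part of the library.

```python
from bs4 import BeautifulSoup

html = """<tr>
    <td>Title:</td>
    <td>Title value</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")


def text_of_next_cell(soup, label):
    """Hypothetical helper: text of the first <td> after the <td> whose string is `label`."""
    # find() returns a single Tag (or None), unlike find_all(), which returns a ResultSet
    label_cell = soup.find("td", string=label)
    if label_cell is None:
        return None
    # find_next_sibling("td") ignores whitespace text nodes between the cells
    value_cell = label_cell.find_next_sibling("td")
    if value_cell is None:
        return None
    return value_cell.get_text(strip=True)


result = text_of_next_cell(soup, "Title:")
print(result)
```

The helper returns None when either the label cell or a following <td> is missing, so the caller can tell "not found" apart from an empty cell.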

Alternatively, you can use a library that supports XPath, which makes this easy: for example lxml, or the standard library's xml.etree (which implements only a limited subset of XPath).

Answer 2:

What you intend to do is relatively easy with lxml using XPath. You can try something like this:
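
As a rough illustration of the xml.etree route: ElementTree's XPath subset has no following-sibling axis, so one option is to pair each cell with the one after it by hand. This sketch assumes the markup is well-formed XML (true for the sample, often not for real HTML), and the helper name value_after_label is hypothetical.

```python
import xml.etree.ElementTree as ET

markup = """<tr>
    <td>Title:</td>
    <td>Title value</td>
</tr>"""

# Assumption: the input parses as XML; real-world HTML frequently does not
root = ET.fromstring(markup)


def value_after_label(root, label):
    """Hypothetical helper: text of the <td> that directly follows the <td> holding `label`."""
    # iter("tr") also yields the root itself when the root is a <tr>
    for row in root.iter("tr"):
        cells = row.findall("td")
        # Pair every cell with its immediate successor in the same row
        for current, following in zip(cells, cells[1:]):
            if (current.text or "").strip() == label:
                return (following.text or "").strip()
    return None


value = value_after_label(root, "Title:")
print(value)
```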

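from lxml import etree
# For HTML input, pass etree.HTMLParser() as the second argument to parse()
tree = etree.parse(<your file>)
path_list = tree.xpath('//<xpath to td>')
for i in range(len(path_list)):
    # Make sure a following element exists before reading it
    if path_list[i].text == '<What you want>' and i + 1 < len(path_list):
        your_text = path_list[i+1].text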
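
The index-based loop above can likely be replaced by a single XPath expression that uses the following-sibling axis. The sketch below is one possible way to write it, not the answer author's code: the sample markup is wrapped in a <table> purely as an assumption so the HTML parser keeps the row intact, and the helper name xpath_value_after is hypothetical.

```python
from lxml import etree

markup = """<table><tr>
    <td>Title:</td>
    <td>Title value</td>
</tr></table>"""

# Parse as HTML so that imperfect real-world markup is tolerated
root = etree.fromstring(markup, etree.HTMLParser())


def xpath_value_after(root, label):
    """Hypothetical helper: text of the first <td> sibling after the <td> whose text is `label`."""
    # $label is an XPath variable filled from the keyword argument,
    # which avoids quoting problems inside the expression
    cells = root.xpath(
        "//td[normalize-space()=$label]/following-sibling::td[1]",
        label=label,
    )
    if not cells:
        return None
    return (cells[0].text or "").strip()


your_text = xpath_value_after(root, "Title:")
print(your_text)
```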

Source: https://stackoverflow.com/questions/31638311/beautifulsoup-how-to-extract-text-after-specified-string

Tags

python

python-3.x

beautifulsoup

extract