BeautifulSoup, extracting strings within HTML tags, ResultSet objects

Submitted by 帅比萌擦擦* on 2020-01-23 08:40:26

Question


I am confused about exactly how to use the ResultSet object in BeautifulSoup, i.e. bs4.element.ResultSet.

After using find_all(), how can one extract text?

Example:

In the bs4 documentation, the HTML document html_doc looks like:

<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>

One begins by creating the soup and finding all the a tags:

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')
soup.find_all('a')

which outputs

 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, 
  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

We could also do

for link in soup.find_all('a'):
    print(link.get('href'))

which outputs

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

I would like to get only the text of the tags with class "sister", i.e.

Elsie
Lacie
Tillie

One could try

soup.find_all('a').get_text()

but this results in an error:

AttributeError: 'ResultSet' object has no attribute 'get_text'

Answer 1:


Do a find_all() filtering on class_='sister'.

Note the underscore after class. It is needed because class is a reserved word in Python. The Beautiful Soup documentation explains:

It’s very useful to search for a tag that has a certain CSS class, but the name of the CSS attribute, “class”, is a reserved word in Python. Using class as a keyword argument will give you a syntax error. As of Beautiful Soup 4.1.2, you can search by CSS class using the keyword argument class_.

Source: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
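As a minimal sketch of the point above, the snippet below compares the class_ keyword with the attrs dictionary form, where 'class' is just a string key and needs no underscore. The shortened html_doc and the extra "other" link are hypothetical and are there only to show that the filter excludes non-matching tags.

```python
from bs4 import BeautifulSoup

# A shortened, hypothetical document: one matching link and one that should be skipped.
html_doc = '''<p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="other" href="http://example.com/other" id="link4">Other</a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Keyword form: note the trailing underscore in class_.
by_keyword = soup.find_all('a', class_='sister')

# Dictionary form: 'class' is an ordinary string key here, so no underscore is needed.
by_attrs = soup.find_all('a', attrs={'class': 'sister'})

# Collect the id of each match so the two results are easy to compare.
ids_by_keyword = [tag['id'] for tag in by_keyword]
ids_by_attrs = [tag['id'] for tag in by_attrs]
print(ids_by_keyword)
print(ids_by_attrs)
```

Both calls should select the same single tag here; which form to use is mostly a matter of taste.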

Once you have all of the tags with class sister, call .text on each one to get its text. Be sure to strip the result, because the text inside each tag is surrounded by newlines and indentation.

For example:

from bs4 import BeautifulSoup

html_doc = '''<p class="story">
    Once upon a time there were three little sisters; and their names were
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    ,
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
    and
    <a class="sister" href="http://example.com/tillie" id="link3">
     Tillie
    </a>
    ; and they lived at the bottom of a well.
   </p>'''

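soup = BeautifulSoup(html_doc, 'html.parser')
# find_all() returns a ResultSet, which behaves like a list of Tag objects
sistertags = soup.find_all(class_='sister')
for tag in sistertags:
    # .text belongs to each Tag; strip() removes the surrounding whitespace
    print(tag.text.strip())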

Output:

(bs4)macbook:bs4 joeyoung$ python bs4demo.py
Elsie
Lacie
Tillie
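
The following sketch may help show why the error in the question appears. It assumes a shortened two-link version of the document. A ResultSet behaves like a list of Tag objects, so get_text() has to be called on each element, not on the ResultSet itself. The variable names (tags, names, raised) are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened version of the example document, keeping the original whitespace.
html_doc = '''<p class="story">
    <a class="sister" href="http://example.com/elsie" id="link1">
     Elsie
    </a>
    <a class="sister" href="http://example.com/lacie" id="link2">
     Lacie
    </a>
</p>'''

soup = BeautifulSoup(html_doc, 'html.parser')
tags = soup.find_all(class_='sister')

# Per-tag call: get_text(strip=True) removes the whitespace around each string.
names = [tag.get_text(strip=True) for tag in tags]
print(names)

# Calling get_text() on the ResultSet itself raises AttributeError,
# because the method lives on the individual Tag objects.
try:
    tags.get_text()
    raised = False
except AttributeError:
    raised = True
print(raised)
```

Passing strip=True to get_text() is an alternative to calling .text.strip() as in the answer above.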


Source: https://stackoverflow.com/questions/33510881/beautifulsoup-extracting-strings-within-html-tags-resultset-objects
