Python + BeautifulSoup: How to get wrapper out of HTML based on text?

為{幸葍}努か submitted on 2019-12-20 06:18:26

Question


I would like to get the element that wraps a given piece of text. For example, given this HTML:

…
<div class="target">chicken</div>
<div class="not-target">apple</div>
…

Based on the text "chicken", I would like to get back <div class="target">chicken</div>.

Currently, I have the following to fetch the HTML:

import requests
from bs4 import BeautifulSoup

req = requests.get(url).text  # .text holds the decoded response body
soup = BeautifulSoup(req, 'html.parser')

Right now I just call soup.find_all('div', …) and loop through every div to find the wrapper I am looking for.

Without looping through every div, what is the proper and most efficient way to fetch the wrapper element from the HTML based on a given text?

Thank you in advance; I will be sure to accept/upvote the answer!


Answer 1:


# coding: utf-8

html_doc = """
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
    <title> Last chicken leg on stock! Only 500$ !!! </title>
  </head>
  <body>
    <div id="layer1" class="class1">
        <div id="layer2" class="class2">
            <div id="layer3" class="class3">
                <div id="layer4" class="class4">
                    <div id="layer5" class="class5">
                      <p>My chicken has <span style="color:blue">ONE</span> leg :P</p>
                        <div id="layer6" class="class6">
                            <div id="layer7" class="class7">
                              <div id="chicken_surname" class="chicken">eat me</div>
                                <div id="layer8" class="class8">
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
  </body>
</html>"""

from bs4 import BeautifulSoup as BS
import re
soup = BS(html_doc, "lxml")  # needs the lxml package; "html.parser" also works here


# the tag -> text direction is straightforward: look the tag up by class or id
tag = soup.find('div', class_="chicken")
tag2 = soup.find('div', {'id':"chicken_surname"})
print('\n###### by_cls:')
print(tag)
print('\n###### by_id:')
print(tag2)

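# the text -> tag direction is trickier, especially when matching a substring
tag_by_str = soup.find(string="eat me")             # exact match: returns the string
tag_by_sub = soup.find(string="eat")                # a plain string must match exactly: returns None
tag_by_resub = soup.find(string=re.compile("eat"))  # a regex matches substrings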
print('\n###### tag_by_str:')
print(tag_by_str)
print('\n###### tag_by_sub:')
print(tag_by_sub)
print('\n###### tag_by_resub:')
print(tag_by_resub)

# there is more than one way to access the underlying strings
# .text and .strings give different results - see the output
tag = soup.find('p')

print('\n###### .text attr:')
print(tag.text, type(tag.text))

print('\n###### .strings generator:')
for s in tag.strings:   # .strings is a generator object
    print(s, type(s))

# note that the .strings generator yields bs4.element.NavigableString objects
# so we can use them to navigate, for example by accessing their parents:
print('\n###### NavigableString parents:')
for s in tag.strings:
    print(s.parent)

# or even grandparents :)
print('\n###### grandparents:')
for s in tag.strings:
    print(s.parent.parent)
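
Applied to the exact case in the question, the same idea can be sketched as below. This is only a sketch: it assumes the text sits directly inside the wrapper you want, the sample HTML is made up, and find_wrappers is a hypothetical helper that is not part of Beautiful Soup or of the code above.

```python
import re
from bs4 import BeautifulSoup

# Made-up sample markup, loosely based on the question.
html = """
<div class="target">chicken</div>
<div class="not-target">apple</div>
<div class="nested"><p>roast <b>chicken</b> legs</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Exact match: find the text node, then step up to the tag that wraps it.
node = soup.find(string="chicken")
wrapper = node.parent if node is not None else None
print(wrapper)  # <div class="target">chicken</div>

# The same lookup in one call: a div whose only content is exactly this string.
direct = soup.find("div", string="chicken")


def find_wrappers(soup, text, tag_name=None):
    """Hypothetical helper: return the tags wrapping text nodes that contain `text`.

    With tag_name=None the immediate parent of each text node is returned;
    otherwise the nearest enclosing tag with that name (None if there is none).
    """
    nodes = soup.find_all(string=re.compile(re.escape(text)))
    if tag_name is None:
        return [n.parent for n in nodes]
    return [n.find_parent(tag_name) for n in nodes]


for tag in find_wrappers(soup, "chicken", "div"):
    print(tag)
```

Beautiful Soup still has to walk the tree internally, so this is not necessarily faster than a manual loop; it only keeps the loop out of your own code. If the text is nested several levels deep, find_parent lets you pick which enclosing tag counts as the wrapper.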


Source: https://stackoverflow.com/questions/44318342/python-beautifulsoup-how-to-get-wrapper-out-of-html-based-on-text
