BeautifulSoup - How to get all text between two different tags?

好久不见. Posted on 2020-01-03 13:00:29

Question


I would like to get all text between two tags:

<div class="lead">I DONT WANT this</div>

# many different tags (p, table, h2), including the text that I want

<div class="image">...</div>

I started this way:

import urllib.request
from bs4 import BeautifulSoup

url = "http://......."
req = urllib.request.Request(url)
source = urllib.request.urlopen(req)
soup = BeautifulSoup(source, 'lxml')

start = soup.find('div', {'class': 'lead'})
end = soup.find('div', {'class': 'image'})

I have no idea what to do next.


Answer 1:


Try the code below:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""
    <html><div class="lead">lead</div>data<div class="end"></div></html>
    """, "lxml")

node = soup.find('div', {'class': 'lead'})
s = []
while node is not None:
    node = node.next_sibling
    # Stop when the siblings run out.
    if node is None:
        break
    # Stop at the tag with class "end"; .get() avoids a KeyError on tags without a class.
    if hasattr(node, "attrs") and "end" in node.attrs.get("class", []):
        break
    s.append(node)
print(s)

This uses next_sibling to step through the sibling nodes that follow the lead div.
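
The same sibling walk can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the function name nodes_between and the sample markup are made up for illustration, and it uses the built-in html.parser so that no extra parser has to be installed. It assumes the start and end tags share the same parent.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def nodes_between(start, end_class):
    """Collect the siblings after `start` until a tag with class `end_class`."""
    found = []
    for node in start.next_siblings:
        # Only Tag objects have attributes; plain text nodes never match.
        if isinstance(node, Tag) and end_class in node.get("class", []):
            break
        found.append(node)
    return found


# Hypothetical sample markup, for illustration only.
sample = '<div class="lead">lead</div>data<p>more</p><div class="end"></div>'
soup = BeautifulSoup(sample, "html.parser")
result = nodes_between(soup.find("div", {"class": "lead"}), "end")
print(result)
```

Iterating over next_siblings removes the need for the manual None check, because the generator simply ends when there are no more siblings.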




Answer 2:


Try this code. It loops over the siblings that follow the div with class lead, breaks out of the loop when it reaches the div with class image, and prints the markup of every node in between; what gets printed can be adjusted as needed:

html = ""
# Look up the end marker once instead of on every iteration.
end = soup.find("div", {"class": "image"})
for tag in soup.find("div", {"class": "lead"}).next_siblings:
    if tag is end:
        break
    html += str(tag)
print(html)
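
Since the question asks for the text rather than the markup, the loop above can be adapted to collect text only. The following is a sketch under stated assumptions: the helper name text_between and the sample markup are invented for illustration, both divs are assumed to share the same parent, and html.parser is used in place of lxml to keep the example dependency-light.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag


def text_between(start, end):
    """Join the text of every sibling after `start` up to, not including, `end`."""
    parts = []
    for node in start.next_siblings:
        if node is end:
            break
        # Tags expose get_text(); bare strings are already text.
        text = node.get_text() if isinstance(node, Tag) else str(node)
        if text.strip():
            parts.append(text.strip())
    return " ".join(parts)


# Hypothetical sample markup, for illustration only.
sample = (
    '<div class="lead">I DONT WANT this</div>'
    "<h2>Title</h2><p>First <b>bold</b> part</p>"
    '<div class="image">...</div>'
)
soup = BeautifulSoup(sample, "html.parser")
start = soup.find("div", {"class": "lead"})
end = soup.find("div", {"class": "image"})
text = text_between(start, end)
print(text)
```

Whitespace-only nodes are dropped and the remaining pieces are joined with single spaces; a different separator may suit pages where the line structure matters.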


Source: https://stackoverflow.com/questions/45346928/beautifulsoup-how-to-get-all-text-between-two-different-tags
