Iterating through a DOM with BeautifulSoup/Python

Submitted by 冷暖自知 on 2019-12-11 03:38:54

Question


I have this DOM:

<h2>Main Section</h2>
<p>Bla bla bla<p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>

I'd like to generate an iterator that returns 'Main Section', 'Bla bla bla', 'Subsection', etc. Is there a way to do this with BeautifulSoup?


Answer 1:


Here's one way to do it. The idea is to iterate over the main sections (the h2 tags) and, for each one, walk its following siblings until the next h2 tag is reached:

from bs4 import BeautifulSoup, Tag


data = """<h2>Main Section</h2>
<p>Bla bla bla<p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>"""


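# Name the parser explicitly to avoid bs4's "no parser specified" warning.
# "lxml" is an assumption (the original call named no parser); it needs the lxml package.
soup = BeautifulSoup(data, "lxml")
for main_section in soup.find_all('h2'):
    # Walk everything that follows this h2 at the same level.
    for sibling in main_section.next_siblings:
        # Skip bare strings such as the whitespace between tags.
        if not isinstance(sibling, Tag):
            continue
        # The next h2 starts a new main section.
        if sibling.name == 'h2':
            break
        print(sibling.text)
    print("-------")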

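prints the following. The h2 text itself is not printed, and the blank lines after "Bla bla bla" come from the unclosed <p> in the first section, which a lenient parser repairs as an extra, nearly empty paragraph: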

Bla bla bla


Subsection
Some more info
Subsection 2
Even more info!
-------
bla
Subsection
Some more info
Subsection 2
Even more info!
-------
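
Since the question asks for an iterator rather than printed output, here is a minimal sketch of a generator that yields the text of every heading and paragraph in document order, main section titles included. It swaps the sibling loop for a single find_all call over several tag names. The function name iter_section_texts, the SAMPLE_HTML variable, and the choice of the built-in "html.parser" are assumptions made for illustration, and the sample closes the first <p> properly so the result does not depend on how a parser repairs it.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the markup from the question, with the first <p> closed.
SAMPLE_HTML = """<h2>Main Section</h2>
<p>Bla bla bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>


<h2>Main Section 2</h2>
<p>bla</p>
<h3>Subsection</h3>
<p>Some more info</p>

<h3>Subsection 2</h3>
<p>Even more info!</p>"""


def iter_section_texts(html, tag_names=("h2", "h3", "p")):
    """Yield the stripped text of each matching tag, in document order."""
    # "html.parser" ships with Python, so no extra parser package is needed.
    soup = BeautifulSoup(html, "html.parser")
    # Passing a list of names makes find_all match any of them.
    for tag in soup.find_all(list(tag_names)):
        text = tag.get_text(strip=True)
        if text:  # skip tags that contain only whitespace
            yield text


if __name__ == "__main__":
    for text in iter_section_texts(SAMPLE_HTML):
        print(text)
```

Because this is a generator, the texts are produced lazily; wrap the call in list() if all of them are needed at once.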

Hope that helps.



Source: https://stackoverflow.com/questions/22496401/iterating-through-a-dom-with-beautifulsoup-python
