Question
I'm trying to parse a number of web pages containing text, tables and HTML. Every page has a different number of paragraphs, and while every paragraph begins with an opening <div>, the closing </div> tags do not occur until the end. I just want to get the content, filtering out certain elements and replacing them with something else.
Desired result: text1 <b>text2</b> (table deleted) text3
Actual result: text1\n\ntext2some text heretext 3text2some text heretext 3 (table deleted)
from bs4 import BeautifulSoup
html = """
<h1>title</h1>
<h3>extra data</h3>
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>"""
soup = BeautifulSoup(html, 'html5lib')
tags = soup.find('h3').find_all_next()
contents = ""
for tag in tags:
    if tag.name == 'table':
        contents += " (table deleted) "
    contents += tag.text.strip()
print(contents)
Answer 1:
Don't use html5lib as the parser; use html.parser instead. That said, you can access the "div" immediately after your "h3" tag using a CSS selector and the select_one method.
From there, you can unwrap the nested "div" tag and replace the "table" tag using the replace_with method.
In [107]: from bs4 import BeautifulSoup
In [108]: html = """
...: <h1>title</h1>
...: <h3>extra data</h3>
...: <div>
...: text1
...: <div>
...: <b>next2</b><table>some text here</table>text 3
...: </div>
...: </div>"""
In [109]: soup = BeautifulSoup(html, 'html.parser')
In [110]: my_div = soup.select_one('h3 + div')
In [111]: my_div
Out[111]:
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>
In [112]: my_div.div.unwrap()
Out[112]: <div></div>
In [113]: my_div
Out[113]:
<div>
text1
<b>next2</b><table>some text here</table>text 3
</div>
In [114]: my_div.table.replace_with('(table deleted)')
Out[114]: <table>some text here</table>
In [115]: my_div
Out[115]:
<div>
text1
<b>next2</b>(table deleted)text 3
</div>
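The session above handles one nested "div" and one "table". For pages with a varying number of them, the same steps can be wrapped in a loop. The following is only a sketch under some assumptions: the function name extract_content and its placeholder parameter are made up for illustration, and collapsing whitespace at the end is my own addition to get close to the single-line result asked for in the question (it would also collapse whitespace inside elements such as "pre").

```python
from bs4 import BeautifulSoup

HTML = """
<h1>title</h1>
<h3>extra data</h3>
<div>
text1
<div>
<b>next2</b><table>some text here</table>text 3
</div>
</div>"""


def extract_content(html, placeholder=" (table deleted) "):
    """Hypothetical helper: return the markup of the <div> right after <h3>,
    with tables replaced by a placeholder and nested <div> tags unwrapped."""
    soup = BeautifulSoup(html, "html.parser")
    outer = soup.select_one("h3 + div")
    if outer is None:
        # No <div> directly follows an <h3>; nothing to extract.
        return ""

    # Replace every table inside the outer div with the placeholder text.
    for table in outer.find_all("table"):
        table.replace_with(placeholder)

    # Remove each nested <div> tag but keep its children in place.
    for inner in outer.find_all("div"):
        inner.unwrap()

    # decode_contents() renders the children without the outer <div> tags;
    # split/join then collapses runs of whitespace (an assumption, see above).
    return " ".join(outer.decode_contents().split())


print(extract_content(HTML))
# prints: text1 <b>next2</b> (table deleted) text 3
```

If the "b" tags should be dropped as well, calling outer.get_text() instead of outer.decode_contents() would return only the text.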
Source: https://stackoverflow.com/questions/42593383/parsing-nested-divs-with-beautifulsoup