Question
I am trying to strip certain HTML tags and their content from a file with BeautifulSoup. How can I remove the lines that become empty after calling decompose()? In this example, I want the empty line between a and 3 to be removed, since that is where the <span>...</span> block was, but the trailing newline at the end should stay.
from bs4 import BeautifulSoup
Rmd_data = 'a\n<span class="answer">\n2\n</span>\n3\n'
print(Rmd_data)
# OUTPUT
# a
# <span class="answer">
# 2
# </span>
# 3
#
# END OUTPUT
soup = BeautifulSoup(Rmd_data, "html.parser")
answers = soup.find_all("span", "answer")
for a in answers:
    a.decompose()
Rmd_data = str(soup)
print(Rmd_data)
# OUTPUT
# a
#
# 3
#
# END OUTPUT
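Answer 1:
The easiest way to remove the empty lines is with the re module. Match a newline, any whitespace, and another newline, and replace the run with a single newline. Note that the fourth positional argument of re.sub is count, not flags, and no flag is needed for this pattern.
import re
# text is the string produced by str(soup)
text = re.sub(r'\n\s*\n', '\n', text)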
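Applied to the example from the question, the regex approach can be sketched as follows. The helper name `strip_answers` is made up for illustration. Note that the substitution also collapses blank lines that were already present in the input, not only the ones left behind by decompose().

```python
import re

from bs4 import BeautifulSoup


def strip_answers(source: str) -> str:
    """Remove <span class="answer"> blocks and collapse resulting blank lines.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(source, "html.parser")
    for tag in soup.find_all("span", "answer"):
        tag.decompose()
    # A newline, optional whitespace, then another newline becomes one newline.
    # A single trailing newline is left alone because it has no second newline.
    return re.sub(r"\n\s*\n", "\n", str(soup))


result = strip_answers('a\n<span class="answer">\n2\n</span>\n3\n')
print(repr(result))
```

For the sample input this should leave `a` and `3` on consecutive lines and keep the final newline.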
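Answer 2:
I'm surprised that BeautifulSoup does not offer an option for this. It does have prettify(), but that reformats the whole document, putting each tag and each string on its own line. Instead of manipulating the HTML manually, you could try re-parsing it: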
str(BeautifulSoup(str(soup), 'html.parser'))
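If re-parsing leaves the blank line in place, an alternative that stays inside the parse tree is to trim the newline that follows each tag before decomposing it. This is a sketch under assumptions, not the answer author's method: the helper name `remove_with_line` is hypothetical, and the answer spans are assumed not to be nested inside each other.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString


def remove_with_line(source: str) -> str:
    """Decompose <span class="answer"> tags together with the newline after them.

    Hypothetical helper for illustration; assumes the spans are not nested.
    """
    soup = BeautifulSoup(source, "html.parser")
    for tag in soup.find_all("span", "answer"):
        following = tag.next_sibling
        # Only touch plain text nodes (not comments) that start with a newline.
        if type(following) is NavigableString and following.startswith("\n"):
            following.replace_with(NavigableString(following[1:]))
        tag.decompose()
    return str(soup)


print(repr(remove_with_line('a\n<span class="answer">\n2\n</span>\n3\n')))
```

Unlike a regex over the whole string, this only removes the line break directly after a removed tag, so blank lines elsewhere in the document are left untouched.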
As always, enjoy.
Source: https://stackoverflow.com/questions/42286777/remove-lines-getting-empty-after-beautifulsoup-decompose