Question
I want to remove duplicated placeholders from an HTML page, so I use the .next_sibling attribute of BeautifulSoup. As long as the duplicates are on the same line, this works fine (see data). But sometimes there is an empty line between them, so I want .next_sibling to ignore those lines (see data2).
Here is the code:
from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
"""
soup = BeautifulSoup(data)
string = 'method-removed-here'
for p in soup.find_all("p"):
    while isinstance(p.next_sibling, Tag) and p.next_sibling.name == 'p' and p.text == string:
        p.next_sibling.decompose()
print(soup)
Output for data is as expected:
<html><head></head><body><p>method-removed-here</p></body></html>
Output for data2 (this needs to be fixed):
<html><head></head><body><p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
</body></html>
I couldn't find anything useful about this in the BeautifulSoup4 documentation, and .next_element is not what I am looking for either.
Answer 1:
I solved this issue with a workaround. The problem is described in the BeautifulSoup Google group, where they suggest running a preprocessor over the HTML before parsing it:
import re

def bs_preprocess(html):
    """Remove distracting whitespace and newline characters."""
    pat = re.compile(r'(^\s+)|(\s+$)', re.MULTILINE)
    html = re.sub(pat, '', html)        # remove leading and trailing whitespace
    html = re.sub(r'\n', ' ', html)     # convert newlines to spaces
                                        # (keeps a delimiter where each newline was)
    html = re.sub(r'\s+<', '<', html)   # remove whitespace before any tag
    html = re.sub(r'>\s+', '>', html)   # remove whitespace after any tag
    return html
It's not the best solution, but it is one.
Answer 2:
Also not a great solution, but this worked for me:
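As an alternative sketch (this is not the approach from the Google group post), the whitespace-only text nodes can be dropped after parsing instead of editing the raw HTML with regular expressions. strip_whitespace_nodes is a hypothetical helper name, and the snippet assumes the built-in html.parser backend and a bs4 version that accepts the string= argument.

```python
from bs4 import BeautifulSoup, NavigableString

def strip_whitespace_nodes(soup):
    """Remove every text node that contains only whitespace (hypothetical helper)."""
    # find_all returns a list, so extracting nodes while looping is safe.
    for node in soup.find_all(string=True):
        if isinstance(node, NavigableString) and not node.strip():
            node.extract()
    return soup

soup = BeautifulSoup("<p>a</p>\n<p>b</p>\n", "html.parser")
strip_whitespace_nodes(soup)
print(soup)  # <p>a</p><p>b</p>
```

After this, .next_sibling on the first <p> returns the second <p> directly. Note that it also removes whitespace-only text inside elements such as <pre>, which may or may not be acceptable.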
def get_sibling(element):
    sibling = element.next_sibling
    if sibling == "\n":
        return get_sibling(sibling)
    else:
        return sibling
Answer 3:
A slight improvement on neurosnap's answer, making it more general:
def next_elem(element, func):
    new_elem = getattr(element, func)
    if new_elem == "\n":
        return next_elem(new_elem, func)
    else:
        return new_elem
Now you can use it with any navigation attribute, for example:
next_elem(element, 'previous_sibling')
Answer 4:
Use find_next_sibling() instead of next_sibling. The same goes for find_previous_sibling() instead of previous_sibling.
Reason: next_sibling does not necessarily return the next HTML tag, but rather the next "soup element". Usually that is only a newline, but it can be more. find_next_sibling(), on the other hand, returns the next HTML tag, ignoring whitespace and other cruft between the tags.
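A minimal sketch of the difference, assuming the built-in html.parser backend and a made-up two-paragraph snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>first</p>\n<p>second</p>", "html.parser")
first = soup.p

# next_sibling returns the text node that holds the newline.
print(repr(first.next_sibling))        # '\n'

# find_next_sibling() skips text nodes and returns the next tag.
print(first.find_next_sibling())       # <p>second</p>

# A tag name can be passed to restrict the search further.
print(first.find_next_sibling("p"))    # <p>second</p>
```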
I restructured your code a bit for this demonstration; I hope it is semantically the same.
Code with next_sibling, demonstrating the same behaviour that you described (works for data but not data2):
from bs4 import BeautifulSoup, Tag
data = "<p>method-removed-here</p><p>method-removed-here</p><p>method-removed-here</p>"
data2 = """<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
"""
soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.next_sibling
        if isinstance(ns, Tag) and ns.name == 'p' and p.text == string:
            ns.decompose()
        else:
            break
print(soup)
Code with find_next_sibling(), which works for both data and data2:
soup = BeautifulSoup(data, 'html.parser')
string = 'method-removed-here'
for p in soup.find_all("p"):
    while True:
        ns = p.find_next_sibling()
        if isinstance(ns, Tag) and ns.name == 'p' and p.text == string:
            ns.decompose()
        else:
            break
print(soup)
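For reuse, the same idea could be wrapped in a function. The sketch below is one possible variant, not the code from the question: remove_adjacent_duplicates is a hypothetical name, it looks backwards with find_previous_sibling(), and it only removes a tag when both it and the preceding tag carry the placeholder text (slightly stricter than the loop above). Duplicates are collected first and removed afterwards, so no removed tag is touched while iterating.

```python
from bs4 import BeautifulSoup

def remove_adjacent_duplicates(soup, text, name="p"):
    """Keep only the first tag of each run of adjacent <name> tags equal to `text`."""
    duplicates = []
    for tag in soup.find_all(name):
        prev = tag.find_previous_sibling()  # previous tag, ignoring text nodes
        if (prev is not None and prev.name == name
                and prev.get_text() == text and tag.get_text() == text):
            duplicates.append(tag)
    for tag in duplicates:
        tag.decompose()
    return soup

data2 = """<p>method-removed-here</p>
<p>method-removed-here</p>
<p>method-removed-here</p>
"""
soup = BeautifulSoup(data2, "html.parser")
remove_adjacent_duplicates(soup, "method-removed-here")
print(len(soup.find_all("p")))  # 1
```

The newlines that sat between the removed tags stay in the document; only the duplicate tags are dropped.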
The same behaviour of next_sibling (returning every soup element, including whitespace) shows up in other parts of BeautifulSoup; see: BeautifulSoup .children or .content without whitespace between tags
Source: https://stackoverflow.com/questions/23241641/how-to-ignore-empty-lines-while-using-next-sibling-in-beautifulsoup4-in-python
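A small sketch of that behaviour with .children, assuming html.parser and a made-up <div> snippet. Filtering on Tag, or calling find_all(recursive=False), are two ways to get only the child tags:

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div>\n<p>a</p>\n<p>b</p>\n</div>", "html.parser")
div = soup.div

# .children yields text nodes as well as tags.
all_children = list(div.children)
print(len(all_children))  # 5: '\n', <p>a</p>, '\n', <p>b</p>, '\n'

# Keep only the tags.
tag_children = [child for child in div.children if isinstance(child, Tag)]
print(tag_children)       # [<p>a</p>, <p>b</p>]

# find_all(recursive=False) returns direct child tags only.
direct_tags = div.find_all(recursive=False)
print(direct_tags)        # [<p>a</p>, <p>b</p>]
```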