Question
I have a Python script that cleans scraped HTML content. It uses BeautifulSoup4 and works pretty well. Recently I decided to learn lxml, but I find the tutorials harder (for me) to follow. For example, I use the following code to merge multiple <br /> tags into one, i.e. if there is more than one consecutive <br /> tag, remove all but the first:
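from bs4 import BeautifulSoup, Tag

data = 'foo<br /><br>bar. <p>foo<br/><br id="1"><br/>bar'
# The output shown below (with <html><body> wrappers) matches the lxml parser.
soup = BeautifulSoup(data, "lxml")
for br in soup.find_all("br"):
    # Remove every <br> that directly follows this one.
    while isinstance(br.next_sibling, Tag) and br.next_sibling.name == 'br':
        br.next_sibling.extract()
print(soup)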
<html><body><p>foo<br/>bar. </p><p>foo<br/>bar</p></body></html>
How do I achieve something similar in lxml? Thanks.
Answer 1:
You could try the .drop_tag() method to remove duplicate consecutive occurrences of the <br/> tag:
from lxml import html

data = 'foo<br /><br>bar. <p>foo<br/><br id="1"><br/>bar'  # same input as in the question
doc = html.fromstring(data)
for br in doc.findall('.//br'):
    if br.tail is None:  # no text immediately after <br> tag
        for dup in br.itersiblings():
            if dup.tag != 'br':  # don't merge if there is another tag in between
                break
            dup.drop_tag()
            if dup.tail is not None:  # don't merge if there is text in between
                break
print(html.tostring(doc, encoding='unicode'))
# -> <div><p>foo<br>bar. </p><p>foo<br>bar</p></div>
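The same idea can probably be expressed with a single XPath query instead of nested loops. The following is only a sketch under a few assumptions: the helper name merge_consecutive_br and the constant DUPLICATE_BR are made up for illustration, the sample input is wrapped in an explicit <div> so the result does not depend on how the parser wraps bare text, and, like the code above, any text between two <br> tags (even whitespace only) counts as a separator.

```python
from lxml import html

# Every <br> whose nearest preceding sibling node is itself a <br>.
# node() also matches text nodes, so text between two <br> tags prevents a match.
DUPLICATE_BR = './/br[preceding-sibling::node()[1][self::br]]'


def merge_consecutive_br(root):
    """Collapse each run of adjacent <br> elements below root to its first <br>."""
    # xpath() returns a plain list, so dropping elements while looping is safe.
    for br in root.xpath(DUPLICATE_BR):
        # drop_tree() comes from lxml.html elements; it removes the element but
        # keeps its tail text by joining it to the previous sibling or the parent.
        br.drop_tree()
    return root


doc = html.fromstring('<div>foo<br/><br>bar<br/>baz<br/><br id="1"><br/>qux</div>')
merge_consecutive_br(doc)
print(html.tostring(doc, encoding='unicode'))
# -> <div>foo<br>bar<br>baz<br>qux</div>
```

drop_tree() is used here rather than parent.remove(br) because in lxml the tail text belongs to the element, so a plain remove() would also discard the text that follows the dropped <br>.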
Source: https://stackoverflow.com/questions/20779061/merge-multiple-br-tags-to-a-single-one-with-python-lxml