I have to work with the messiest HTML where individual words are split into separate tags, like in the following example:
The solution below combines text from all the selected tags into one of your choice and decomposes the others.
If you only want to merge the text from consecutive tags follow Danny's approach.
Code:
from bs4 import BeautifulSoup
html = '''
I
NTRODUCTION
'''
soup = BeautifulSoup(html, 'lxml')
container = soup.select_one('#wrapper') # it contains b tags to combine
b_tags = container.find_all('b')
# combine all the text from b tags
text = ''.join(b.get_text(strip=True) for b in b_tags)
# here you choose a tag you want to preserve and update its text
b_main = b_tags[0] # you can target it however you want, I just take the first one from the list
b_main.span.string = text # replace the text
for tag in b_tags:
if tag is not b_main:
tag.decompose()
print(soup)
Any comments appreciated.