Extract text with line break in BeautifulSoup

Submitted by 折月煮酒 on 2020-02-25 04:07:58

Question


I'd like to extract text with BeautifulSoup while keeping the line breaks marked by the "br" tags.

html = '<td class="s4 softmerge" dir="ltr"><div class="softmerge-inner" style="width: 5524px; left: -1px;">But when he saw many of the Pharisees and Sadducees come to his baptism, he said unto them, <br/>O generation of vipers, who hath warned you to flee from the wrath to come?<br/>Bring forth therefore fruits meet for repentance:<br/>And think not to say within yourselves, We have Abraham to our father: for I say unto you, that God is able of these stones to raise up children unto Abraham.<br/>And now also the axe is laid unto the root of the trees: therefore every tree which bringeth not forth good fruit is hewn down, and cast into the fire.<br/>I indeed baptize you with water unto repentance. but he that cometh after me is mightier than I, whose shoes I am not worthy to bear: he shall baptize you with the Holy Ghost, and with fire:<br/>Whose fan is in his hand, and he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire.</div></td>'

I want to get the result as a string like this:

But when he saw many of the Pharisees and Sadducees come to his baptism, he said unto them,
O generation of vipers, who hath warned you to flee from the wrath to come?
Bring forth therefore fruits meet for repentance:
And think not to say within yourselves, We have Abraham to our father: for I say unto you, that God is able of these stones to raise up children unto Abraham.
And now also the axe is laid unto the root of the trees: therefore every tree which bringeth not forth good fruit is hewn down, and cast into the fire.
I indeed baptize you with water unto repentance. but he that cometh after me is mightier than I, whose shoes I am not worthy to bear: he shall baptize you with the Holy Ghost, and with fire:
Whose fan is in his hand, and he will throughly purge his floor, and gather his wheat into the garner; but he will burn up the chaff with unquenchable fire.

How can I write code to get this result?


Answer 1:


There are two ways to get this result:

  • Iterate over the children of the tag and keep the ones that are NavigableString instances,
  • Match each string inside the tag with a regular expression.

The code:

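import re
from bs4 import BeautifulSoup, NavigableString

# Approach 1: walk the div's children and print only the text nodes.
soup = BeautifulSoup(html, "lxml")
for ele in soup.find("div", class_="softmerge-inner"):
    if isinstance(ele, NavigableString):
        print(ele)
print()

# Approach 2: capture the text that follows the opening <div> or each <br/>.
result = [ele[1] for ele in re.findall(r"""(<div.*?>|<br.>)(.*?)(?=<\w{1,4}/>|</\w{1,4}>)""", html)]
for e in result:
    print(e)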



Answer 2:


Sorry if this isn't what you're looking for, but you can try str.replace or a regex.

For example, you can compile a regex that finds all <br> tags and use it to replace them with newlines (\n).

If you're starting from a BeautifulSoup object, you need the element's markup as a string first, for example html = str(soupelement). Note that soupelement.string returns only text, and is None when the tag has more than one child, as it does here.

import re
regex = re.compile(r"<br/?>", re.IGNORECASE) # the filter: matches <br> tags with or without a slash
html = 'blah blah b<br>lah <br/> bl<br/>'
newtext = re.sub(regex, '\n', html) # replaces each match with a newline
print(newtext)
# newtext is now 'blah blah b\nlah \n bl\n'
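
The same substitution can also be done on the parsed tree rather than on the raw string. The following is a minimal sketch, not taken from the answers above: it assumes a shortened stand-in for the question's markup and the built-in "html.parser", and it swaps each <br> element for a newline with replace_with before reading the text.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for illustration).
html = '<div class="softmerge-inner">first line, <br/>second line<br/>third line</div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="softmerge-inner")

# Replace every <br> element in the div with a literal newline text node.
for br in div.find_all("br"):
    br.replace_with("\n")

# get_text() concatenates all text nodes, including the inserted newlines.
text = div.get_text()
print(text)
```

Because the tree is modified in place, parse the markup again if the original <br> elements are still needed afterwards.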



Answer 3:


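You can try this:

from bs4 import BeautifulSoup

html = '''<p>Hi</p>
<p>how are you </p>
'''
# The html5lib parser must be installed separately.
soup = BeautifulSoup(html, 'html5lib')

# get_text() returns all text in the document as one string.
print(soup.get_text())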
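
For text separated by <br> tags, as in the question, the text getter's separator argument may be a closer fit. The following is a hedged sketch, not part of the original answer: it assumes a shortened stand-in for the question's markup and the built-in "html.parser". It joins the text nodes with "\n", and strip=True trims the whitespace around each piece.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for illustration).
html = '<div class="softmerge-inner">first line, <br/>second line<br/>third line</div>'

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="softmerge-inner")

# Join the div's text nodes with a newline; strip=True trims each node first.
text = div.get_text(separator="\n", strip=True)
print(text)
```

Unlike the replacement approaches, this inserts the separator between all text nodes, so text split by other inline tags (for example <b>) would also end up on separate lines.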


Source: https://stackoverflow.com/questions/53145500/extract-text-with-line-break-in-beautifulsoup
