Make BeautifulSoup handle line breaks as a browser would

Posted by 亡梦爱人 on 2020-04-07 02:58:25

Question


I'm using BeautifulSoup (version '4.3.2' with Python 3.4) to convert HTML documents to text. The problem I'm having is that web pages sometimes contain newline characters "\n" that a browser would not render as a line break, but when BeautifulSoup converts the page to text, it leaves the "\n" in.

Example:

Your browser probably renders the following all on one line (even though there is a newline character in the middle):

This is a paragraph.

And your browser probably renders the following on multiple lines, even though I'm entering it with no newlines:

This is a paragraph.

This is another paragraph.

But when BeautifulSoup converts the same strings to text, the only line breaks it uses are the newline literals, and it always uses them:

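from bs4 import BeautifulSoup

# A newline inside a paragraph: a browser renders it as a single space.
doc = "<p>This is a\nparagraph.</p>"
soup = BeautifulSoup(doc, "html.parser")

soup.text
Out[181]: 'This is a\nparagraph.'

# Two paragraphs with no newline in the source: a browser shows each on its own line.
doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc, "html.parser")

soup.text
Out[187]: 'This is a paragraph.This is another paragraph.'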

Does anyone know how to make BeautifulSoup extract text in a more beautiful way (or really just get all the newlines correct)? Are there any other simple ways around the problem?


Answer 1:


get_text might be helpful here:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc, "html.parser")
>>> soup.get_text(separator="\n")
'This is a paragraph.\nThis is another paragraph.'
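Note that `separator="\n"` puts a newline between every pair of text nodes, so inline tags such as `<b>` also produce line breaks, and newlines already present in the source are kept. The following is a minimal sketch of one way to get closer to browser behaviour: mark the places where a browser would break a line, collapse all source whitespace, then turn the marks into newlines. The names `browser_like_text`, `BLOCK_TAGS` and `MARK` are hypothetical, the list of block-level tags is an assumption and is not complete, and `<pre>`, CSS and script/style content are not handled.

```python
import re

from bs4 import BeautifulSoup

# Assumption: a small, incomplete set of tags that browsers lay out as blocks.
BLOCK_TAGS = ["p", "div", "li", "tr", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"]

# Sentinel for "a browser would break the line here". NUL is not matched by \s,
# so it survives the whitespace-collapsing step below.
MARK = "\x00"


def browser_like_text(html):
    """Extract text, breaking lines only at <br> and block-level boundaries."""
    soup = BeautifulSoup(html, "html.parser")

    # An explicit line break becomes a mark.
    for br in soup.find_all("br"):
        br.replace_with(MARK)

    # Block-level elements start and end on their own line.
    for tag in soup.find_all(BLOCK_TAGS):
        tag.insert_before(MARK)
        tag.insert_after(MARK)

    text = soup.get_text()
    # Collapse every run of source whitespace (including "\n") into one space.
    text = re.sub(r"\s+", " ", text)
    # Turn each run of marks, plus any spaces around it, into a single newline.
    text = re.sub(r" ?\x00[ \x00]*", "\n", text)
    return text.strip()
```

Applied to the two documents from the question:

```python
browser_like_text("<p>This is a\nparagraph.</p>")
# 'This is a paragraph.'
browser_like_text("<p>This is a paragraph.</p><p>This is another paragraph.</p>")
# 'This is a paragraph.\nThis is another paragraph.'
```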



Answer 2:


I would take a look at python-markdownify. It turns HTML into fairly readable text in Markdown format.

It is available on PyPI: https://pypi.python.org/pypi/markdownify/0.4.0

and on GitHub: https://github.com/matthewwithanm/python-markdownify



Source: https://stackoverflow.com/questions/30337528/make-beautifulsoup-handle-line-breaks-as-a-browser-would
