Make BeautifulSoup handle line breaks as a browser would

Question

I'm using BeautifulSoup (version 4.3.2 with Python 3.4) to convert HTML documents to text. The problem is that web pages sometimes contain newline characters ("\n") that a browser would not render as a line break, but when BeautifulSoup converts the page to text, it leaves the "\n" in.

Example:

Your browser probably renders the following all on one line (even though the source has a newline character in the middle):

This is a paragraph.

And your browser probably renders the following on multiple lines, even though I'm entering it with no newlines:

This is a paragraph.

This is another paragraph.

But when BeautifulSoup converts the same strings to text, the only line breaks it uses are the newline literals, and it always uses them:

from bs4 import BeautifulSoup

doc = "<p>This is a\nparagraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[181]: 'This is a\nparagraph.'

doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
soup = BeautifulSoup(doc)

soup.text
Out[187]: 'This is a paragraph.This is another paragraph.'

Does anyone know how to make BeautifulSoup extract text in a more beautiful way (or really just get all the newlines correct)? Are there any other simple ways around the problem?

Answer 1:

get_text might be helpful here:

>>> from bs4 import BeautifulSoup
>>> doc = "<p>This is a paragraph.</p><p>This is another paragraph.</p>"
>>> soup = BeautifulSoup(doc)
>>> soup.get_text(separator="\n")
u'This is a paragraph.\nThis is another paragraph.'
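Note that separator="\n" only puts a newline between text nodes; it does not remove a "\n" that sits inside a paragraph, and it also splits text around inline tags such as <b>. A rough sketch of a more browser-like conversion is below. It is not part of BeautifulSoup: the helper name browser_like_text and the BLOCK_TAGS set are my own assumptions, and it only approximates real rendering (no CSS, no special handling of <pre>, and no blank line between paragraphs).

```python
import re

from bs4 import BeautifulSoup
from bs4.element import Comment, Doctype, NavigableString

# Hypothetical, incomplete list of tags treated as block-level.
BLOCK_TAGS = {
    "p", "div", "li", "ul", "ol", "tr", "table", "blockquote",
    "h1", "h2", "h3", "h4", "h5", "h6", "section", "article",
}


def browser_like_text(html):
    """Approximate how a browser lays out text (a sketch, not bs4 API)."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []

    def walk(node):
        for child in node.children:
            if isinstance(child, NavigableString):
                # Skip comments and doctype declarations.
                if isinstance(child, (Comment, Doctype)):
                    continue
                # Collapse any run of whitespace (including "\n") to one space.
                parts.append(re.sub(r"\s+", " ", str(child)))
            elif child.name in ("script", "style"):
                continue  # never rendered as text
            elif child.name == "br":
                parts.append("\n")
            else:
                is_block = child.name in BLOCK_TAGS
                if is_block:
                    parts.append("\n")
                walk(child)
                if is_block:
                    parts.append("\n")

    walk(soup)
    # Tidy up: squeeze repeated spaces, trim each line, drop empty lines.
    lines = (re.sub(r" +", " ", line).strip() for line in "".join(parts).split("\n"))
    return "\n".join(line for line in lines if line)


print(browser_like_text("<p>This is a\nparagraph.</p>"))
print(browser_like_text("<p>This is a paragraph.</p><p>This is another paragraph.</p>"))
```

With the two documents from the question, the first call prints a single line and the second prints the two paragraphs on separate lines.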

Answer 2:

I would take a look at python-markdownify. It turns HTML into quite readable text in Markdown format.

It is available on PyPI: https://pypi.python.org/pypi/markdownify/0.4.0

and on GitHub: https://github.com/matthewwithanm/python-markdownify

Source: https://stackoverflow.com/questions/30337528/make-beautifulsoup-handle-line-breaks-as-a-browser-would

Tags

python

html

beautifulsoup