I am trying to convert an HTML block to text using Python.
Input:
You can use BeautifulSoup to strip unwanted elements such as scripts, though you may need to test against a few different sites to make sure you've covered everything you want to exclude. Try this:
from requests import get
from bs4 import BeautifulSoup as BS

response = get('http://news.bbc.co.uk/2/hi/health/2284783.stm')
soup = BS(response.content, "html.parser")
# find_all() returns a list, so nodes can be removed safely while looping;
# it also finds scripts nested deeper than the direct children of <body>
for script in soup.body.find_all('script'):
    script.decompose()
print(soup.body.get_text())
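To experiment without hitting a live site, the same idea can be wrapped in a small function and run on an inline string. This is only a sketch: the function name `visible_text` and the default list of tags to drop are my own assumptions, so adjust them for the pages you care about.

```python
from bs4 import BeautifulSoup

def visible_text(html, unwanted=("script", "style", "noscript")):
    """Return the text of `html` with the given tags removed entirely.

    `visible_text` and the default `unwanted` tuple are illustrative
    choices, not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all() accepts a list of tag names and returns a list of matches
    for tag in soup.find_all(list(unwanted)):
        tag.decompose()
    return soup.get_text()

sample = (
    "<html><body><p>Hi</p>"
    "<div><script>var x = 1;</script><p>There</p></div>"
    "<style>p { color: red; }</style></body></html>"
)
print(visible_text(sample))
```

Passing a different `unwanted` tuple (for example adding "nav" or "footer") is one way to tune this per site.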
soup.get_text() outputs what you want:
from bs4 import BeautifulSoup
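# html is the HTML string from the question; naming the parser explicitly
# avoids a warning and keeps results consistent across machines
soup = BeautifulSoup(html, 'html.parser')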
print(soup.get_text())
Output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa
Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
To keep newlines:
print(soup.get_text('\n'))
To be identical to your example, you can replace a newline with two newlines:
soup.get_text().replace('\n','\n\n')
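If the source HTML already contains blank lines between tags, doubling every newline can leave long runs of empty lines. One alternative, sketched below, is to let get_text() strip each text fragment and join the fragments with a blank line. The function name `text_with_blank_lines` and the sample string are made up for illustration.

```python
from bs4 import BeautifulSoup

def text_with_blank_lines(html: str) -> str:
    """Join the non-empty text fragments of `html` with one blank line each."""
    soup = BeautifulSoup(html, "html.parser")
    # strip=True trims each fragment and drops whitespace-only ones;
    # separator is placed between the fragments that remain
    return soup.get_text(separator="\n\n", strip=True)

sample = "<p>Hello <b>world</b></p>\n\n<p>Bye</p>"
print(text_with_blank_lines(sample))
```

Note that the separator is inserted between every text fragment, so text inside inline tags such as the `<b>` above also ends up on its own line. Whether that is acceptable depends on your input.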
There are some nice answers here, and I might as well throw in my solution, which uses only the standard library:
from html.parser import HTMLParser

# Note: this patches handle_data on HTMLParser itself,
# so every HTMLParser instance in the process is affected
def _handle_data(self, data):
    self.text += data + '\n'

HTMLParser.handle_data = _handle_data

def get_html_text(html: str):
    parser = HTMLParser()
    parser.text = ''
    parser.feed(html)
    return parser.text.strip()
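If patching HTMLParser globally is a concern, a subclass gives a similar result while leaving the base class untouched. The following is a rough sketch rather than a drop-in replacement: the names `TextExtractor` and `html_to_text` are my own, and unlike the snippet above it also skips the contents of script and style tags and drops whitespace-only fragments.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []        # stripped, non-empty text fragments
        self._skip_depth = 0   # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.parts.append(text)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "\n".join(parser.parts)

print(html_to_text("<p>Hello <b>world</b></p><script>var x = 1;</script><p>Bye</p>"))
```

Keeping the state on the instance means several parsers can run side by side without interfering with each other.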