I am trying to convert an HTML block to text using Python.
Input:
You can use a regular expression, but it's not recommended. The following code removes all the HTML tags in your data, giving you the text:
import re
data = """<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>"""
data = re.sub(r'<.*?>', '', data)
print(data)
Output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
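To illustrate why the regular expression is "not recommended", here is a small sketch of two cases where it falls short. The helper name strip_tags and the sample strings are made up for this illustration; only re and html.unescape from the standard library are used.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>')


def strip_tags(markup):
    """Remove anything that looks like a tag (hypothetical helper for this sketch)."""
    return TAG_RE.sub('', markup)


# Case 1: a '>' inside a quoted attribute value ends the match too early,
# so part of the attribute leaks into the text.
print(repr(strip_tags('<p title="a > b">x</p>')))

# Case 2: character entities are not decoded by tag stripping alone;
# html.unescape() can be applied afterwards.
stripped = strip_tags('<p>Tom &amp; Jerry</p>')
print(stripped)
print(html.unescape(stripped))
```

For simple, well-formed snippets like the one in the question the regex is usually good enough; a real parser is the safer choice for arbitrary pages.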
gazpacho might be a good choice for this!
Input:
from gazpacho import Soup
html = """\
<div class="body"><p><strong></strong></p>
<p><strong></strong>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. <a href="http://example.com/" target="_blank" class="source">Some Link</a> Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p>
<p>Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa</p></div>
"""
text = Soup(html).strip(whitespace=False) # to keep "\n" characters intact
print(text)
Output:
Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Some Link Aenean commodo ligula eget dolor. Aenean massa Aenean massa.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa Consectetuer adipiscing elit. Aenean commodo ligula eget dolor. Aenean massa
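The '\n' argument places a newline between the paragraphs.
from bs4 import BeautifulSoup
# Naming the parser explicitly avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(text, 'html.parser')
print(soup.get_text('\n'))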
I liked @FrBrGeorge's no-dependency answer so much that I expanded it to only extract the body tag and added a convenience method so that HTML to text is a single line:
from abc import ABC
from html.parser import HTMLParser
class HTMLFilter(HTMLParser, ABC):
    """
    A simple no dependency HTML -> TEXT converter.
    Usage:
        str_output = HTMLFilter.convert_html_to_text(html_input)
    """
    def __init__(self, *args, **kwargs):
        self.text = ''
        self.in_body = False
        super().__init__(*args, **kwargs)

    def handle_starttag(self, tag: str, attrs):
        if tag.lower() == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag.lower() == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        f = cls()
        f.feed(html)
        return f.text.strip()
See the docstring for usage.
This converts all of the text inside the body, which in theory could include style and script tags. Further filtering could be achieved by extending the pattern shown for body, i.e. setting instance variables in_style or in_script.
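The in_style / in_script idea mentioned above could be sketched roughly like this. This is only one possible way to do it: the class name BodyTextFilter, the SKIPPED_TAGS set and the skip_depth counter are names invented for this example, and a single counter stands in for separate in_style and in_script flags.

```python
from html.parser import HTMLParser


class BodyTextFilter(HTMLParser):
    """Hypothetical variant of HTMLFilter that also skips <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.text = ""
        self.in_body = False
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only inside <body> and outside <script>/<style>.
        if self.in_body and not self.skip_depth:
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        parser = cls()
        parser.feed(html)
        parser.close()
        return parser.text.strip()


page = (
    "<html><head><title>T</title></head><body><p>Hello</p>"
    "<script>var x = 1;</script><style>p {color: red}</style>"
    "<p>World</p></body></html>"
)
print(BodyTextFilter.convert_html_to_text(page))
```

Note that, like the class above, this returns an empty string for a fragment that has no body tag at all, such as the div block in the question.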
It's possible using Python's standard html.parser module:
from html.parser import HTMLParser
class HTMLFilter(HTMLParser):
    text = ""

    def handle_data(self, data):
        self.text += data

f = HTMLFilter()
f.feed(data)
print(f.text)
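The filter above simply concatenates the text nodes, so any paragraph breaks come only from whitespace that was already in the source. If line breaks at paragraph boundaries matter, one possible extension is sketched below. LineBreakFilter, BLOCK_TAGS and html_to_lines are hypothetical names, and the set of block-level tags is an arbitrary choice for this example.

```python
from html.parser import HTMLParser


class LineBreakFilter(HTMLParser):
    """Hypothetical variant: collects text and adds a newline after block-level tags."""

    # Assumed set of tags that should end a line; adjust as needed.
    BLOCK_TAGS = {"p", "div", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # <br> has no end tag, so the break is emitted on the start tag.
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def get_text(self):
        return "".join(self.parts)


def html_to_lines(markup):
    """Return the non-empty text lines of an HTML fragment."""
    parser = LineBreakFilter()
    parser.feed(markup)
    parser.close()
    return [line.strip() for line in parser.get_text().splitlines() if line.strip()]


sample = "<div><p>One</p><p>Two <a href='#'>link</a></p>Three<br>Four</div>"
print("\n".join(html_to_lines(sample)))
```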
I was in need of a way of doing this on a client's system without having to download additional libraries. I never found a good solution, so I created my own. Feel free to use this if you like.
from urllib.request import urlopen  # Python 3 location of urlopen

def html2text(strText):
    str1 = strText
    # Keep only the part between <body ...> and </body>, if present
    int2 = str1.lower().find("<body")
    if int2>0:
        str1 = str1[int2:]
    int2 = str1.lower().find("</body>")
    if int2>0:
        str1 = str1[:int2]
    # Tag fragments in list1 are replaced by the matching character in list2
    list1 = ['<br>', '<tr', '<td', '</p>', 'span>', 'li>', '</h', 'div>' ]
    list2 = [chr(13), chr(13), chr(9), chr(13), chr(13), chr(13), chr(13), chr(13)]
    bolFlag1 = True   # False while inside <script>, <noscript> or <style>
    bolFlag2 = True   # False while inside a tag, i.e. between '<' and '>'
    strReturn = ""
    for int1 in range(len(str1)):
        str2 = str1[int1]
        for int2 in range(len(list1)):
            if str1[int1:int1+len(list1[int2])].lower() == list1[int2]:
                strReturn = strReturn + list2[int2]
        if str1[int1:int1+7].lower() == '<script' or str1[int1:int1+9].lower() == '<noscript':
            bolFlag1 = False
        if str1[int1:int1+6].lower() == '<style':
            bolFlag1 = False
        if str1[int1:int1+7].lower() == '</style':
            bolFlag1 = True
        if str1[int1:int1+9].lower() == '</script>' or str1[int1:int1+11].lower() == '</noscript>':
            bolFlag1 = True
        if str2 == '<':
            bolFlag2 = False
        if bolFlag1 and bolFlag2 and (ord(str2) != 10) :
            strReturn = strReturn + str2
        if str2 == '>':
            bolFlag2 = True
        if bolFlag1 and bolFlag2:
            # Collapse whitespace around line breaks and repeated line breaks
            strReturn = strReturn.replace(chr(32)+chr(13), chr(13))
            strReturn = strReturn.replace(chr(9)+chr(13), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(32), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(9), chr(13))
            strReturn = strReturn.replace(chr(13)+chr(13), chr(13))
    strReturn = strReturn.replace(chr(13), '\n')
    return strReturn

url = "http://www.theguardian.com/world/2014/sep/25/us-air-strikes-islamic-state-oil-isis"
# read() returns bytes in Python 3; UTF-8 is assumed here
html = urlopen(url).read().decode("utf-8", errors="replace")
print(html2text(html))