I am trying to convert an html block to text using Python.
Input:
I liked @FrBrGeorge's no-dependency answer so much that I expanded it to extract only the text inside the body tag, and added a convenience method so that converting HTML to text is a single line:
from abc import ABC
from html.parser import HTMLParser


class HTMLFilter(HTMLParser, ABC):
    """
    A simple no dependency HTML -> TEXT converter.
    Usage:
          str_output = HTMLFilter.convert_html_to_text(html_input)
    """
    def __init__(self, *args, **kwargs):
        self.text = ''
        self.in_body = False
        super().__init__(*args, **kwargs)

    def handle_starttag(self, tag: str, attrs):
        if tag.lower() == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag.lower() == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        f = cls()
        f.feed(html)
        return f.text.strip()
See the class docstring for usage.
This converts all of the text inside the body, which can include the contents of style and script tags. Further filtering could be achieved by extending the pattern shown for body, i.e. setting instance variables such as in_style or in_script and skipping data while they are set.
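As a rough sketch of that extension (not part of the original answer): the class name BodyTextFilter and the sample string are made up for illustration, and the in_script / in_style flags follow the same pattern as in_body. It assumes script and style tags are not nested, which holds for ordinary HTML.

```python
from html.parser import HTMLParser


class BodyTextFilter(HTMLParser):
    """Hypothetical variant: body text only, minus <script>/<style> content."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.text = ''
        self.in_body = False
        self.in_script = False
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lowercases tag names before calling this.
        if tag == "body":
            self.in_body = True
        elif tag == "script":
            self.in_script = True
        elif tag == "style":
            self.in_style = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag == "script":
            self.in_script = False
        elif tag == "style":
            self.in_style = False

    def handle_data(self, data):
        # Keep data only when inside <body> and outside <script>/<style>.
        if self.in_body and not (self.in_script or self.in_style):
            self.text += data

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        parser = cls()
        parser.feed(html)
        parser.close()  # flush any buffered trailing data
        return parser.text.strip()


# Made-up sample input for illustration.
sample = (
    "<html><head><title>ignored</title></head><body>"
    "<style>p {color: red}</style><p>Hello</p>"
    "<script>var x = 1;</script> world</body></html>"
)
print(BodyTextFilter.convert_html_to_text(sample))  # prints: Hello world
```

Whitespace between elements is kept as-is, so block-level tags such as p or div do not introduce line breaks in the output.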