I'm learning BeautifulSoup and found many "html2text" solutions, but the one I'm looking for should mimic the formatting:
- One
While using samaspin's solution, I found that if the input contains non-English Unicode characters, the parser stops working and just returns an empty string. Initialising the parser on each loop iteration ensures that even if a parser object gets corrupted, subsequent parses do not return an empty string. This also adds handling of the tag to samaspin's solution.
Since the goal is to process the HTML rather than merely strip the tags, further tags can be supported by adding them, together with their expected output, to the handle_starttag function.
from html.parser import HTMLParser
from os import linesep


class MyHTMLParser(HTMLParser):
    """
    This class is used to clean the HTML tags whilst ensuring the
    format is maintained. Whitespace, newlines, line breaks, etc. are
    converted from HTML tags to their respective counterparts in Python.
    """

    def __init__(self):
        HTMLParser.__init__(self)
        self.output = ""

    def feed(self, in_html):
        # Reset the buffer on every call, then return the collected text.
        self.output = ""
        super(MyHTMLParser, self).feed(in_html)
        return self.output

    def handle_data(self, data):
        # Note: strip() also removes spaces next to inline tags.
        self.output += data.strip()

    def handle_starttag(self, tag, attrs):
        # Add further tags and their expected output here.
        if tag == 'li':
            self.output += linesep + '* '
        elif tag == 'blockquote':
            self.output += linesep + linesep + '\t'
        elif tag == 'br':
            self.output += linesep + '\n'

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep


parser = MyHTMLParser()
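
To illustrate the two points made in the text (creating a fresh parser for each input, and supporting more tags through handle_starttag), here is a minimal sketch. The class is repeated in condensed form so the snippet runs on its own. The names html_to_text and ParagraphAwareParser, and the choice of a blank line for the p tag, are my own assumptions for illustration and are not part of samaspin's solution.

```python
from html.parser import HTMLParser
from os import linesep


class MyHTMLParser(HTMLParser):
    """Condensed copy of the parser above, repeated so this sketch is self-contained."""

    def __init__(self):
        super().__init__()
        self.output = ""

    def feed(self, in_html):
        self.output = ""
        super().feed(in_html)
        return self.output

    def handle_data(self, data):
        self.output += data.strip()

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.output += linesep + '* '
        elif tag == 'blockquote':
            self.output += linesep + linesep + '\t'
        elif tag == 'br':
            self.output += linesep + '\n'

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep


class ParagraphAwareParser(MyHTMLParser):
    """Hypothetical extension: one more tag handled in handle_starttag."""

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            # Assumed output for a paragraph: a blank line before its text.
            self.output += linesep + linesep
        else:
            super().handle_starttag(tag, attrs)


def html_to_text(in_html, parser_cls=MyHTMLParser):
    """Hypothetical helper: build a new parser for every input string."""
    parser = parser_cls()      # fresh instance, no state carried over
    parser.feed(in_html)
    parser.close()             # flush anything still buffered
    return parser.output


snippets = [
    "<ul><li>One</li><li>Two</li></ul>",
    "<ul><li>Ünïcödé</li></ul>",
]
for snippet in snippets:
    # Each iteration gets its own parser, so one bad input cannot
    # affect the result for the next one.
    print(html_to_text(snippet))
```

Passing the parser class as an argument keeps the per-input instantiation in one place, so switching to a subclass that knows more tags does not change the calling loop.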