I'm learning BeautifulSoup, and found many "html2text" solutions, but the one I'm looking for should mimic the formatting:
- One
Python's built-in html.parser module (named HTMLParser in Python 2) can be easily extended to create a simple translator that you can tailor to your exact needs. It lets you hook into certain events as the parser eats through the HTML.
Due to its simple nature you can't navigate around the HTML tree like you could with Beautiful Soup (e.g. sibling, child and parent nodes), but for a simple case like yours it should be enough.
html.parser homepage
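To make the "hook into events" idea concrete, here is a minimal sketch that just records each callback in the order the parser fires it. The class name EventLogger and the sample markup are made up for illustration; only HTMLParser and its handle_* methods come from the standard library.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every parser callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag, e.g. <li>
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for every closing tag, e.g. </li>
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text found between tags
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<ul><li>One</li><li>Two</li></ul>")
logger.close()  # tell the parser there is no more input
for event in logger.events:
    print(event)
```

Each callback only sees one tag or one piece of text at a time, which is why there is no way to ask for a parent or sibling node from inside a handler.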
In your case you could use it like this, adding the appropriate formatting whenever a start tag or end tag of a specific type is encountered:
from html.parser import HTMLParser
from os import linesep

class MyHTMLParser(HTMLParser):
    def __init__(self):
        # The strict argument was removed in Python 3.5; the defaults are enough.
        super().__init__()

    def feed(self, in_html):
        # Start from an empty buffer on each call and return the converted text.
        self.output = ""
        super().feed(in_html)
        return self.output

    def handle_data(self, data):
        # Text between tags, with surrounding whitespace trimmed.
        self.output += data.strip()

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            # Each list item starts on a new line with a bullet.
            self.output += linesep + '* '
        elif tag == 'blockquote':
            # Blockquotes are set off by a blank line and a tab.
            self.output += linesep + linesep + '\t'

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep

parser = MyHTMLParser()
content = "<ul><li>One</li><li>Two</li></ul>"
print(linesep + "Example 1:")
print(parser.feed(content))
content = "Some text<blockquote>More magnificent text here</blockquote>Final text"
print(linesep + "Example 2:")
print(parser.feed(content))
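If you end up handling more than a couple of tags, the if/elif chains can be swapped for lookup tables. The following is only a sketch of that idea, not part of the answer's code: START_TEXT, END_TEXT, TableDrivenParser and convert are hypothetical names, the "- " bullet is chosen to match the list style in the question, and it assumes plain "\n" is acceptable instead of os.linesep (print already translates "\n" on output).

```python
from html.parser import HTMLParser

# Hypothetical tables: text to emit when a given tag opens or closes.
START_TEXT = {"li": "\n- ", "blockquote": "\n\n\t"}
END_TEXT = {"blockquote": "\n\n"}


class TableDrivenParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tags without an entry contribute nothing to the output.
        self.parts.append(START_TEXT.get(tag, ""))

    def handle_endtag(self, tag):
        self.parts.append(END_TEXT.get(tag, ""))

    def handle_data(self, data):
        self.parts.append(data.strip())

    def convert(self, in_html):
        # Reset parser state so one instance can be reused across documents.
        self.reset()
        self.parts = []
        self.feed(in_html)
        self.close()  # process any input the parser is still holding
        return "".join(self.parts)


converter = TableDrivenParser()
list_text = converter.convert("<ul><li>One</li><li>Two</li></ul>")
quote_text = converter.convert(
    "Some text<blockquote>More magnificent text here</blockquote>Final text"
)
print(list_text)
print(quote_text)
```

Supporting another tag is then a matter of adding an entry to one of the two dictionaries. Note that data.strip() also removes the spaces around inline elements such as links, so this is only suitable for simple markup.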