Python convert html to text and mimic formatting

挽巷 2020-12-31 13:56

I'm learning BeautifulSoup and have found many "html2text" solutions, but the one I'm looking for should mimic the formatting:

  • One
4 answers
  •  长情又很酷
    2020-12-31 14:30

    Python's built-in html.parser (HTMLParser in earlier versions) module can be easily extended to create a simple translator that you can tailor to your exact needs. It lets you hook into certain events as the parser eats through the HTML.
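
    To make the event hooks concrete, here is a minimal sketch that only records which callbacks fire and in what order. The class name EventLogger and its events list are made up for illustration; handle_starttag, handle_endtag and handle_data are the standard HTMLParser callbacks.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records every parser callback so the order of events is visible."""

    def __init__(self):
        super().__init__()
        self.events = []  # hypothetical attribute: (kind, value) tuples

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<ul><li>One</li><li>Two</li></ul>")
logger.close()
for event in logger.events:
    print(event)
```

    Each opening tag, closing tag and run of text arrives as a separate call, in document order, which is why formatting can be emitted as the input streams past.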

    Because it is so simple, you can't navigate around the HTML tree as you could with Beautiful Soup (e.g. sibling, child, and parent nodes), but for a simple case like yours it should be enough.

    html.parser homepage

    In your case you could use it like this, adding the appropriate formatting whenever a start tag or end tag of a specific type is encountered:

    from html.parser import HTMLParser
    from os import linesep
    
    class MyHTMLParser(HTMLParser):
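        def __init__(self):
            # The strict argument was removed in Python 3.5; passing it raises TypeError.
            super().__init__()
            self.output = ""
        def feed(self, in_html):
            # Reset the buffer so the same parser can be reused for several inputs.
            self.output = ""
            super().feed(in_html)
            return self.output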
        def handle_data(self, data):
            self.output += data.strip()
        def handle_starttag(self, tag, attrs):
            if tag == 'li':
                self.output += linesep + '* '
            elif tag == 'blockquote':
                self.output += linesep + linesep + '\t'
        def handle_endtag(self, tag):
            if tag == 'blockquote':
                self.output += linesep + linesep
    
    parser = MyHTMLParser()
    content = "<ul><li>One</li><li>Two</li></ul>"
    print(linesep + "Example 1:")
    print(parser.feed(content))
    content = "Some text<blockquote>More magnificent text here</blockquote>Final text"
    print(linesep + "Example 2:")
    print(parser.feed(content))
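
    Since the parser cannot look at parent or child nodes, any awareness of nesting has to be tracked by hand. The following is a rough sketch of one way to do that, assuming nested lists should be indented two spaces per level. NestedListParser, its depth counter and the html_to_text helper are hypothetical names, not part of html.parser.

```python
from html.parser import HTMLParser


class NestedListParser(HTMLParser):
    """Indents list items according to how many <ul>/<ol> tags are open."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # number of currently open list containers
        self.lines = []  # one entry per output line

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self.depth += 1
        elif tag == "li":
            # Start a new line, indented two spaces per nesting level.
            self.lines.append("  " * (self.depth - 1) + "* ")

    def handle_endtag(self, tag):
        if tag in ("ul", "ol"):
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = data.strip()
        if text and self.lines:
            # Append text to the line that is currently being built.
            self.lines[-1] += text
        elif text:
            self.lines.append(text)


def html_to_text(html):
    # Hypothetical convenience wrapper: a fresh parser for every call.
    parser = NestedListParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


print(html_to_text("<ul><li>One<ul><li>Nested</li></ul></li><li>Two</li></ul>"))
```

    A plain counter is enough here because only the nesting depth matters. If the formatting depended on which tags enclose the current one, a list used as a stack of open tag names would serve the same purpose. Note that this sketch attaches any text following a nested list to the last nested line, so it would need more state for such inputs.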
