Python convert html to text and mimic formatting

挽巷 2020-12-31 13:56

I'm learning BeautifulSoup, and found many "html2text" solutions, but the one I'm looking for should mimic the formatting:

  • One
4 answers
  •  情深已故
    2020-12-31 14:31

    When using samaspin's solution, the parser stops working and returns an empty string if the input contains non-English Unicode characters. Initialising a new parser on each loop iteration ensures that even if one parser object gets corrupted, it does not return an empty string for subsequent inputs. In addition to samaspin's solution, this version also handles the <br> tag. Because the HTML is processed by a parser rather than having its tags simply stripped, further tags can be supported by adding them, together with their expected output, to the function handle_starttag.

        from html.parser import HTMLParser
        from os import linesep


        class MyHTMLParser(HTMLParser):
            """
            This class is used to clean the HTML tags whilst ensuring the
            format is maintained. Therefore all the whitespace, newlines, line breaks, etc. are
            converted from HTML tags to their respective counterparts in Python.
            """

            def __init__(self):
                super().__init__()

            def feed(self, in_html):
                # Reset the buffer on every call, then return the collected text.
                self.output = ""
                super().feed(in_html)
                return self.output

            def handle_data(self, data):
                # Text between tags; surrounding whitespace is dropped.
                self.output += data.strip()

            def handle_starttag(self, tag, attrs):
                if tag == 'li':
                    # Each list item starts on a new line with a bullet.
                    self.output += linesep + '* '
                elif tag == 'blockquote':
                    # Blank line, then a tab to indent the quoted text.
                    self.output += linesep + linesep + '\t'
                elif tag == 'br':
                    self.output += linesep + '\n'

            def handle_endtag(self, tag):
                if tag == 'blockquote':
                    self.output += linesep + linesep


        # Create a new parser for each document to be converted.
        parser = MyHTMLParser()
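
The "one parser per input" idea described in this answer could be sketched roughly as below. This is only an illustration, not part of the original answer: the helper name html_to_text and the sample inputs are made up, and the class is a condensed copy of the parser above so that the snippet runs on its own under Python 3.

```python
from html.parser import HTMLParser
from os import linesep


class MyHTMLParser(HTMLParser):
    """Condensed copy of the parser above, repeated so this sketch is self-contained."""

    def feed(self, in_html):
        self.output = ""
        super().feed(in_html)
        return self.output

    def handle_data(self, data):
        self.output += data.strip()

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.output += linesep + '* '
        elif tag == 'blockquote':
            self.output += linesep + linesep + '\t'
        elif tag == 'br':
            self.output += linesep + '\n'

    def handle_endtag(self, tag):
        if tag == 'blockquote':
            self.output += linesep + linesep


def html_to_text(in_html):
    # Hypothetical helper: build a fresh parser for every document so that
    # no state is carried over from an earlier call.
    parser = MyHTMLParser()
    parser.feed(in_html)
    # close() flushes any text the parser may still be buffering.
    parser.close()
    return parser.output


# Made-up sample inputs; the second one contains non-English characters.
documents = [
    "<ul><li>One</li><li>Two</li></ul>",
    "<blockquote>Grüße, 你好</blockquote>",
    "first<br>second",
]
results = [html_to_text(doc) for doc in documents]
for text in results:
    print(repr(text))
```

Because each call builds its own MyHTMLParser, a failure while parsing one document cannot affect the output of the next. The cost is one extra object per document, which is usually negligible next to the parsing itself.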
    
