Converting html to text with Python

一生所求 2020-12-12 17:49

I am trying to convert an html block to text using Python.

Input:

9 answers
  •  半阙折子戏
    2020-12-12 18:34

    I liked @FrBrGeorge's no-dependency answer so much that I expanded it to extract only the contents of the body tag and added a convenience method, so that converting HTML to text is a single line:

    from abc import ABC
    from html.parser import HTMLParser
    
    
    class HTMLFilter(HTMLParser, ABC):
        """
        A simple no dependency HTML -> TEXT converter.
        Usage:
              str_output = HTMLFilter.convert_html_to_text(html_input)
        """
        def __init__(self, *args, **kwargs):
            self.text = ''
            self.in_body = False
            super().__init__(*args, **kwargs)
    
        def handle_starttag(self, tag: str, attrs):
            if tag.lower() == "body":
                self.in_body = True
    
        def handle_endtag(self, tag):
            if tag.lower() == "body":
                self.in_body = False
    
        def handle_data(self, data):
            if self.in_body:
                self.text += data
    
        @classmethod
        def convert_html_to_text(cls, html: str) -> str:
            f = cls()
            f.feed(html)
            return f.text.strip()
    

    See the class docstring for usage.

    This converts all of the text inside the body, which can include the contents of style and script tags. Further filtering could be achieved by extending the pattern shown for body, i.e. setting instance variables such as in_style or in_script.
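
One way the extra filtering could look, as a rough sketch rather than a definitive implementation: the class name `BodyTextFilter`, the `SKIPPED_TAGS` set, and the `skip_depth` counter are names made up for this example (a single counter stands in for separate in_style / in_script flags), and the sample HTML is invented.

```python
from html.parser import HTMLParser


class BodyTextFilter(HTMLParser):
    """Collect text inside <body>, skipping <script> and <style> content."""

    # Assumption for this sketch: only these two tags are filtered out.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.parts = []        # text fragments collected so far
        self.in_body = False   # True between <body> and </body>
        self.skip_depth = 0    # > 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names already lowercased.
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside the body and outside script/style.
        if self.in_body and self.skip_depth == 0:
            self.parts.append(data)

    @classmethod
    def convert_html_to_text(cls, html: str) -> str:
        parser = cls()
        parser.feed(html)
        parser.close()
        return "".join(parser.parts).strip()


# Invented sample input, for illustration only.
sample = (
    "<html><head><title>Ignored</title></head><body>"
    "<style>p { color: red; }</style>"
    "<p>Hello, <b>world</b>!</p>"
    "<script>console.log('hidden');</script>"
    "</body></html>"
)
print(BodyTextFilter.convert_html_to_text(sample))  # prints: Hello, world!
```

Collecting fragments in a list and joining them once at the end avoids repeated string concatenation; otherwise the structure mirrors the body-tag handling in the class above.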
