Python: convert HTML to text and mimic formatting

挽巷 2020-12-31 13:56

I'm learning BeautifulSoup and have found many "html2text" solutions, but the one I'm looking for should mimic the formatting:

  • One
4 Answers
  •  不思量自难忘°
    2020-12-31 14:10

    I have code for a simpler task: removing HTML tags and inserting newlines at the appropriate places. Maybe this can be a starting point for you.

    Python's textwrap module might be helpful for creating indented blocks of text.

    http://docs.python.org/2/library/textwrap.html
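
To illustrate the textwrap suggestion, here is a minimal sketch (written for Python 3) of wrapping one list item with a bullet and a hanging indent. The helper name `indent_block` and its `width` and `bullet` parameters are hypothetical and only serve the example; `textwrap.fill` with `initial_indent` and `subsequent_indent` is the standard-library call doing the work.

```python
import textwrap


def indent_block(text, width=40, bullet="  • "):
    """Wrap text to `width` columns, bullet first, hanging indent after.

    `indent_block` is a hypothetical helper for illustration only.
    """
    return textwrap.fill(
        text,
        width=width,
        initial_indent=bullet,               # prefix for the first line
        subsequent_indent=" " * len(bullet),  # align continuation lines
    )


item = "One list item whose text is long enough that it has to wrap onto a second line."
print(indent_block(item))
```

Nested lists could probably be handled the same way by passing a longer `bullet` prefix for each nesting level.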

    import re
    import HTMLParser  # Python 2 standard library module

    class HtmlTool(object):
        """
        Algorithms to process HTML.
        """
        # Regular expressions to recognize different parts of HTML.
        # Internal style sheets or JavaScript
        script_sheet = re.compile(r"<(script|style).*?>.*?(</\1>)",
                                  re.IGNORECASE | re.DOTALL)
        # HTML comments - can contain ">"
        comment = re.compile(r"<!--(.*?)-->", re.DOTALL)
        # HTML tags: <any-text>
        tag = re.compile(r"<.*?>", re.DOTALL)
        # Consecutive whitespace characters
        nwhites = re.compile(r"[\s]+")
        # <p>, <div>, <br> tags and associated closing tags
        p_div = re.compile(r"</?(p|div|br).*?>",
                           re.IGNORECASE | re.DOTALL)
        # Consecutive whitespace, but no newlines
        nspace = re.compile(r"[^\S\n]+", re.UNICODE)
        # At least two consecutive newlines
        n2ret = re.compile(r"\n\n+")
        # A return followed by a space
        retspace = re.compile(r"(\n )")

        # For converting HTML entities to unicode
        html_parser = HTMLParser.HTMLParser()

        @staticmethod
        def to_nice_text(html):
            """Remove all HTML tags, but produce a nicely formatted text."""
            if html is None:
                return u""
            text = unicode(html)
            text = HtmlTool.script_sheet.sub("", text)
            text = HtmlTool.comment.sub("", text)
            text = HtmlTool.nwhites.sub(" ", text)
            text = HtmlTool.p_div.sub("\n", text)  # convert <p>, <div>, <br> to "\n"
            text = HtmlTool.tag.sub("", text)      # remove all tags
            text = HtmlTool.html_parser.unescape(text)
            # Get whitespace right
            text = HtmlTool.nspace.sub(" ", text)
            text = HtmlTool.retspace.sub("\n", text)
            text = HtmlTool.n2ret.sub("\n\n", text)
            text = text.strip()
            return text

    There might be some superfluous regexes left in the code.
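
The class above targets Python 2 (`unicode`, `HTMLParser.HTMLParser().unescape`). For readers on Python 3, the same pipeline might look roughly like the sketch below. It assumes `html.unescape` as the replacement for the entity decoding step, and the script/style, comment, and `p|div|br` patterns are reconstructions of what the answer appears to intend rather than confirmed originals. The module-level function layout is also just one possible arrangement.

```python
import html
import re

# Patterns mirroring the answer's class; the exact tag lists are assumptions.
SCRIPT_SHEET = re.compile(r"<(script|style).*?>.*?</\1>", re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)       # comments may contain ">"
TAG = re.compile(r"<.*?>", re.DOTALL)                # any remaining tag
NWHITES = re.compile(r"\s+")                         # runs of whitespace
P_DIV = re.compile(r"</?(p|div|br).*?>", re.IGNORECASE | re.DOTALL)
NSPACE = re.compile(r"[^\S\n]+")                     # whitespace except newlines
N2RET = re.compile(r"\n\n+")                         # two or more newlines
RETSPACE = re.compile(r"\n ")                        # newline followed by a space


def to_nice_text(markup):
    """Strip HTML tags but keep paragraph-like line breaks."""
    if markup is None:
        return ""
    text = str(markup)
    text = SCRIPT_SHEET.sub("", text)   # drop scripts and style sheets
    text = COMMENT.sub("", text)        # drop comments
    text = NWHITES.sub(" ", text)       # collapse source whitespace
    text = P_DIV.sub("\n", text)        # block-level tags become newlines
    text = TAG.sub("", text)            # remove all other tags
    text = html.unescape(text)          # decode entities such as &amp;
    text = NSPACE.sub(" ", text)
    text = RETSPACE.sub("\n", text)
    text = N2RET.sub("\n\n", text)
    return text.strip()


sample = "<p>Hello &amp; <b>world</b></p><div>Second<br>line</div>"
print(to_nice_text(sample))
```

Note that, as in the original, the `p|div|br` pattern is loose: it also matches other tags whose names start with `p`, such as `<pre>`, so a stricter pattern may be worth considering.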
