Using Beautiful Soup Python module to replace tags with plain text

花落未央 2021-01-07 05:14

I am using Beautiful Soup to extract 'content' from web pages. I know some people have asked this question before and they were all pointed to Beautiful Soup and that's h

2 answers
  •  盖世英雄少女心
    2021-01-07 05:21

    I wanted to flatten certain tags in a document, that is, pull each tag's entire content up into its parent node in place. (My goal was to keep the content of a p tag, including all nested paragraphs, lists, div and span elements, while getting rid of the style and font tags and some horrible Word-to-HTML generator remnants.) I found this rather complicated to do with BeautifulSoup itself, since extract() also removes the content and replaceWith() unfortunately doesn't accept None as an argument. After some wild recursion experiments, I finally decided to use regular expressions, either before or after processing the document with BeautifulSoup, with the following method:

    import re
    def flatten_tags(s, tags):
        # tags: one tag name or a list of tag names
        names = tags if isinstance(tags, str) else "|".join(tags)
        # \b stops "b" from also matching <body> or <br>
        pattern = re.compile(r"<\s*/?\s*(?:%s)\b[^<>]*>" % names)
        return pattern.sub("", s)
    

    The tags argument is either a single tag or a list of tags to be flattened.
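
    To make the behaviour concrete, here is a minimal sketch that applies the helper to a made-up snippet. The helper is restated so the snippet runs on its own; the Python 3 `str` check and the exact pattern used here are assumptions of this sketch, and the sample markup is hypothetical.

```python
import re

def flatten_tags(s, tags):
    # Restated helper (assumption: Python 3, input is a str of HTML).
    names = tags if isinstance(tags, str) else "|".join(tags)
    # Matches opening and closing forms of the named tags, attributes included.
    pattern = re.compile(r"<\s*/?\s*(?:%s)\b[^<>]*>" % names)
    return pattern.sub("", s)

# Hypothetical sample markup, not taken from the original question.
html = '<p><font face="Arial">Hello <span class="x">world</span></font>, <b>bye</b></p>'

# A list of tags: font and span are stripped, their text stays in place.
flat = flatten_tags(html, ["font", "span"])
print(flat)  # <p>Hello world, <b>bye</b></p>

# A single tag name works too.
print(flatten_tags(flat, "b"))  # <p>Hello world, bye</p>
```

    Only the tag markers are removed, so the text inside them ends up directly in the parent. Because this works on the raw string, it will not cope with a `>` inside an attribute value. For comparison, Beautiful Soup 4 provides `Tag.unwrap()`, which replaces a tag with its contents inside the parse tree.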
