Remove all style, script, and HTML tags from an HTML page

面向向阳花 2020-12-31 07:13

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html)  # create a new bs4 object from the html data loaded
        
5 Answers
  •  孤独总比滥情好
    2020-12-31 07:33

    This removes the specified tags, along with their contents, and all HTML comments. Thanks to Kim Hyesung for this code.

    from bs4 import BeautifulSoup
    from bs4 import Comment

    def cleanMe(html):
        # The "html5lib" parser is a separate package (pip install html5lib)
        soup = BeautifulSoup(html, "html5lib")
        # Drop these elements together with everything inside them
        for tag in soup.find_all(['script', 'style', 'meta', 'noscript']):
            tag.extract()
        # Drop HTML comments (<!-- ... -->)
        for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
            comment.extract()
        return soup
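
Note that cleanMe returns a BeautifulSoup object, so the remaining tags (p, div, and so on) are still in the result. If plain text is what you want, as the question title suggests, one option is to call get_text() once the unwanted elements are gone. Below is a minimal sketch of that idea, not a definitive implementation. The function name clean_text and the sample string are hypothetical, and it assumes the built-in "html.parser" backend so that html5lib is not required.

```python
from bs4 import BeautifulSoup, Comment


def clean_text(html):
    """Return only the visible text of an HTML string (hypothetical helper)."""
    # Assumption: the stdlib-backed "html.parser" is good enough for the input
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents should never appear in the output
    for tag in soup.find_all(["script", "style", "meta", "noscript"]):
        tag.extract()

    # Remove HTML comments
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # get_text() drops the remaining tags; split/join collapses whitespace
    return " ".join(soup.get_text(separator=" ").split())


sample = (
    "<p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<style>p { color: red; }</style>"
    "<!-- a comment -->"
)
print(clean_text(sample))  # prints: Hello world
```

The separator=" " argument keeps words from adjacent elements from running together, and the final split/join pass collapses the extra whitespace this introduces.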
    
