Remove all style, scripts, and html tags from an html page

面向向阳花 2020-12-31 07:13

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded


        
5 Answers
  •  无人及你
    2020-12-31 07:25

    It looks like you almost have it. You also need to remove the HTML tags and CSS styling code. Here is my solution (I updated the function):

    from bs4 import BeautifulSoup

    def cleanMe(html):
        # create a new bs4 object from the html data, using Python's built-in parser
        soup = BeautifulSoup(html, "html.parser")
        # remove all javascript and stylesheet elements from the tree
        for script in soup(["script", "style"]):
            script.extract()
        # get the remaining text
        text = soup.get_text()
        # break into lines and strip leading and trailing whitespace on each
        lines = (line.strip() for line in text.splitlines())
        # split on double spaces so multiple headlines on one line each get their own line
        chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
        # drop blank lines
        text = '\n'.join(chunk for chunk in chunks if chunk)
        return text
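If installing bs4 is not an option, roughly the same idea can be sketched with only the standard library's html.parser module. This is a different technique from the function above, not a drop-in replacement. The names TextOnlyParser and clean_html_stdlib are made up for this illustration, and the whitespace handling is simpler than in cleanMe (it only strips each line and drops blank ones).

```python
from html.parser import HTMLParser


class TextOnlyParser(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}  # hypothetical name, chosen for this sketch

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside <script> or <style>
        if self.skip_depth == 0:
            self.parts.append(data)


def clean_html_stdlib(html):
    parser = TextOnlyParser()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # strip each line and drop the blank ones
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

Calling clean_html_stdlib(page_source) on a page's HTML string returns the visible text with one non-empty line per row. For malformed real-world markup, the bs4 version above is likely the more forgiving choice.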
    
