Remove all style, scripts, and HTML tags from an HTML page

面向向阳花 2020-12-31 07:13

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loa

5 Answers
  •  情歌与酒
    2020-12-31 07:46

    Using lxml instead:

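    # Requirements: pip install lxml
    # (on lxml 5.2 and later, lxml.html.clean lives in a separate package: pip install lxml_html_clean)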
    
    import lxml.html.clean
    
    
    def cleanme(content):
        cleaner = lxml.html.clean.Cleaner(
            allow_tags=[''],
            remove_unknown_tags=False,
            style=True,
        )
        html = lxml.html.document_fromstring(content)
        html_clean = cleaner.clean_html(html)
        return html_clean.text_content().strip()
    
    testhtml = "\n\nTHIS IS AN EXAMPLE I need this text captured\n\nAnd this\n\n"
    cleaned = cleanme(testhtml)
    print(cleaned)

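If installing a third-party parser is not an option, the same cleanup can be approximated with the standard library's `html.parser`. This is only a minimal sketch, not the method from the answer above: the names `TextExtractor` and `cleanme_stdlib` and the sample markup are hypothetical, chosen for illustration, and the parser is less forgiving of badly broken HTML than lxml.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse every run of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def cleanme_stdlib(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical sample markup, for illustration only.
sample = (
    "<html><head><title>THIS IS AN EXAMPLE</title>"
    "<style>.call {font-family: Arial;}</style>"
    "<script>getit()</script></head>"
    "<body>I need this text captured<h1>And this</h1></body></html>"
)
print(cleanme_stdlib(sample))
# prints: THIS IS AN EXAMPLE I need this text captured And this
```

Joining chunks with a space keeps text from adjacent elements apart, at the cost of splitting a word that is interrupted by an inline tag (for example `<b>wor</b>d` becomes `wor d`). Join with an empty string instead if that matters more for your input.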