BeautifulSoup4 get_text still has JavaScript

难免孤独 2020-12-13 00:04

I'm trying to remove all the HTML/JavaScript using bs4; however, it doesn't get rid of the JavaScript. I still see it mixed in with the text. How can I get around this?


2 Answers
  • 2020-12-13 00:31

    To prevent encoding errors at the end...

    import sys
    from urllib.request import urlopen

    from bs4 import BeautifulSoup

    url = "http://example.com"  # placeholder: substitute the page you want to scrape
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")

    # remove all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # detach the element from the tree

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    # write UTF-8 bytes directly so the console encoding cannot raise an error
    sys.stdout.buffer.write(text.encode('utf-8') + b'\n')
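
The same steps can be tried without a network request. The following is only a sketch: the function name `visible_text` and the inline `sample` document are made up for illustration, and `html.parser` is assumed as the parser.

```python
from bs4 import BeautifulSoup


def visible_text(html):
    """Return the text of an HTML string with script/style content removed."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements so their contents never reach get_text()
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text()
    # Strip each line, split on double spaces, and drop empty chunks
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


# Hypothetical sample page with one style block and one script block
sample = """<html><head><style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><h1>Title</h1>
<p>Hello world</p></body></html>"""

print(visible_text(sample))
```

Working on a string like this makes it easy to check that neither the CSS rule nor the JavaScript statement survives in the result.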
    
  • 2020-12-13 00:39

    Based partly on the question "Can I remove script tags with BeautifulSoup?":

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = "http://www.cnn.com"
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    
    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.decompose()    # rip it out
    
    # get text
    text = soup.get_text()
    
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    
    print(text)
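
The two answers differ only in calling `extract()` versus `decompose()`. As a rough illustration of that difference (the inline `html` string and all variable names are invented for this sketch, and `html.parser` is assumed): `extract()` hands back the removed tag, while `decompose()` destroys it and returns nothing. For the purpose of cleaning `get_text()` output, either appears to work.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment containing one script element
html = "<div><script>alert('hi');</script><p>Visible</p></div>"

# extract() detaches the tag and returns it, so it can still be inspected
soup_a = BeautifulSoup(html, "html.parser")
removed = soup_a.script.extract()
removed_code = removed.string  # the JavaScript source, still accessible

# decompose() detaches the tag and destroys it; the call returns None
soup_b = BeautifulSoup(html, "html.parser")
result = soup_b.script.decompose()

# Both trees now yield the same text
text_a = soup_a.get_text()
text_b = soup_b.get_text()
print(text_a, text_b)
```

If the removed scripts are never needed again, `decompose()` is the more direct choice; `extract()` is useful when the removed element should be kept around, for example for logging.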
    