BeautifulSoup Grab Visible Webpage Text

北恋 2020-11-22 07:35

Basically, I want to use BeautifulSoup to grab strictly the visible text on a webpage. For instance, this webpage is my test case. And I mainly want to just get the

10 Answers
  •  不要未来只要你来
    2020-11-22 08:00

    The accepted answer from @jbochi does not work for me: the str() call raises an exception because it cannot encode the non-ASCII characters in the BeautifulSoup element. Here is a more succinct way to reduce the example web page to its visible text.

    from bs4 import BeautifulSoup

    html = open('21storm.html').read()
    soup = BeautifulSoup(html, 'html.parser')
    for s in soup(['style', 'script', '[document]', 'head', 'title']):
        s.extract()  # detach elements that are not rendered as page text
    visible_text = soup.get_text()
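
The same idea can be wrapped in a reusable function. This is only a minimal sketch, assuming Python 3 and the `bs4` package; the function name `visible_text`, the `HIDDEN_TAGS` list, and the inline sample HTML are illustrative and do not come from the answer above. It also detaches HTML comments, which the snippet above does not handle explicitly.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical list of tags whose contents a browser does not render as text.
HIDDEN_TAGS = ["style", "script", "head", "title", "meta", "noscript"]


def visible_text(html):
    """Return the text that remains after detaching non-rendered elements."""
    soup = BeautifulSoup(html, "html.parser")

    # Calling the soup object is shorthand for soup.find_all(...).
    for tag in soup(HIDDEN_TAGS):
        tag.extract()

    # Comments are string nodes in the tree; detach them as well.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Join the remaining text nodes with single spaces, trimming whitespace.
    return soup.get_text(separator=" ", strip=True)


if __name__ == "__main__":
    # Illustrative sample page, not the test page from the question.
    sample = (
        "<html><head><title>Page title</title>"
        "<style>p { color: red; }</style></head>"
        "<body><p>Hello</p><script>alert(1)</script>"
        "<!-- note --><p>World</p></body></html>"
    )
    print(visible_text(sample))
```

Note that this only filters by tag name. Text hidden through CSS (for example `display: none`) is still returned, because BeautifulSoup does not evaluate stylesheets.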
    
