Identifying large bodies of text via BeautifulSoup or other python based extractors

独厮守ぢ 2021-01-31 06:25

Given an arbitrary news article, I want to write a web crawler that finds the largest body of text present and extracts it. The intention is to extract the physical news article on

2 Answers
  •  误落风尘
    2021-01-31 07:06

    You're really not going about it the right way, I would say, as all the comments above would attest to.

    That said, this does what you're looking for.

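    from bs4 import BeautifulSoup as BS
    import requests

    # Fetch the page and parse it with an explicit parser.
    html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
    soup = BS(html, 'html.parser')
    # Join the text of every <p> inside the article's main container.
    print('\n\n'.join(k.text for k in soup.find(class_='cnn_strycntntlft').find_all('p')))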
    

    It pulls out only the text, first by finding the main container of all the <p> tags, then by selecting only the <p> tags themselves to get the text, ignoring the other tags in that container.
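
The two steps described above (locate the container, then keep only its <p> tags) rely on a class name that is specific to one site. For an arbitrary article, one possible heuristic is to score every element by the amount of text in its direct <p> children and keep the highest scorer. The following is only a minimal sketch under that assumption, not a robust extractor: the function name `largest_text_block` and the `SAMPLE_HTML` string are hypothetical and made up for illustration, and real pages may need extra filtering (comment sections, sidebars, nested wrappers).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used so the example runs without network access.
SAMPLE_HTML = """
<html><body>
  <div id="nav"><p>Home</p><p>World</p></div>
  <div id="story">
    <p>First paragraph of the article body.</p>
    <script>var tracking = 1;</script>
    <p>Second paragraph with more detail.</p>
  </div>
  <div id="footer"><p>Terms of use</p></div>
</body></html>
"""


def largest_text_block(html):
    """Return the joined <p> text of the element holding the most paragraph text.

    Heuristic (an assumption, not a standard algorithm): the article body is
    the element whose *direct* <p> children contain the most characters.
    Returns an empty string if no element has non-empty <p> children.
    """
    soup = BeautifulSoup(html, "html.parser")
    best_text = ""

    # find_all(True) yields every tag in the document.
    for container in soup.find_all(True):
        # recursive=False restricts the search to direct children only,
        # so each paragraph is credited to its immediate parent.
        paragraphs = [
            p.get_text(" ", strip=True)
            for p in container.find_all("p", recursive=False)
        ]
        text = "\n\n".join(t for t in paragraphs if t)
        if len(text) > len(best_text):
            best_text = text

    return best_text


if __name__ == "__main__":
    print(largest_text_block(SAMPLE_HTML))
```

To try it on a live page, pass `requests.get(url).text` in place of `SAMPLE_HTML`. Because only <p> text is collected, <script> content and other non-paragraph markup inside the winning container is left out.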
