Identifying large bodies of text via BeautifulSoup or other python based extractors

我是研究僧i 提交于 2019-12-02 17:48:16

You might look at the python-readability package which does exactly this for you.

You're really not going about it the right way, I would say, as all the comments above would attest to.

That said, this does what you're looking for.

from bs4 import BeautifulSoup as BS
import requests
html = requests.get('http://www.cnn.com/2013/01/04/justice/ohio-rape-online-video/index.html?hpt=hp_c2').text
soup = BS(html)
print '\n\n'.join([k.text for k in soup.find(class_='cnn_strycntntlft').find_all('p')])

It pulls out only the text, first by finding the main container of all the <p> tags, then by selecting only the <p> tags themselves to get the text; ignoring the <script> and other irrelevant ones.

As was mentioned in the comments, this will only work for CNN--and possibly, only this page. You might need a different strategy for every new webpage.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!