Extracting comments from news articles

Submitted by 浪尽此生 on 2019-12-24 13:44:14

Question


My question is similar to the one asked here: https://stackoverflow.com/questions/14599485/news-website-comment-analysis

I am trying to extract the comments from any news article. For example, I have a news URL here: http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/

I am trying to use BeautifulSoup in Python to extract the comments. However, it seems the comment section is either embedded within an iframe or loaded through JavaScript. Viewing the source through Firebug does not reveal the source of the comments section, but explicitly viewing the source of the comments through the browser's view-source feature does. How do I go about extracting the comments, especially when they come from a different URL embedded within the news web page?

This is what I have done so far, although it is not much:

    from urllib.request import build_opener
    from bs4 import BeautifulSoup

    opener = build_opener()

    url = 'http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html'

    # Download the article and parse the HTML
    urlContent = opener.open(url).read()
    soup = BeautifulSoup(urlContent, 'html.parser')
    title = soup.title.text

    print(title)

    # Write the text of the <body> to a file, dropping non-ASCII characters
    body = soup.find_all('body')
    with open("brain.txt", "w") as outfile:
        for i in body:
            text = i.text.encode('ascii', 'ignore').decode('ascii')
            outfile.write(text + '\n')

Any help with what I need to do, or how to go about it, would be much appreciated.


Answer 1:


The comments are inside an iframe. Look for a frame with id="dsq2".

The iframe has a src attribute, which is a link to the actual page that holds the comments.

So in Beautiful Soup, use css_soup.select("#dsq2") and read the URL from the src attribute. It leads to a page that contains only the comments.

To get the actual comments, fetch the page from src and apply this CSS selector: .post-message p
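
The two steps above could be sketched roughly as follows. This is only an illustration, not a tested scraper for CNN: the function names and the sample markup (`ARTICLE_HTML`, `COMMENTS_HTML`, the example.com URL) are made up for the example, and it assumes the iframe is actually present in the HTML you downloaded. If the frame is injected by JavaScript after the page loads, the first function simply returns None.

```python
from bs4 import BeautifulSoup


def find_comment_frame_url(article_html, frame_id="dsq2"):
    """Return the src of the iframe with the given id, or None if it is absent."""
    soup = BeautifulSoup(article_html, "html.parser")
    frame = soup.select_one("#" + frame_id)
    if frame is None:
        return None
    return frame.get("src")


def extract_comments(comments_html):
    """Return the text of every paragraph matched by '.post-message p'."""
    soup = BeautifulSoup(comments_html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select(".post-message p")]


# Hypothetical stand-in markup so the sketch runs without network access.
ARTICLE_HTML = (
    '<html><body><iframe id="dsq2" src="http://example.com/comments">'
    "</iframe></body></html>"
)
COMMENTS_HTML = (
    '<div class="post-message"><p>First comment</p></div>'
    '<div class="post-message"><p>Second comment</p></div>'
)

if __name__ == "__main__":
    print(find_comment_frame_url(ARTICLE_HTML))
    for comment in extract_comments(COMMENTS_HTML):
        print(comment)
```

With a real article you would download the page, pass its HTML to `find_comment_frame_url`, download the URL it returns, and pass that second page to `extract_comments`.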

If you want to load more comments, clicking the "more comments" button appears to send this request:

http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
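
A request like the one above could be rebuilt in Python along these lines. It is a minimal sketch that only reuses the query parameters visible in that URL. The helper names `build_posts_url` and `fetch_posts` are hypothetical, `YOUR_API_KEY` is a placeholder, and the code assumes the endpoint returns JSON without assuming anything about the layout of that JSON, since the response is not shown here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ENDPOINT = "http://disqus.com/api/3.0/threads/listPostsThreaded"


def build_posts_url(thread, forum, api_key, cursor="2:0:0",
                    limit=50, order="popular"):
    """Build a request URL with the same parameters as the observed one."""
    params = {
        "limit": limit,
        "thread": thread,
        "forum": forum,
        "order": order,
        "cursor": cursor,   # urlencode turns "2:0:0" into "2%3A0%3A0"
        "api_key": api_key,
    }
    return API_ENDPOINT + "?" + urlencode(params)


def fetch_posts(url):
    """Download the URL and decode the body as JSON (assumed format)."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Thread and forum values are taken from the URL above; the key is a placeholder.
    print(build_posts_url("1660715220", "cnn", "YOUR_API_KEY"))
```

Inspect what `fetch_posts` returns before writing any code that depends on its fields, and check the service's terms before sending automated requests.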



Source: https://stackoverflow.com/questions/18997421/extracting-comments-from-news-articles
