Extract News article content from stored .html pages

早过忘川 submitted on 2019-12-02 17:18:39

There are libraries for this in Python too :)

Since you mentioned Java, there's a Python wrapper for boilerpipe that allows you to directly use it inside a python script: https://github.com/misja/python-boilerpipe

If you want to use pure-Python libraries, there are two options:

https://github.com/buriy/python-readability

and

https://github.com/grangier/python-goose

Of the two, I prefer Goose. Be aware, however, that recent versions sometimes fail to extract the text for some reason (my recommendation is to use version 1.0.22 for now).

EDIT: here's some sample code using Goose:

from goose import Goose
from requests import get

response = get('http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general')
extractor = Goose()
article = extractor.extract(raw_html=response.content)
text = article.cleaned_text

Newspaper is becoming increasingly popular. I've only used it superficially, but it looks good. It's Python 3 only.

The quickstart only shows loading from a URL, but you can load from an HTML string with:

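import newspaper

# Load the stored HTML into a string ('article.html' is a placeholder path)
with open('article.html', encoding='utf-8') as f:
    html = f.read()

article = newspaper.Article('')  # a string is required as the `url` argument but is not used
article.set_html(html)
article.parse()      # extract the content from the supplied HTML
text = article.text  # the extracted article body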

Try something like this by visiting the page directly:

## Import modules
from bs4 import BeautifulSoup
import urllib2  # Python 2; on Python 3 use urllib.request instead

## Grab the page
url = 'http://www.example.com'
req = urllib2.Request(url)
page = urllib2.urlopen(req)
content = page.read()
page.close()

## Prepare
soup = BeautifulSoup(content, 'html.parser')

## Parse (a table, for example)
for table in soup.find_all("table", {"class": "myClass"}):
    pass  # ...do something with each table...

If you want to load a file, just replace the part where you grab the page with the file instead. Find out more here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
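As a rough sketch of that file-based variant (written for Python 3 with bs4 installed), something like the following might work. The helper names `load_soup` and `paragraph_text` are made up for illustration, and treating every `<p>` element as article text is only an assumption that will not hold for every site:

```python
from bs4 import BeautifulSoup


def load_soup(path, encoding="utf-8"):
    """Parse a stored HTML file and return a BeautifulSoup object."""
    # Read the saved page from disk instead of fetching it over HTTP.
    with open(path, encoding=encoding) as f:
        content = f.read()
    # "html.parser" ships with Python; lxml or html5lib can be used if installed.
    return BeautifulSoup(content, "html.parser")


def paragraph_text(soup):
    """Join the text of every non-empty <p> element, one paragraph per line."""
    paragraphs = (p.get_text(" ", strip=True) for p in soup.find_all("p"))
    return "\n".join(text for text in paragraphs if text)
```

Calling `paragraph_text(load_soup('saved_page.html'))` (the file name is a placeholder) would then return the paragraph text of the stored page.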

Roman Susi

There are many ways to organize HTML scraping in Python. As said in other answers, the number one tool is BeautifulSoup, but there are others.

Here are useful resources:

There is no universal way of finding the content of an article. HTML5 has the article tag, which hints at the main text, and it may be possible to tune scraping for pages from specific publishing systems, but there is no general way to accurately guess where the text is located. (Theoretically, a machine could deduce the page structure by comparing several articles that are structurally identical but differ in content, but this is probably out of scope here.)
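For pages that do use the HTML5 article tag, a minimal sketch using only the Python 3 standard library could look like this. The class and function names are hypothetical, and the approach simply returns an empty string when a page has no article element, so it is a heuristic rather than a general solution:

```python
from html.parser import HTMLParser


class ArticleTextParser(HTMLParser):
    """Collect the text found inside <article> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <article> elements currently open
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that appears inside an <article>.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article_text(html):
    """Return the text inside <article> tags, or '' if the page has none."""
    parser = ArticleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Navigation, footers and other text outside the article element are ignored, which is the whole point of relying on the tag.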

Also, "Web scraping with Python" may be relevant.

Pyquery example for NYT:

from pyquery import PyQuery as pq
url = 'http://www.nytimes.com/2015/05/19/health/study-finds-dense-breast-tissue-isnt-always-a-high-cancer-risk.html?src=me&ref=general'
d = pq(url=url)
text = d('.story-content').text()

You can use htmllib or HTMLParser to parse your HTML file:

from HTMLParser import HTMLParser

# create a subclass and override the handler methods
class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print "Encountered a start tag:", tag
    def handle_endtag(self, tag):
        print "Encountered an end tag :", tag
    def handle_data(self, data):
        print "Encountered some data  :", data

# instantiate the parser and feed it some HTML
parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')

Sample code taken from the HTMLParser documentation page.
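To apply the same handler idea to a stored page, a possible Python 3 sketch is shown below (in Python 3 the module is html.parser). The names `VisibleTextParser` and `text_from_file` are invented for this example, and it dumps all visible text rather than isolating the article body:

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skipping = 0  # number of open <script>/<style> elements
        self.chunks = []   # visible text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skipping += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skipping > 0:
            self.skipping -= 1

    def handle_data(self, data):
        # Record non-empty text unless we are inside a skipped element.
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())


def text_from_file(path, encoding="utf-8"):
    """Read a stored HTML file and return its visible text, one fragment per line."""
    parser = VisibleTextParser()
    with open(path, encoding=encoding) as f:
        parser.feed(f.read())
    parser.close()
    return "\n".join(parser.chunks)
```

The output still contains titles, menus and similar page furniture, so it is mainly useful as a starting point for further filtering.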
