Question
Given a news article webpage (from any major news source such as the Times or Bloomberg), I want to identify the main article content on that page and discard the other miscellaneous elements such as ads, menus, sidebars, and user comments.
What's a generic way of doing this that will work on most major news sites?
What are some good tools or libraries for data mining? (Preferably Python-based.)
Answer 1:
There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.
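A minimal sketch of that heuristic, assuming requests and BeautifulSoup are installed; the URL is a placeholder, and scoring each div by the text of its direct <p> children (so huge wrapper divs don't automatically win) is just one illustrative choice:

# Sketch: pick the div whose direct paragraph children hold the most visible text.
import requests
from bs4 import BeautifulSoup

def largest_text_element(url):
    html = requests.get(url).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop elements that never hold article text.
    for tag in soup(["script", "style", "nav", "header", "footer", "aside"]):
        tag.decompose()

    best, best_len = None, 0
    for div in soup.find_all("div"):
        # Count only text in this div's direct <p> children.
        text_len = sum(len(p.get_text(strip=True))
                       for p in div.find_all("p", recursive=False))
        if text_len > best_len:
            best, best_len = div, text_len
    return best

main = largest_text_element("https://example.com/some-article")
if main is not None:
    print(main.get_text(separator="\n", strip=True))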
Answer 2:
There are a number of ways to do it, but none will always work. Here are the two easiest:
- If it's a known, finite set of websites: have your scraper convert each URL from the normal URL to the print-view URL for the given site (this cannot really be generalized across sites).
- Use the arc90 readability algorithm (the reference implementation is in JavaScript): http://code.google.com/p/arc90labs-readability/. The short version: it looks for divs with p tags inside them. It will not work for some websites but is generally pretty good; a Python sketch using a port of it follows below.
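A Python port of the arc90 algorithm is available as the readability-lxml package; a minimal sketch of how it is typically used, assuming the package is installed (pip install readability-lxml) and with a placeholder URL:

# Sketch using the readability-lxml port of the arc90 readability algorithm.
import requests
from readability import Document

html = requests.get("https://example.com/some-article").text
doc = Document(html)
print(doc.title())    # detected article title
print(doc.summary())  # HTML of the element judged to be the main content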
Answer 3:
A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but it generally works well for news sites, where the article is usually the biggest grouping of text, even when broken up into multiple div/p tags.
You'd use the script like: python webarticle2text.py <url>
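Purely to illustrate the grouping idea (this is not the actual webarticle2text code), here is a rough sketch that groups text blocks by their parent element and keeps the largest group, assuming requests and lxml are installed and using a placeholder URL:

# Illustrative sketch: group text by DOM parent, keep the biggest group.
from collections import defaultdict
import requests
from lxml import html as lxml_html

def main_text(url):
    tree = lxml_html.fromstring(requests.get(url).text)
    groups = defaultdict(list)
    for node in tree.iter():
        if not isinstance(node.tag, str):      # skip comments/processing instructions
            continue
        if node.tag in ("script", "style"):
            continue
        text = (node.text or "").strip()
        if text:
            # Group text blocks by their parent element, a crude stand-in
            # for "same depth/region of the DOM".
            groups[node.getparent()].append(text)
    # The group holding the most characters is assumed to be the article body.
    best = max(groups.values(), key=lambda blocks: sum(len(b) for b in blocks))
    return "\n".join(best)

print(main_text("https://example.com/some-article"))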
Answer 4:
Diffbot offers a free (10,000 URLs) API to do that. I don't know if that approach is what you are looking for, but it might help someone: http://www.diffbot.com/
Answer 5:
For a solution in Java, have a look at https://code.google.com/p/boilerpipe/:
The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.
The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.
There is also a Python wrapper around this, available here:
https://github.com/misja/python-boilerpipe
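A minimal usage sketch for the Python wrapper, assuming it installs cleanly (it drives the Java library through JPype, so a working JVM is required) and with a placeholder URL:

# Sketch: extract the main text of a page with python-boilerpipe.
from boilerpipe.extract import Extractor

extractor = Extractor(extractor='ArticleExtractor',
                      url='https://example.com/some-article')
print(extractor.getText())  # plain text of the detected main content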
Answer 6:
It might be more useful to extract the RSS feeds (<link type="application/rss+xml" href="..."/>) on that page and parse the data in the feed to get the main content.
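A sketch of that approach, assuming requests, BeautifulSoup and feedparser are installed; the URL is a placeholder, and keep in mind that many feeds only carry summaries rather than the full article text:

# Sketch: discover the page's RSS feed from its <link type="application/rss+xml">
# tag, then read the feed entries with feedparser.
from urllib.parse import urljoin
import requests
import feedparser
from bs4 import BeautifulSoup

page_url = "https://example.com/some-article"
soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
link = soup.find("link", type="application/rss+xml")

if link is not None:
    feed_url = urljoin(page_url, link["href"])  # feed links are often relative
    feed = feedparser.parse(feed_url)
    for entry in feed.entries[:5]:
        print(entry.title)
        print(entry.get("summary", ""))  # full text is not guaranteed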
Answer 7:
Another possibility for separating "real" content from noise is to measure the HTML density of the parts of an HTML page.
You will need a bit of experimentation with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to specify the exact bounds of the HTML segment after having identified the interesting content.
Update: Just found out the URL above does not work right now; here is an alternative link to a cached version on archive.org.
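To illustrate the density idea, here is a rough sketch that scores elements by the ratio of visible text length to raw markup length, assuming requests and lxml are installed; the URL is a placeholder and the thresholds are arbitrary starting points to experiment with, as noted above:

# Sketch: flag elements whose text-to-markup ratio is high.
import requests
from lxml import etree
from lxml import html as lxml_html

tree = lxml_html.fromstring(requests.get("https://example.com/some-article").text)

for el in tree.iter("div", "p", "article", "section"):
    markup_len = len(etree.tostring(el))          # length of the raw markup
    text_len = len(el.text_content().strip())     # length of the visible text
    if markup_len == 0 or text_len < 200:
        continue
    density = text_len / markup_len
    if density > 0.5:                             # arbitrary threshold
        print(density, el.tag, el.text_content()[:80])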
Answer 8:
Check the following script. It is really amazing:
from newspaper import Article
URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)
article.download()        # fetch the HTML
print(article.html)
article.parse()           # extract the article fields from the HTML
print(article.authors)
print(article.publish_date)
#print(article.text)
print(article.top_image)
print(article.movies)
article.nlp()             # compute keywords and a summary
print(article.keywords)
print(article.summary)
More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper. You should install it using:
pip3 install newspaper3k
Answer 9:
I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:
http://feeds.guardian.co.uk/theguardian/rss
I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...
Source: https://stackoverflow.com/questions/4672060/web-scraping-how-to-identify-main-content-on-a-webpage