Web scraping - how to identify main content on a webpage


There's no way to do this that's guaranteed to work, but one strategy you might use is to try to find the element with the most visible text inside of it.
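A minimal sketch of that strategy with BeautifulSoup (pip install beautifulsoup4 requests): walk down from body for as long as a single child still holds most of the visible text. The 0.6 cut-off is an arbitrary assumption, not part of any standard algorithm:

import requests
from bs4 import BeautifulSoup

def main_content_candidate(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # drop text that is never rendered
    node = soup.body or soup
    while True:
        total = len(node.get_text(strip=True)) or 1
        children = node.find_all(True, recursive=False)
        dominant = max(children, key=lambda c: len(c.get_text(strip=True)), default=None)
        # Stop descending once no single child holds most of the visible text.
        if dominant is None or len(dominant.get_text(strip=True)) / total < 0.6:
            return node
        node = dominant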

There are a number of ways to do it, but none will always work. Here are the two easiest:

  • If it's a known, finite set of websites: have your scraper convert each URL from the normal URL to the print URL for the given site (this cannot really be generalized across sites).
  • Use the arc90 readability algorithm (the reference implementation is in JavaScript): http://code.google.com/p/arc90labs-readability/. The short version of the algorithm: it looks for divs with p tags within them. It will not work for some websites but is generally pretty good; a Python port is sketched just below.
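If you'd rather not drive the JavaScript reference implementation, readability-lxml is a Python port of the arc90 algorithm. A minimal usage sketch (pip install readability-lxml requests; the article URL is a placeholder):

import requests
from readability import Document

html = requests.get("https://example.com/some-article").text
doc = Document(html)
print(doc.title())    # best-guess article title
print(doc.summary())  # cleaned-up HTML of the main content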

A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content. It's not perfect, but it generally works well for news sites, where the article is usually the biggest grouping of text, even when it's broken up into multiple div/p tags.

You'd use the script like: python webarticle2text.py <url>
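The script itself does more, but the rough idea can be sketched with lxml (pip install lxml requests): bucket text blocks by their DOM depth and keep the depth that accumulates the most text.

from collections import defaultdict
import requests
from lxml import html

def extract_by_depth(url):
    tree = html.fromstring(requests.get(url).content)
    for bad in tree.xpath("//script|//style"):
        bad.drop_tree()
    buckets = defaultdict(list)
    for el in tree.iter():
        if not isinstance(el.tag, str):
            continue  # skip comments and processing instructions
        text = (el.text or "").strip()
        if text:
            depth = sum(1 for _ in el.iterancestors())  # distance from the root
            buckets[depth].append(text)
    # The depth level holding the most total text is assumed to be the article.
    best = max(buckets.values(), key=lambda blocks: sum(map(len, blocks)))
    return "\n".join(best)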

Diffbot offers a free (10,000 URLs) API to do that. I don't know if that approach is what you are looking for, but it might help someone: http://www.diffbot.com/
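For what it's worth, a hedged sketch of a call against Diffbot's Article API (pip install requests); YOUR_TOKEN and the article URL are placeholders, and the response layout assumes the v3 API:

import requests

resp = requests.get(
    "https://api.diffbot.com/v3/article",
    params={"token": "YOUR_TOKEN", "url": "https://example.com/some-article"},
)
data = resp.json()
article = data["objects"][0]   # first extracted object
print(article["title"])
print(article["text"])         # plain text of the main content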

For a solution in Java, have a look at https://code.google.com/p/boilerpipe/:

The boilerpipe library provides algorithms to detect and remove the surplus "clutter" (boilerplate, templates) around the main textual content of a web page.

The library already provides specific strategies for common tasks (for example: news article extraction) and may also be easily extended for individual problem settings.

There is also a Python wrapper around it, available here:

https://github.com/misja/python-boilerpipe
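A minimal usage sketch of the wrapper (pip install boilerpipe; it drives the Java library through a JVM, and the article URL is a placeholder):

from boilerpipe.extract import Extractor

extractor = Extractor(extractor="ArticleExtractor",
                      url="https://example.com/some-article")
print(extractor.getText())  # plain text of the detected main content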

It might be more useful to extract the RSS feed advertised on the page (<link type="application/rss+xml" href="..."/>) and parse the data in the feed to get the main content.
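A short sketch of discovering that link tag with BeautifulSoup (pip install beautifulsoup4 requests; the page URL is a placeholder):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

page_url = "https://example.com/some-article"
soup = BeautifulSoup(requests.get(page_url).text, "html.parser")
link = soup.find("link", type="application/rss+xml")
if link is not None:
    print(urljoin(page_url, link["href"]))  # feed URLs are often relative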

Another possibility for separating "real" content from noise is measuring the HTML density of the parts of an HTML page.

You will need to experiment a bit with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to pin down the exact bounds of the HTML segment after identifying the interesting content.

Update: just found out the URL above does not work right now; here is an alternative link to a cached version on archive.org.
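A minimal sketch of that density measurement with lxml (pip install lxml requests); the URL, the 0.5 density cut-off, and the 200-character floor are all placeholder assumptions that need the tuning mentioned above:

import requests
from lxml import html, etree

tree = html.fromstring(requests.get("https://example.com/some-article").content)
for el in tree.xpath("//div|//article|//section"):
    markup = etree.tostring(el, encoding="unicode")
    text = el.text_content()
    density = len(text) / max(len(markup), 1)   # share of the markup that is text
    if density > 0.5 and len(text) > 200:       # thresholds need experimentation
        print(round(density, 2), text[:80])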

Check the following script. It is really amazing:

from newspaper import Article

URL = "https://www.ksat.com/money/philippines-stops-sending-workers-to-qatar"
article = Article(URL)

article.download()          # fetch the raw HTML
print(article.html)

article.parse()             # extract title, authors, date, body, media
print(article.authors)
print(article.publish_date)
#print(article.text)        # the extracted main content
print(article.top_image)
print(article.movies)

article.nlp()               # keyword and summary extraction (needs nltk data)
print(article.keywords)
print(article.summary)

More documentation can be found at http://newspaper.readthedocs.io/en/latest/ and https://github.com/codelucas/newspaper. You should install it using:

pip3 install newspaper3k

I wouldn't try to scrape it from the web page - too many things could mess it up - but would instead see which websites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles:

http://feeds.guardian.co.uk/theguardian/rss
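A quick sketch of reading that feed with feedparser (pip install feedparser); the fields below are standard feedparser attributes, though how much body text the summary carries depends on the feed:

import feedparser

feed = feedparser.parse("http://feeds.guardian.co.uk/theguardian/rss")
for entry in feed.entries[:5]:
    print(entry.title)
    print(entry.link)
    print(entry.get("summary", "")[:120])  # article text, when the feed carries it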

I don't know if The Times (The London Times, not NY) has one because it's behind a paywall. Good luck with that...
