Python - Easy way to scrape Google, download top N hits (entire .html documents) for given search?

Submitted by 我只是一个虾纸丫 on 2019-12-04 03:23:08
Mark Longair

The official way to get results from Google programmatically is to use Google's Custom Search API. As icktoofay comments, other approaches (such as directly scraping the results or using the xgoogle module) break Google's terms of service. Because of that, you might want to consider using the API from another search engine, such as the Bing API or Yahoo!'s service.
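As a rough illustration of the API route, here is a minimal sketch against the Custom Search JSON API using only the standard library. The function names are made up for this example, and `api_key` and `engine_id` are placeholders for credentials you would have to create yourself; quotas and pricing are not covered here.

```python
import json
import urllib.parse
import urllib.request

# Endpoint of the Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, num=10, start=1):
    """Build the request URL. api_key and engine_id are placeholders
    for your own credentials (assumptions, not values from this thread)."""
    params = {
        "key": api_key,
        "cx": engine_id,
        "q": query,
        "num": num,      # results per request (the API allows at most 10)
        "start": start,  # 1-based index of the first result, for paging
    }
    return API_ENDPOINT + "?" + urllib.parse.urlencode(params)


def extract_links(response_text):
    """Pull the result URLs out of a JSON response body."""
    data = json.loads(response_text)
    # "items" is absent when there are no results, so default to [].
    return [item["link"] for item in data.get("items", [])]


def search(query, api_key, engine_id, num=10):
    """Run one search request and return the result URLs (needs network)."""
    url = build_search_url(query, api_key, engine_id, num=num)
    with urllib.request.urlopen(url, timeout=10) as response:
        return extract_links(response.read().decode("utf-8"))
```

To get more than one page of hits, call `build_search_url` again with `start` advanced by `num`.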

Check out BeautifulSoup for scraping the content out of web pages. It is designed to be very tolerant of broken markup, which helps because not all result pages are well-formed. So you should be able to:

  • Request http://www.google.ca/search?q=QUERY_HERE
  • Extract and follow result links using BeautifulSoup (result links appear to be marked with class="r")
  • Extract text from result pages using BeautifulSoup
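The two extraction steps above can be sketched roughly as follows. To keep the example dependency-free it uses the standard library's `html.parser` in place of BeautifulSoup; with BeautifulSoup the link step would be closer to `soup.select(".r a")`. The class names `ResultLinkParser` and `TextExtractor` are invented for this sketch, and the class="r" marker is taken from the observation above, so it may no longer match what Google serves.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside elements with class "r"."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._open_tag = None  # tag name of the enclosing class="r" element
        self._nesting = 0      # how many tags of that name are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._open_tag is None:
            if "r" in (attrs.get("class") or "").split():
                self._open_tag = tag
                self._nesting = 1
        elif tag == self._open_tag:
            self._nesting += 1
        if self._open_tag is not None and tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._nesting -= 1
            if self._nesting == 0:
                self._open_tag = None


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_result_links(html):
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.links


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

`extract_result_links` would be applied to the search results page and `extract_text` to each page the links point at.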

As mentioned, scraping Google violates their TOS. That said, that's probably not the answer you're looking for.

There's a PHP script available that does a good job of scraping Google: http://google-scraper.squabbel.com/ Just give it a keyword and the number of results you want, and it'll return all the results for you. Then parse out the URLs it returns and use urllib or curl to fetch the HTML source of each one, and you're done.
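For the final step, fetching the HTML source with urllib, a minimal Python 3 sketch might look like this. It assumes you already have a list of result URLs from whatever source you chose; the function names, the `hits` output directory, and the file-naming scheme are all invented for illustration.

```python
import os
import re
import urllib.parse
import urllib.request


def filename_for(rank, url):
    """Derive a filesystem-safe file name from the hit's rank and host."""
    host = urllib.parse.urlparse(url).netloc or "unknown"
    safe_host = re.sub(r"[^A-Za-z0-9.-]", "_", host)
    return "%03d_%s.html" % (rank, safe_host)


def download_top_hits(urls, n, out_dir="hits", timeout=10):
    """Save the raw HTML of the first n URLs into out_dir.

    Returns the paths of the files that were written; URLs that fail
    to download are skipped rather than aborting the whole run.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for rank, url in enumerate(urls[:n], start=1):
        # Some servers reject requests without a User-Agent header.
        request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                body = response.read()
        except (OSError, ValueError) as exc:  # URLError is a subclass of OSError
            print("skipping %s: %s" % (url, exc))
            continue
        path = os.path.join(out_dir, filename_for(rank, url))
        with open(path, "wb") as handle:  # write bytes as received, no re-encoding
            handle.write(body)
        saved.append(path)
    return saved
```

Writing the response bytes unchanged keeps each saved document identical to what the server sent, which matters if the pages use different encodings.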

You also really shouldn't attempt to scrape Google unless you have more than 100 proxy servers. Google will readily ban your IP temporarily after only a few attempts.
