Google Search Web Scraping with Python

流过昼夜 提交于 2019-11-27 16:56:05

问题


I've been learning a lot of python lately to work on some projects at work.

Currently I need to do some web scraping with google search results. I found several sites that demonstrated how to use ajax google api to search, however after attempting to use it, it appears to no longer be supported. Any suggestions?

I've been searching for quite a while to find a way but can't seem to find any solutions that currently work.


回答1:


You can always directly scrape Google results. To do this, you can use the URL https://google.com/search?q=<Query> this will return the top 10 search results.

Then you can use lxml for example to parse the page. Depending on what you use, you can either query the resulting node tree via a CSS-Selector (.r a) or using a XPath-Selector (//h3[@class="r"]/a)

In some cases the resulting URL will redirect to Google. Usually it contains a query-parameter qwhich will contain the actual request URL.

Example code using lxml and requests:

from urllib.parse import urlencode, urlparse, parse_qs

from lxml.html import fromstring
from requests import get

raw = get("https://www.google.com/search?q=StackOverflow").text
page = fromstring(raw)

for result in page.cssselect(".r a"):
    url = result.get("href")
    if url.startswith("/url?"):
        url = parse_qs(urlparse(url).query)['q']
    print(url[0])

A note on google banning your IP: In my experience, google only bans if you start spamming google with search requests. It will respond with a 503 if Google thinks you are bot.




回答2:


Here is another service that can be used for scraping SERPs (https://zenserp.com) It does not require a client and is cheaper.

Here is a python code sample:

import requests

headers = {
    'apikey': '',
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)



回答3:


You can also use a third party service like Serp API that is a Google search engine results. It solves the issues of being blocked, and you don't have to rent proxies and do the result parsing yourself.

It's easy to integrate with Python:

from lib.google_search_results import GoogleSearchResults

params = {
    "q" : "Coffee",
    "location" : "Austin, Texas, United States",
    "hl" : "en",
    "gl" : "us",
    "google_domain" : "google.com",
    "api_key" : "demo",
}

query = GoogleSearchResults(params)
dictionary_results = query.get_dictionary()

GitHub: https://github.com/serpapi/google-search-results-python



来源:https://stackoverflow.com/questions/38619478/google-search-web-scraping-with-python

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!