Error with Google search in Python: 503 Service Unavailable

Posted by 笑着哭i on 2019-12-22 14:54:22

Question


When I run the following in a Python console:

from google import search
urls = search("site:facebook.com inurl:login", stop=20)
for url in urls:
    print(url)

in order to search for login pages, I get this error:

urllib.error.HTTPError: HTTP Error 503: Service Unavailable

However, the same search works when I run it manually in Google. Might Google be blocking my query?


Answer 1:


Google does try to prevent "unexpected" queries from going through. In the normal browser UI it would serve a CAPTCHA instead. It takes into account the traffic pattern (searches that arrive too rapidly or use "smart" queries, IP blocks known to be used by spammers) and the behavior of the client.

You can examine the error's details by catching it:

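import urllib.error
from google import search

try:
    urls = search("site:facebook.com inurl:login", stop=20)
    for url in urls:  # iterate inside the try block, since results may be fetched lazily
        print(url)
except urllib.error.HTTPError as httperr:
    print(httperr.headers)  # Dump the headers to see if there's more information
    print(httperr.read())   # You can even read this error object just like a normal response file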
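
Since overly rapid automated requests are one of the signals mentioned above, one common mitigation is to slow down and retry after a 503. The following is a minimal sketch of that idea, not a guaranteed fix: `retry_on_503` and its parameters are hypothetical names that are not part of the `google` package, and Google may keep blocking the client regardless.

```python
import time
import urllib.error


def retry_on_503(fetch, attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call fetch() and retry with exponential backoff on HTTP 503.

    fetch is any zero-argument callable (hypothetical helper for illustration).
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            # Re-raise anything that is not a 503, and the final 503 as well
            if err.code != 503 or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 5s, 10s, 20s, ...
```

It could be used as `retry_on_503(lambda: list(search("site:facebook.com inurl:login", stop=20)))`, where `list()` forces the results to be fetched inside the retried call.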



Answer 2:


As Cong Ma said in his answer, running many automated searches against Google will get you blocked, and you will receive error 503. The only Google API for searches that currently works is the Google Custom Search API. The problem is that it was designed to search your own pages. There is an option to make it search all pages (see this answer), but even then you are limited to 100 searches per day. Other APIs, such as Bing and Yahoo, used to be an option, but neither of them is free anymore. The only free API that does internet searches is the FAROO API.

There is still one more way to run a Google search: Selenium WebDriver. Selenium imitates browser usage and can drive Firefox, Chrome, Edge or Safari. It actually opens the browser (Chrome, for example) and runs your search, which is annoying because you don't want to see the browser window. The solution is to use PhantomJS, a headless browser. Download it from here, extract it, and see how to use it in the example below (I wrote a simple class that you can use; you just need to change the path to PhantomJS):

import time
from urllib.parse import quote_plus
from selenium import webdriver


class Browser:

    def __init__(self, path, initiate=True, implicit_wait_time = 10, explicit_wait_time = 2):
        self.path = path
        self.implicit_wait_time = implicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        self.explicit_wait_time = explicit_wait_time    # http://www.aptuz.com/blog/selenium-implicit-vs-explicit-waits/
        if initiate:
            self.start()
        return

    def start(self):
        self.driver = webdriver.PhantomJS(self.path)  # use the path stored on the instance, not a global
        self.driver.implicitly_wait(self.implicit_wait_time)
        return

    def end(self):
        self.driver.quit()
        return

    def go_to_url(self, url, wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        self.driver.get(url)
        print('[*] Fetching results from: {}'.format(url))
        time.sleep(wait_time)
        return

    def get_search_url(self, query, page_num=0, per_page=10, lang='en'):
        query = quote_plus(query)
        url = 'https://www.google.hr/search?q={}&num={}&start={}&hl={}'.format(query, per_page, page_num*per_page, lang)  # hl sets the interface language
        return url

    def scrape(self):
        # The XPath might change in the future
        links = self.driver.find_elements_by_xpath("//h3[@class='r']/a[@href]")  # all links inside h3 tags with class "r"
        results = []
        for link in links:
            d = {'url': link.get_attribute('href'),
                 'title': link.text}
            results.append(d)
        return results

    def search(self, query, page_num=0, per_page=10, lang='en', wait_time = None):
        if wait_time is None:
            wait_time = self.explicit_wait_time
        url = self.get_search_url(query, page_num, per_page, lang)
        self.go_to_url(url, wait_time)
        results = self.scrape()
        return results




path = '<YOUR PATH TO PHANTOMJS>/phantomjs-2.1.1-windows/bin/phantomjs.exe'  # Set your path to PhantomJS
br = Browser(path)
results = br.search('site:facebook.com inurl:login')
for r in results:
    print(r)

br.end()
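
For the Google Custom Search API route mentioned above, requests go to the Custom Search JSON API endpoint with an API key and a search engine ID. The sketch below only builds the request URL and sends nothing. `build_cse_url` is a hypothetical helper, and the key and engine ID are placeholders that you would have to create in your own Google account.

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_cse_url(query, api_key, engine_id, num=10, start=1):
    """Build a Custom Search JSON API request URL (hypothetical helper)."""
    params = {
        "key": api_key,    # API key (placeholder value below)
        "cx": engine_id,   # search engine ID (placeholder value below)
        "q": query,
        "num": num,        # results per request; the API allows at most 10
        "start": start,    # 1-based index of the first result
    }
    return CSE_ENDPOINT + "?" + urlencode(params)


print(build_cse_url("site:facebook.com inurl:login", "YOUR_API_KEY", "YOUR_ENGINE_ID"))
```

Fetching such a URL should return JSON, with the individual results expected under an `items` key. Each request counts against the daily quota described above.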


Source: https://stackoverflow.com/questions/43630630/error-with-google-search-in-python-503-service-unavailable
