Updated approach to Google search with python

I was trying to use xgoogle but I has not been updated for 3 years and I just keep getting no more than 5 results even if I set 100 results per page. If anyone uses xgoogle without any problem please let me know.

Now, since the only available(apparently) wrapper is xgoogle, the option is to use some sort of browser, like mechanize, but that is gonna make the code entirely dependant on google HTML and they might change it a lot.

Final option is to use the Custom search API that google offers, but is has a redicolous 100 requests per day limit and a pricing after that.

I need help on which direction should I go, what other options do you know of and what works for you.

Thanks !

It only needs a minor patch.

The function GoogleSearch._extract_result (Line 237 of search.py) calls GoogleSearch._extract_description (Line 258) which fails causing _extract_result to return None for most of the results therefore showing fewer results than expected.

Fix:

In search.py, change Line 259 from this:

desc_div = result.find('div', {'class': re.compile(r'\bs\b')})

to this:

desc_div = result.find('span', {'class': 'st'})

I tested using:

#!/usr/bin/python
#
# This program does a Google search for "quick and dirty" and returns
# 200 results.
#

from xgoogle.search import GoogleSearch, SearchError

class give_me(object):
    def __init__(self, query, target):
        self.gs = GoogleSearch(query)
        self.gs.results_per_page = 50
        self.current = 0
        self.target = target
        self.buf_list = []

    def __iter__(self):
        return self

    def next(self):
        if self.current >= self.target:
            raise StopIteration
        else:
            if(not self.buf_list):
                self.buf_list = self.gs.get_results()
            self.current += 1
            return self.buf_list.pop(0)

try:
    sites = {}
    for res in give_me("quick and dirty", 200):
        t_dict = \
        {
            "title" : res.title.encode('utf8'),
            "desc" : res.desc.encode('utf8'),
            "url" : res.url.encode('utf8')
        }
        sites[t_dict["url"]] = t_dict
    print t_dict
except SearchError, e:
    print "Search failed: %s" % e

I think you misunderstand what xgoogle is. xgoogle is not a wrapper; it's a library that fakes being a human user with a browser, and scrapes the results. It's heavily dependent on the format of Google's search queries and results pages as of 2009, so it's no surprise that it doesn't work the same in 2013. See the announcement blog post for more details.

You can, of course, hack up the xgoogle source and try to make it work with Google's current format (as it turns out, they've only broken xgoogle by accident, and not very badly…), but it's just going to break again.

Meanwhile, you're trying to get around Google's Terms of Service:

Don’t misuse our Services. For example, don’t interfere with our Services or try to access them using a method other than the interface and the instructions that we provide.

They're been specifically asked about exactly what you're trying to do, and their answer is:

Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google.

And you even say that's explicitly what you want to do:

Final option is to use the Custom search API that google offers, but is has a redicolous 100 requests per day limit and a pricing after that.

So, you're looking for a way to access Google search using a method other than the interface they provide, in a deliberate attempt to get around their free usage quota without paying. They are completely within their rights to do anything they want to break your code—and if they get enough hits from people doing things kind of thing, they will do so.

(Note that when a program is scraping the results, nobody's seeing the ads, and the ads are what pay for the whole thing.)

Of course nobody's forcing you to use Google. EntireWeb has an free "unlimited" (as in "as long as you don't use too much, and we haven't specified the limit") search API. Bing gives you a higher quota, and amortized by month instead of by day. Yahoo BOSS is flexible and super-cheap (and even offers a "discount server" that provides lower-quality results if it's not cheap enough), although I believe you're forced to type the ridiculous exclamation point. If none of them are good enough for you… then pay for Google.

来源：https://stackoverflow.com/questions/17559709/updated-approach-to-google-search-with-python

标签

python

api

rest

google-search