Scraping and parsing Google search results using Python


Question


I asked a question about realizing a general idea to crawl and save webpages. Part of the original question was: how to crawl and save a lot of "About" pages from the Internet.

With some further research, I found several options to go ahead with, for both scraping and parsing (listed at the bottom).

Today I ran into a Ruby discussion about how to scrape Google search results. This provides a great alternative for my problem, saving all the effort on the crawling part.

The new question is: in Python, how to scrape Google search results for a given keyword, in this case "About", and finally get the links for further parsing. What are the best methods and libraries to go ahead with (measured by ease of learning and ease of implementation)?

P.S. This website implements exactly the same thing, but it is closed-source and charges for more results. I'd prefer to do it myself if no open-source option is available, and learn more Python in the meantime.

Oh, and by the way, advice on parsing the links out of the search results would be nice too. Again, easy to learn and easy to implement; I just started learning Python. :P


Final update: problem solved. The code below uses xgoogle; please read the note in the section below to make xgoogle work.

# Python 2 code; xgoogle is a Python 2 library.
import time, random
from xgoogle.search import GoogleSearch, SearchError

f = open('a.txt','wb')

for i in range(0,2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    # Try not to annoy Google: wait a random short time between requests
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")

print "Done"
f.close()

Note on xgoogle (answered below by Mike Pennington): the latest version from its GitHub no longer works by default, probably due to changes in Google's search results. Two replies (a, b) on the tool's home page give a solution, and it currently still works with that tweak. But it may stop working again some day due to Google's changes/blocking.


Resources known so far:

  • For scraping, Scrapy seems to be a popular choice, and a webapp called ScraperWiki is very interesting; there is another project that extracts its library for offline/local use. Mechanize was brought up several times in different discussions too.

  • For parsing HTML, BeautifulSoup seems to be one of the most popular choices; lxml too. (A small sketch combining the two follows below.)
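
As an illustration of how the parsing side might look (a minimal sketch added here, not from the original post; the URL is a placeholder), requests can fetch a page and BeautifulSoup with the lxml parser can pull out its links:

import requests
from bs4 import BeautifulSoup

# Placeholder page; any fetched "About" page would do.
resp = requests.get("https://example.com/about")
soup = BeautifulSoup(resp.text, "lxml")  # lxml as the underlying parser

# Collect every absolute link on the page for further processing.
links = [a["href"] for a in soup.find_all("a", href=True)
         if a["href"].startswith("http")]
print(links)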


Answer 1:


You may find xgoogle useful... much of what you seem to be asking for is there...




Answer 2:


There is a twill lib for emulating browser. I used it when had a necessity to login with google email account. While it's a great tool with a great idea, it's pretty old and seems to have a lack of support nowadays (the latest version is released in 2007). It might be useful if you want to retrieve results that require cookie-handling or authentication. Likely that twill is one of the best choices for that purposes. BTW, it's based on mechanize.
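
For reference, here is a minimal sketch of the kind of form-driven browsing twill automates, written directly against mechanize (the library twill builds on). The URL and form field names are placeholders, and a modern Google login will not work this way; treat it purely as an illustration of the pattern:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)          # ignore robots.txt for this sketch
br.addheaders = [("User-Agent", "Mozilla/5.0")]

# Placeholder login page and field names -- adjust for the real form.
br.open("https://example.com/login")
br.select_form(nr=0)                 # select the first form on the page
br["username"] = "me@example.com"
br["password"] = "secret"
resp = br.submit()                   # cookies are kept for later requests

print(resp.read()[:200])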

As for parsing, you are right: BeautifulSoup and Scrapy are great. One of the cool things about BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example).
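
A quick illustration of that tolerance (a small sketch added here, not from the original answer): BeautifulSoup happily parses markup with unclosed tags:

from bs4 import BeautifulSoup

# Deliberately broken markup: unclosed <b> and <p> tags.
broken = "<p>First result <b>bold text<p>Second result"
soup = BeautifulSoup(broken, "html.parser")

# The repaired tree can still be queried normally.
for p in soup.find_all("p"):
    print(p.get_text())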




Answer 3:


Have a look at this awesome urllib wrapper for web scraping https://github.com/mattseh/python-web/blob/master/web.py




Answer 4:


This one works well at the moment. For any search, the scraper can fetch 100 items by going through several result pages. I tried to wrap the code in a function to make it cleaner, but an IPv4 issue comes up and the page gets redirected to one with a captcha. I'm still confused why it works as-is but fails when wrapped in a function. By the way, the scraper looks a bit awkward because I used the same for loop twice, so that it doesn't skip the content of the first page.

import requests
from bs4 import BeautifulSoup

search_item = "excel"
base = "http://www.google.de"
url = "http://www.google.de/search?q=" + search_item

# First results page
response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".r a"):
    print(item.text)

# Follow the pagination links (".fl") to the remaining result pages
for next_page in soup.select(".fl"):
    res = requests.get(base + next_page.get('href'))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select(".r a"):
        print(item.text)
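
One thing that sometimes helps with the captcha redirect mentioned above (an assumption added here, not part of the original answer) is sending a desktop User-Agent header and pausing between page requests:

import time
import requests
from bs4 import BeautifulSoup

# A desktop User-Agent plus short pauses can reduce the chance of being
# redirected to a captcha page, though Google may still block heavy use.
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get("http://www.google.de/search?q=excel", headers=headers)
soup = BeautifulSoup(response.text, "lxml")
for item in soup.select(".r a"):
    print(item.text)

time.sleep(3)  # wait before requesting the next page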



Answer 5:


Another option for scraping Google search results with Python is ZenSERP.

I like the API-first approach, which is easy to use, and the JSON results are easily integrated into our solution.

Here is an example curl request:

curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"

And the response:

{
  "q": "Pied Piper",
  "domain": "google.com",
  "location": "United States",
  "language": "English",
  "url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
  "total_results": 17100000,
  "auto_correct": "",
  "auto_correct_type": "",
  "results": []
}

An example in Python:

import requests

headers = {
    'apikey': 'APIKEY',
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)

response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
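
The response object isn't used in the snippet above; a short follow-up (added here as a sketch) shows how the JSON fields from the sample response could be read:

# Continues from the request above; field names match the sample response shown earlier.
data = response.json()
print(data["total_results"])    # e.g. 17100000
for result in data["results"]:  # the individual search results
    print(result)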



Answer 6:


from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re

# Build the search query from user input
query = input("query>>")
query = "+".join(query.strip().split())

url = "https://www.google.co.in/search?site=&source=hp&q=" + query + "&gws_rd=ssl"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(urlopen(req).read(), "html.parser")

# Regex that keeps everything up to the "&sa=" marker in Google's redirect URLs
reg = re.compile(".*&sa=")

links = []
# Parse the result URLs: drop the leading "/url?q=" (7 chars) and the trailing "&sa="
for item in soup.find_all('h3', attrs={'class': 'r'}):
    line = reg.match(item.a['href'][7:]).group()
    links.append(line[:-4])

print(links)

This should be handy. For more, see https://github.com/goyal15rajat/Crawl-google-search.git




Answer 7:


Here is a Python script using requests and BeautifulSoup to scrape Google results.

import requests
from bs4 import BeautifulSoup

# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"

query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"

headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            item = {
                "title": title,
                "link": link
            }
            results.append(item)
    print(results)

The guide How To Scrape Google With Python goes into more detail on the code if you are interested. The repo with the code is also available.
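
To go beyond the first page of results, one option (a sketch that assumes Google's start query parameter still paginates in steps of 10 and that the div.r markup is unchanged; neither is guaranteed by the original answer) is to repeat the request with an offset:

import time
import requests
from bs4 import BeautifulSoup

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
headers = {"user-agent": USER_AGENT}
query = "hackernoon How To Scrape Google With Python".replace(' ', '+')

all_links = []
for start in range(0, 30, 10):  # offsets for the first three result pages
    url = f"https://google.com/search?q={query}&start={start}"
    resp = requests.get(url, headers=headers)
    if resp.status_code != 200:
        break
    soup = BeautifulSoup(resp.content, "html.parser")
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            all_links.append(anchors[0]['href'])
    time.sleep(2)  # pause between page requests

print(all_links)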



Source: https://stackoverflow.com/questions/7746832/scraping-and-parsing-google-search-results-using-python
