Pull Data/Links from Google Searches using Beautiful Soup

Submitted by 别来无恙 on 2019-12-06 12:45:55

Question


Evening Folks,

I'm attempting to ask Google a question and pull all the relevant links from the corresponding search results (i.e. I search "site: Wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.).

Here's my code:

from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'

query.replace (" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes

soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

for item in soup.find_all('h3', attrs={'class' : 'r'}):
    print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

The goal here is for me to set the query variable, have Python query Google, and have Beautiful Soup pull all the "green" links, if you will.

Here is a picture of a Google results page

I only wish to pull the green links, in their full extent. What's weird is that Google's Source Code is "hidden" (a symptom of their search architecture), so Beautiful Soup can't just go and pull a href from an h3 tag. I am able to see the h3 hrefs when I Inspect Element, but not when I view source.

Here is a picture of the Inspect Element

My question is: How do I go about pulling the top 5 most relevant green links from Google via BeautifulSoup if I cannot access their Source Code, only Inspect Element?

PS: To give an idea of what I am trying to accomplish, I have found two Stack Overflow questions that are relatively close to mine:

beautiful soup extract a href from google search

How to collect data of Google Search with beautiful soup using python


Answer 1:


I got a different URL than Rob M. when I tried searching with JavaScript disabled:

https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw

To make this work with any query, first make sure the query contains no literal spaces; an unencoded space in the URL is what causes the 400: Bad Request. You can encode the query with urllib.quote_plus():

query = "Thomas Jefferson"
query = urllib.quote_plus(query)

This URL-encodes the query, turning each space into a plus sign, so the resulting URL is valid.
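
As a minimal sketch of the encoding step, assuming Python 3 (where quote_plus lives in urllib.parse rather than directly on urllib): the helper name build_search_url is hypothetical, and the URL shape, including gbv=1, is simply copied from the URL quoted above.

```python
from urllib.parse import quote_plus


def build_search_url(query, site="wikipedia.com"):
    """Build a search URL restricted to one site (hypothetical helper).

    The gbv=1 parameter is taken from the URL quoted in this answer; it is
    an assumption that Google still honors it.
    """
    # quote_plus percent-encodes reserved characters and turns spaces into '+'.
    encoded = quote_plus(query)
    return "https://www.google.com/search?q=site:{}+{}&gbv=1".format(site, encoded)


url = build_search_url("Thomas Jefferson")
print(url)
```

Because strings are immutable, the encoded value has to be captured (returned or reassigned); calling query.replace(" ", "+") on its own line leaves query unchanged.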

However, fetching this URL with urllib still fails: you get a 403: Forbidden. I got it to work by using the python-requests module like this:

import requests
import urllib
from bs4 import BeautifulSoup

query = 'Thomas Jefferson'
query = urllib.quote_plus(query)

r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
# Fetches the JavaScript-free results page (gbv=1) and parses it with html.parser.
# The site:wikipedia.com prefix restricts the results to Wikipedia links.

links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
# Finds each h3 with class "r" (a result title) and collects the href of its link.
# This collects every result on the page; slice with links[:5] to keep only the top 5.

Printing links gives:

print links
#  [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
#   u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
#   u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
#   u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
#   u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
#   u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
#   u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
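
The entries above still carry Google's tracking parameters (&sa=U&ved=...&usg=...) after the target URL, because [7:] only removes the prefix. One possible way to recover just the target is to parse the query string instead of slicing. This is a sketch, assuming each href has the /url?q=<target>&... shape shown in this answer and using Python 3's urllib.parse; extract_target and the sample hrefs are hypothetical.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the q parameter of a /url?q=... redirect href, or None.

    Hypothetical helper; it assumes the redirect format shown in this answer.
    """
    parsed = urlparse(href)
    if parsed.path != "/url":
        return None  # not a result redirect link
    # parse_qs splits on '&' and percent-decodes each value once.
    targets = parse_qs(parsed.query).get("q")
    return targets[0] if targets else None


# Made-up hrefs in the same shape as the ones scraped above.
hrefs = [
    "/url?q=http://en.wikipedia.com/wiki/Monticello&sa=U&ved=abc&usg=xyz",
    "/url?q=http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=def&usg=uvw",
    "/search?q=unrelated",
]

# Drop non-result links, then keep at most the top 5.
top_links = [t for t in map(extract_target, hrefs) if t is not None][:5]
print(top_links)
```

With the scraping loop from this answer, the hrefs would come from item.a['href'] without the [7:] slice.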



Answer 2:


This isn't going to work with the hash search (#q=site:wikipedia.com, as you have it), because that loads the results in via AJAX rather than serving you the full, parseable HTML. You should use this instead:

soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")

For reference, I disabled JavaScript and performed a Google search to get this URL structure.
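
To illustrate why the hash form comes back without results, here is a small sketch using Python 3's urllib.parse (the code in this thread is Python 2). Everything after # is a URL fragment, which the client keeps to itself and does not send to the server, so in the first URL the search terms never reach Google as part of the request.

```python
from urllib.parse import urlsplit, parse_qs

hash_url = "https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+Thomas+Jefferson"
search_url = "https://www.google.com/search?gbv=1&q=site:wikipedia.com+Thomas+Jefferson"

hash_parts = urlsplit(hash_url)
search_parts = urlsplit(search_url)

# Hash form: the search terms sit in the fragment, not in the query string.
print(hash_parts.query)     # gws_rd=ssl
print(hash_parts.fragment)  # q=site:wikipedia.com+Thomas+Jefferson

# /search form: q is a real query parameter that is sent with the request.
print(parse_qs(search_parts.query)["q"])
```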



Source: https://stackoverflow.com/questions/35589897/pull-data-links-from-google-searches-using-beautiful-soup
