I am trying to parse the first page of Google search results, specifically the title and the short summary provided for each result. Here is what I have so far:
from urllib.request import urlretrieve
import urllib.parse
from urllib.parse import urlencode, urlparse, parse_qs
import webbrowser
from bs4 import BeautifulSoup
import requests
address = 'https://google.com/#q='
# Default Google search address start
file = open( "OCR.txt", "rt" )
# Open text document that contains the question
word = file.read()
file.close()
myList = [item for item in word.split('\n')]
newString = ' '.join(myList)
# The question is on multiple lines so this joins them together with proper spacing
print(newString)
qstr = urllib.parse.quote_plus(newString)
# Encode the string
newWord = address + qstr
# Combine the base and the encoded query
print(newWord)
source = requests.get(newWord)
soup = BeautifulSoup(source.text, 'lxml')
The part I am stuck on now is going down the HTML path to parse the specific data that I want. Everything I have tried so far has just thrown an error saying that it has no attribute or it just gives back "[]".
I am new to Python and BeautifulSoup, so I am not sure of the syntax for getting to where I want. I have found that these are the individual search results in the page:
Any help on what to add to parse the Title and Summary of each search result would be MASSIVELY appreciated.
Thank you!
Your URL doesn't work for me, but with https://google.com/search?q= I get results.
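A likely reason for the difference, sketched with the standard library only: in https://google.com/#q=... everything after the # is a URL fragment, which is handled on the client side and is not part of the request sent to the server, while /search?q=... puts the text in the query string. The helper name build_search_url and the sample text are assumptions made up for this illustration.

```python
from urllib.parse import urlencode, urlsplit

# URL shape from the question: the search text sits after '#',
# so it ends up in the fragment, not in the query string.
fragment_url = 'https://google.com/#q=hello+world'
parts = urlsplit(fragment_url)
print('query:', repr(parts.query))        # empty, nothing for the server to search
print('fragment:', repr(parts.fragment))  # the search text is stuck here


def build_search_url(question):
    """Hypothetical helper: put the question in the query string of /search."""
    return 'https://google.com/search?' + urlencode({'q': question})


search_url = build_search_url('hello world')
print(search_url)
```

urlencode applies the same plus-style quoting as quote_plus, so it can replace the manual string concatenation.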
import urllib.parse  # import the submodule explicitly; a bare "import urllib" does not guarantee it is loaded
from bs4 import BeautifulSoup
import requests
import webbrowser
text = 'hello world'
text = urllib.parse.quote_plus(text)
url = 'https://google.com/search?q=' + text
response = requests.get(url)
# Optional: save the response and open it in a browser to inspect the HTML
# with open('output.html', 'wb') as f:
#     f.write(response.content)
# webbrowser.open('output.html')
soup = BeautifulSoup(response.text, 'lxml')
for g in soup.find_all(class_='g'):
    print(g.text)
    print('-----')
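To get the title and summary separately rather than the whole text of each result, one possible approach is to search inside each block again. The sketch below runs on made-up sample markup, not on a real Google page: the class 'g' comes from the code above, but the h3 tag for the title and the class 'st' for the summary are assumptions, and Google's real markup differs and changes over time. The function name extract_results is also hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a results page (assumption: title in <h3>,
# summary in an element with class "st"). Inspect the real HTML and adjust.
SAMPLE_HTML = """
<div class="g"><h3>First title</h3><span class="st">First summary.</span></div>
<div class="g"><h3>Second title</h3></div>
"""


def extract_results(html):
    """Return a list of {'title', 'summary'} dicts, one per result block."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    for g in soup.find_all(class_='g'):
        title_tag = g.find('h3')
        summary_tag = g.find(class_='st')
        # find() returns None when nothing matches; calling .text on None
        # raises AttributeError, so check before using the tag.
        if title_tag is None:
            continue
        results.append({
            'title': title_tag.get_text(strip=True),
            'summary': summary_tag.get_text(strip=True) if summary_tag else '',
        })
    return results


results = extract_results(SAMPLE_HTML)
for result in results:
    print(result['title'])
    print(result['summary'])
    print('-----')
```

With a live page, pass response.text to the function in place of SAMPLE_HTML. If the list comes back empty, the selectors probably do not match the markup that was actually returned, and saving the response to a file as in the commented lines above is a quick way to check.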
Source: https://stackoverflow.com/questions/47928608/how-to-use-beautifulsoup-to-parse-google-search-results-in-python