Question
I want to know how I can collect all the URLs from the page source using Beautiful Soup, visit each of them one by one in the Google search results, and then move on to the next Google index pages.
Here is the URL https://www.google.com/search?q=site%3Awww.rashmi.com&rct=j that I want to crawl, and a screenshot is here: http://www.rashmi.com/blog/wp-content/uploads/2014/11/screencapture-www-google-com-search-1433026719960.png
Here is the code I'm trying:
def getPageLinks(page):
    links = []
    for link in page.find_all('a'):
        url = link.get('href')
        if url:
            if 'www.rashmi.com/' in url:
                links.append(url)
    return links

def Links(url):
    pUrl = urlparse(url)
    return parse_qs(pUrl.query)[0]

def PagesVisit(browser, printInfo):
    pageIndex = 1
    visited = []
    time.sleep(5)
    while True:
        browser.get("https://www.google.com/search?q=site:www.rashmi.com&ei=50hqVdCqJozEogS7uoKADg" + str(pageIndex) + "&start=10&sa=N")
        pList = []
        count = 0
        pageIndex += 1
Answer 1:
Try this; it should work. The main changes are dropping the [0] index on the parse_qs() result in Links() and paginating with Google's start= offset (incremented by 10 each loop) instead of appending a page index to the URL.
import random
import time
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup

def getPageLinks(page):
    # Collect every anchor on the page whose href points to www.rashmi.com
    links = []
    for link in page.find_all('a'):
        url = link.get('href')
        if url:
            if 'www.rashmi.com/' in url:
                links.append(url)
    return links

def Links(url):
    # Return the query parameters of a URL as a dict (no [0] index here)
    pUrl = urlparse(url)
    return parse_qs(pUrl.query)

def PagesVisit(browser, printInfo):
    start = 0
    visited = []
    time.sleep(5)
    while True:
        # Page through the results using Google's start= offset (10 results per page)
        browser.get("https://www.google.com/search?q=site:www.rashmi.com&ei=V896VdiLEcPmUsK7gdAH&start=" + str(start) + "&sa=N")
        pList = []
        count = 0
        # Random sleep to make sure everything loads
        time.sleep(random.randint(1, 5))
        page = BeautifulSoup(browser.page_source, "html.parser")
        start += 10
        if start == 500:
            browser.close()
            break
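As posted, the fragment never actually calls getPageLinks or Links, and nothing invokes PagesVisit. Below is a minimal sketch of how the pieces might be wired together, reusing the functions defined above and assuming Selenium's Firefox driver; the collectAndVisit name and the use of print as the printInfo callback are assumptions, not part of the original answer.

from selenium import webdriver

def collectAndVisit(browser, printInfo):
    start = 0
    visited = []
    while True:
        browser.get("https://www.google.com/search?q=site:www.rashmi.com&start=" + str(start) + "&sa=N")
        time.sleep(random.randint(1, 5))
        page = BeautifulSoup(browser.page_source, "html.parser")
        for href in getPageLinks(page):
            # Google often wraps result links as /url?q=<target>; unwrap with Links(),
            # otherwise fall back to the href itself.
            target = Links(href).get('q', [href])[0]
            if target not in visited:
                visited.append(target)
                printInfo(target)
                browser.get(target)          # visit the result itself
                time.sleep(random.randint(1, 5))
        start += 10
        if start == 500:                     # stop after 50 result pages
            break
    browser.close()
    return visited

browser = webdriver.Firefox()
allLinks = collectAndVisit(browser, printInfo=print)

The same loop works with webdriver.Chrome(); the random sleeps give the results time to render and slow the requests down, though Google may still rate-limit or CAPTCHA automated queries.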
Source: https://stackoverflow.com/questions/30552470/how-to-collect-data-of-google-search-with-beautiful-soup-using-python