How to collect data from Google Search with Beautiful Soup using Python


Question


I want to know how I can collect all the URLs from the page source using Beautiful Soup, visit each of them one by one in the Google search results, and move on to the next Google index pages.

Here is the URL I want to collect from: https://www.google.com/search?q=site%3Awww.rashmi.com&rct=j, and a screenshot is here: http://www.rashmi.com/blog/wp-content/uploads/2014/11/screencapture-www-google-com-search-1433026719960.png
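For context, the "collect URLs from the page source" part on its own can be done with requests plus Beautiful Soup. This is a minimal sketch, not the code from the question; the User-Agent header and the start parameter are assumptions about what Google expects, and Google may still block or CAPTCHA automated requests:

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0"}  # bare requests are often rejected by Google
resp = requests.get(
    "https://www.google.com/search",
    params={"q": "site:www.rashmi.com", "start": 0},  # start paginates in steps of 10
    headers=headers,
)
soup = BeautifulSoup(resp.text, "html.parser")
for a in soup.find_all("a"):
    href = a.get("href")
    if href and "www.rashmi.com" in href:
        print(href)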

Here is the code I'm trying:

import time
from urllib.parse import urlparse, parse_qs

def getPageLinks(page):
    links = []
    for link in page.find_all('a'):
        url = link.get('href')
        if url:
            if 'www.rashmi.com/' in url:
                links.append(url)
    return links

def Links(url):
    pUrl = urlparse(url)
    return parse_qs(pUrl.query)[0]

def PagesVisit(browser, printInfo):
    pageIndex = 1
    visited = []
    time.sleep(5)
    while True:
        browser.get("https://www.google.com/search?q=site:www.rashmi.com&ei=50hqVdCqJozEogS7uoKADg" + str(pageIndex) + "&start=10&sa=N")
        pList = []
        count = 0

        pageIndex += 1
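One thing to note before the answer: parse_qs returns a dict keyed by query-parameter name, so the [0] index in Links above raises a KeyError. A quick illustration (the URL is just an example):

from urllib.parse import urlparse, parse_qs

pUrl = urlparse("https://www.google.com/search?q=site:www.rashmi.com&start=10")
print(parse_qs(pUrl.query))       # {'q': ['site:www.rashmi.com'], 'start': ['10']}
print(parse_qs(pUrl.query)['q'])  # index by key name; [0] is not a key here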

Answer 1:


Try this; it should work.

import time
import random
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup

def getPageLinks(page):
    links = []
    for link in page.find_all('a'):
        url = link.get('href')
        if url:
            if 'www.rashmi.com/' in url:
                links.append(url)
    return links

def Links(url):
    # parse_qs returns a dict keyed by query-parameter name
    pUrl = urlparse(url)
    return parse_qs(pUrl.query)

def PagesVisit(browser, printInfo):
    start = 0
    visited = []
    time.sleep(5)
    while True:
        # 'start' is Google's pagination offset (10 results per page)
        browser.get("https://www.google.com/search?q=site:www.rashmi.com&ei=V896VdiLEcPmUsK7gdAH&start=" + str(start) + "&sa=N")

        pList = []
        count = 0
        # Random sleep to make sure everything loads
        time.sleep(random.randint(1, 5))
        page = BeautifulSoup(browser.page_source, "html.parser")

        start += 10
        if start == 500:
            browser.close()
            break
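The functions above expect a Selenium WebDriver instance as browser. A minimal usage sketch, assuming Selenium and a matching driver binary are installed (Firefox here, but any driver works the same way):

from selenium import webdriver

browser = webdriver.Firefox()
try:
    PagesVisit(browser, printInfo=True)
finally:
    browser.quit()  # harmless even after browser.close() inside the loop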


Source: https://stackoverflow.com/questions/30552470/how-to-collect-data-of-google-search-with-beautiful-soup-using-python
