Get authors' names and URLs for a tag from Google Scholar

Submitted by 。_饼干妹妹 on 2019-12-12 01:36:13

Question


I want to write a CSV file listing every author, together with their profile URL, who labels themselves with a specific tag on Google Scholar. For example, for the tag 'security' I would want this output:

author          url
Howon Kim       https://scholar.google.pl/citations?user=YUoJP-oAAAAJ&hl=pl
Adrian Perrig   https://scholar.google.pl/citations?user=n-Oret4AAAAJ&hl=pl
...             ...

I have written this code, which prints each author's name:

# -*- coding: utf-8 -*-
import urllib.request
import csv
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
mydivs = soup.findAll("h3", { "class" : "gsc_1usr_name"})
outputFile = open('sample.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
for each in mydivs:
    for anchor in each.find_all('a'):
        print (anchor.text)

However, this only covers the first page of results. How can I go through every page?


Answer 1:


I'm not going to write the code for you, but I'll give you an outline of how you can do it.

Look at the bottom of the page. See the next button? It sits in a div with the id gsc_authors_bottom_pag, which should be easy to find. I'd do this with Selenium: find the next button (the right-hand one) and click it, wait for the page to load, scrape, and repeat. Handle edge cases (running out of pages, etc.).

If the after_author=* part of the URL didn't change, you could just increment the start offset in the URL. But unless you want to try to crack that code (unlikely), just click the next button.




Answer 2:


This page uses <button> instead of <a> for the links to the next and previous pages.

The button for the next page has aria-label="Następna".

There are two next-page buttons, but you can use either of them.

The button has JavaScript code in its onclick attribute that redirects to the new page:

 window.location=url_to_next_page

It is plain text, so you can use slicing to extract just the URL:

import urllib.request
from bs4 import BeautifulSoup

url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"

while True:    
    page = urllib.request.urlopen(url)
    soup = BeautifulSoup(page, 'lxml')

    # ... do something on page ...

    # find buttons to next page
    buttons = soup.findAll("button", {"aria-label": "Następna"})

    # exit if no buttons
    if not buttons:
        break

    on_click = buttons[0].get('onclick')

    print('javascript:', on_click)

    # add the domain and strip the leading `window.location='` and the trailing `'`
    url = 'http://scholar.google.pl' + on_click[17:-1]
    # convert escape codes in the URL back to characters
    url = url.encode('utf-8').decode('unicode_escape')

    print('url:', url)
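
The loop above leaves the "# ... do something on page ..." step open, and the slice [17:-1] depends on the exact length of the onclick text. Below is a minimal sketch of both pieces, not a definitive implementation. It assumes the author names still sit in <h3 class="gsc_1usr_name"> as in the question, and that onclick has the window.location='...' form described above. The names AuthorLinkParser, write_authors and next_url_from_onclick are made up for illustration, and the standard-library html.parser is swapped in for BeautifulSoup so the snippet has no third-party dependency.

```python
import csv
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: same host as in the question's expected output.
BASE_URL = "https://scholar.google.pl"


class AuthorLinkParser(HTMLParser):
    """Collect (name, url) pairs from <h3 class="gsc_1usr_name"><a href=...>."""

    def __init__(self, base_url=BASE_URL):
        super().__init__()
        self.base_url = base_url
        self.authors = []          # list of (name, absolute_url) tuples
        self._in_heading = False   # True while inside a matching <h3>
        self._href = None          # href of the <a> currently being read
        self._text = []            # text fragments of the current <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3" and "gsc_1usr_name" in classes:
            self._in_heading = True
        elif tag == "a" and self._in_heading:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Also picks up text inside nested tags such as <span> within the link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            self.authors.append((name, urljoin(self.base_url, self._href)))
            self._href = None
        elif tag == "h3":
            self._in_heading = False


def write_authors(rows, out_file):
    """Write a header row and the (author, url) rows to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["author", "url"])
    writer.writerows(rows)


def next_url_from_onclick(on_click, base_url=BASE_URL):
    """Return the absolute next-page URL from an onclick value, or None."""
    if not on_click:
        return None
    match = re.search(r"window\.location\s*=\s*'([^']*)'", on_click)
    if match is None:
        return None
    # Same decoding step as in the loop above: turn escape codes into characters.
    path = match.group(1).encode("utf-8").decode("unicode_escape")
    return urljoin(base_url, path)
```

Inside the loop you could create one AuthorLinkParser per page, feed it the decoded response (for example page.read().decode('utf-8'), assuming the page is UTF-8), and extend a single list with parser.authors. After the loop, open sample.csv with newline='' as in the question and pass the list to write_authors. A None result from next_url_from_onclick can serve as the exit condition.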

BTW: if you speak Polish, you can visit Python Poland or Python: pierwsze kroki on Facebook.



Source: https://stackoverflow.com/questions/41324356/get-authors-name-and-url-for-tag-from-google-scholar
