问题
I wish to write to a CSV file a list of all authors with their URL to a CSV file who class themselves as a specific tag on Google Scholar. For example, if we were to take 'security' I would want this output:
author url
Howon Kim https://scholar.google.pl/citations?user=YUoJP-oAAAAJ&hl=pl
Adrian Perrig https://scholar.google.pl/citations?user=n-Oret4AAAAJ&hl=pl
... ...
I have written this code which prints each author's name
# -*- coding: utf-8 -*-
import urllib.request
import csv
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
mydivs = soup.findAll("h3", { "class" : "gsc_1usr_name"})
outputFile = open('sample.csv', 'w', newline='')
outputWriter = csv.writer(outputFile)
for each in mydivs:
for anchor in each.find_all('a'):
print (anchor.text)
However, this only does it for the first page. Instead, I would like to go through every page. How can I do this?
回答1:
I'm not writing the code for you.. but I'll give you an outline for how you can.
Look at the bottom of the page. See the next button? Search for it the containing div has an id
of gsc_authors_bottom_pag
which should be easy to find. I'd do this with selenium, find the next button (right) and click it. Wait for the page to load, scrape repeat. Handle edge cases (out of pages, etc).
If the after_author=*
bit didn't change in the url you could just increment the url
start.. but unless you want to try to crack that code (unlikely) then just click the next button.
回答2:
This page use <button>
instead of <a>
for link to next/previous page.
Button to next page has aria-label="Następna"
.
There are two buttons to next page but you can use any of them.
Button has JavaScript
code to redirect to new page
window.location=url_to_next_page
but it is simple text so you can use slicing to get only url
import urllib.request
from bs4 import BeautifulSoup
url = "http://scholar.google.pl/citations?view_op=search_authors&hl=pl&mauthors=label:security"
while True:
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'lxml')
# ... do something on page ...
# find buttons to next page
buttons = soup.findAll("button", {"aria-label": "Następna"})
# exit if no buttons
if not buttons:
break
on_click = buttons[0].get('onclick')
print('javascript:', on_click)
#add `domain` and remove `window.location='` and `'` at the end
url = 'http://scholar.google.pl' + on_click[17:-1]
# converting some codes to chars
url = url.encode('utf-8').decode('unicode_escape')
print('url:', url)
BTW: if you speak Polish then you can visit on Facebook: Python Poland or Python: pierwsze kroki
来源:https://stackoverflow.com/questions/41324356/get-authors-name-and-url-for-tag-from-google-scholar