Question
I'm trying to scrape all the subcategories and pages listed on the Wikipedia category page "Category:Class-based programming languages", found at:
https://en.wikipedia.org/wiki/Category:Class-based_programming_languages
I've figured out a way to do this using URLs and the MediaWiki API's categorymembers list. The URLs would be:
- all members:
en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500
- subcategories only:
en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat
However, I can't find a way to accomplish this using Python. Can anyone help me out here?
This is for independent study and I've spent a lot of time on this, but I just can't seem to figure it out. Also, the use of BeautifulSoup is prohibited. Thank you for all the help!
Answer 1:
OK, after doing more research I was able to find an answer to my own question. Using the urllib.request and json libraries, I fetched the API URL as JSON and printed the category members from the decoded response. Here's the code I used to get the subcategories:
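import json
import urllib.request

# Request only the subcategories (cmtype=subcat); note the "?" that starts the query string.
url = "https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat"
pages = urllib.request.urlopen(url)
data = json.load(pages)
query = data['query']
category = query['categorymembers']
# Each member is a dict; print its title.
for x in category:
    print(x['title'])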
You can do the same thing for the pages in the category by using cmtype=page instead. Thanks to Nemo for trying to help me out!
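The two requests differ only in the cmtype value, so they can share one helper. This is a minimal sketch, assuming the response shape used in the code above. The names build_url, titles_from_response and fetch_titles are hypothetical, and the User-Agent string is a placeholder. The loop follows the API's cmcontinue value in case a category has more than 500 members.

```python
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"
CATEGORY = "Category:Class-based programming languages"


def build_url(cmtype, cmcontinue=None):
    """Build a categorymembers query URL; cmtype is 'subcat' or 'page'."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": CATEGORY,
        "cmtype": cmtype,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        params["cmcontinue"] = cmcontinue
    # urlencode takes care of escaping spaces and colons in the title.
    return API + "?" + urllib.parse.urlencode(params)


def titles_from_response(data):
    """Pull the member titles out of one decoded JSON response."""
    return [member["title"] for member in data["query"]["categorymembers"]]


def fetch_titles(cmtype):
    """Fetch every title of the given type, following cmcontinue if present."""
    titles, cmcontinue = [], None
    while True:
        request = urllib.request.Request(
            build_url(cmtype, cmcontinue),
            # Placeholder User-Agent (an assumption); replace with your own.
            headers={"User-Agent": "category-example/0.1 (independent study)"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        titles.extend(titles_from_response(data))
        # The API includes a 'continue' object only when more results remain.
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            return titles
```

Calling fetch_titles("subcat") and fetch_titles("page") would then return the subcategory titles and the page titles as two separate lists.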
Answer 2:
import requests
from lxml import html
wiki_page = requests.get('https://en.wikipedia.org/wiki/Category:Class-based_programming_languages')
tree = html.fromstring(wiki_page.content)
To build your intuition for how to use this, open the category page in your browser, right-click on, say, 'C++', and click 'Inspect'. The developer tools panel will highlight
<a class="CategoryTreeLabel CategoryTreeLabelNs14 CategoryTreeLabelCategory" href="/wiki/Category:C%2B%2B">C++</a>
Right-click on this element and choose 'Copy XPath'. For C++ this will give you
//*[@id="mw-subcategories"]/div/ul[1]/li/div/div[1]/a
Similarly, under the pages, for 'ActionScript' we get
//*[@id="mw-pages"]/div/div/div[1]/ul/li[1]/a
So if you're looking for all the subcategory/page names, you could do, for example
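# Select the text of every link inside a list item; /text() directly on the div returns only whitespace.
pages = tree.xpath('//*[@id="mw-pages"]//li//a/text()')
subcategories = tree.xpath('//*[@id="mw-subcategories"]//li//a/text()')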
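If lxml isn't available, the same idea can be sketched with only the standard library's html.parser. Note that this swaps XPath for a hand-written parser. The SAMPLE_HTML below is a hypothetical, heavily simplified stand-in for the real page markup, and CategoryLinkParser is an illustrative name, so treat it as a starting point rather than a drop-in scraper.

```python
from html.parser import HTMLParser

# Hypothetical, simplified stand-in for the category page markup.
SAMPLE_HTML = """
<div id="mw-subcategories"><ul>
  <li><a href="/wiki/Category:C%2B%2B">C++</a></li>
  <li><a href="/wiki/Category:Java">Java</a></li>
</ul></div>
<div id="mw-pages"><ul>
  <li><a href="/wiki/ActionScript">ActionScript</a></li>
</ul></div>
"""


class CategoryLinkParser(HTMLParser):
    """Collect link text inside the mw-subcategories and mw-pages divs."""

    SECTIONS = ("mw-subcategories", "mw-pages")

    def __init__(self):
        super().__init__()
        self.links = {name: [] for name in self.SECTIONS}
        self._section = None   # id of the section div we are inside, if any
        self._div_depth = 0    # div nesting depth within that section
        self._in_link = False  # True while between <a> and </a>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            div_id = dict(attrs).get("id")
            if self._section is None and div_id in self.SECTIONS:
                self._section, self._div_depth = div_id, 1
            elif self._section is not None:
                self._div_depth += 1
        elif tag == "a" and self._section is not None:
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif tag == "div" and self._section is not None:
            self._div_depth -= 1
            if self._div_depth == 0:
                self._section = None  # left the section div

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a link in a section.
        if self._in_link and self._section is not None and data.strip():
            self.links[self._section].append(data.strip())


parser = CategoryLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.links["mw-subcategories"])
print(parser.links["mw-pages"])
```

On the real page you would feed it the downloaded HTML text instead of SAMPLE_HTML. It collects every link inside the two divs, so navigation links such as 'next page' may need filtering out.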
Source: https://stackoverflow.com/questions/42495405/how-to-scrape-subcategories-and-pages-in-categories-of-a-category-wikipedia-page