How to scrape Subcategories and pages in categories of a Category wikipedia page using Python

Posted by 血红的双手 on 2019-12-12 15:38:18

Question


So I'm trying to scrape all the subcategories and pages under the category header of the Category page: "Category: Class-based programming languages" found at:

https://en.wikipedia.org/wiki/Category:Class-based_programming_languages

I've figured out a way to do this with URLs and the MediaWiki API's categorymembers list. The requests would be:

  • all members: en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500
  • subcategories only: en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat

However, I can't find a way to accomplish this using Python. Can anyone help me out here?

This is for independent study and I've spent a lot of time on this, but just can't seem to figure it out. Also, the use of BeautifulSoup is prohibited. Thank you for all the help!


Answer 1:


OK, so after more research and study, I was able to find an answer to my own question. Using the libraries urllib.request and json, I requested the Wikipedia API URL, loaded the JSON response, and simply printed the category members from it. Here's the code I used to get the subcategories:

import json
import urllib.request

# cmtype=subcat restricts the result to subcategories; note the "?" after api.php
pages = urllib.request.urlopen("https://en.wikipedia.org/w/api.php?action=query&list=categorymembers&cmtitle=Category:Class-based%20programming%20languages&format=json&cmlimit=500&cmtype=subcat")
data = json.load(pages)
query = data['query']
category = query['categorymembers']
for x in category:
    print(x['title'])

You can do the same thing for the pages in the category by using cmtype=page instead of cmtype=subcat. Thanks to Nemo for trying to help me out!
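As a rough sketch of how the same request could be generalised to both pages and subcategories, the snippet below builds the query string with urllib.parse and follows the API's continuation token for categories with more than 500 members. The helper names (build_url, extract_titles, fetch_members) and the User-Agent string are made up for illustration and are not part of the original answer.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://en.wikipedia.org/w/api.php"


def build_url(category, cmtype, cmcontinue=None):
    """Build a categorymembers query URL (hypothetical helper).

    cmtype is "page" for pages or "subcat" for subcategories.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": cmtype,
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        # Token returned by the previous response when more results remain.
        params["cmcontinue"] = cmcontinue
    return API_URL + "?" + urllib.parse.urlencode(params)


def extract_titles(data):
    """Pull the member titles out of a decoded API response."""
    return [member["title"] for member in data["query"]["categorymembers"]]


def fetch_members(category, cmtype):
    """Fetch all member titles, following continuation tokens (needs network)."""
    titles = []
    cmcontinue = None
    while True:
        request = urllib.request.Request(
            build_url(category, cmtype, cmcontinue),
            # Placeholder User-Agent; replace it with one describing your script.
            headers={"User-Agent": "category-members-example/0.1"},
        )
        with urllib.request.urlopen(request) as response:
            data = json.load(response)
        titles.extend(extract_titles(data))
        cmcontinue = data.get("continue", {}).get("cmcontinue")
        if not cmcontinue:
            return titles
```

Calling fetch_members("Category:Class-based programming languages", "page") should then return the page titles, and passing "subcat" the subcategory titles, assuming the API is reachable.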




Answer 2:


import requests
from lxml import html

# Download the category page and parse it into an element tree
wiki_page = requests.get('https://en.wikipedia.org/wiki/Category:Class-based_programming_languages')
tree = html.fromstring(wiki_page.content)

To build your intuition for how to use this, open the category page in your browser, right click on, say, 'C++', and click 'inspect'. The panel on the right will highlight

<a class="CategoryTreeLabel CategoryTreeLabelNs14 CategoryTreeLabelCategory" href="/wiki/Category:C%2B%2B">C++</a>

Right click on this, and click 'copy xpath'. For C++ this will give you

//*[@id="mw-subcategories"]/div/ul[1]/li/div/div[1]/a

Similarly, under the pages, for 'ActionScript' we get

//*[@id="mw-pages"]/div/div/div[1]/ul/li[1]/a

So if you're looking for all the subcategory/page names, you could do, for example

# Select the text of the links inside the list items of each section
pages = tree.xpath('//*[@id="mw-pages"]//li//a/text()')
subcategories = tree.xpath('//*[@id="mw-subcategories"]//li//a/text()')
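If installing lxml is not an option, the same idea (collecting the link text found inside the list items of the mw-pages and mw-subcategories sections) can be sketched with only the standard library's html.parser. This is a swapped-in technique rather than XPath, it assumes reasonably well-formed markup with explicit closing tags, and the names SectionLinkCollector and link_texts are made up for illustration.

```python
from html.parser import HTMLParser


class SectionLinkCollector(HTMLParser):
    """Collect the text of <a> elements inside <li> items of one section.

    The section is the element whose id attribute equals section_id.
    """

    # Tags without a closing tag; they must not affect the nesting depth.
    VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

    def __init__(self, section_id):
        super().__init__()
        self.section_id = section_id
        self.depth = 0        # nesting depth inside the section, 0 = outside
        self.item_depth = 0   # how many <li> elements we are currently inside
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
            if tag == "li":
                self.item_depth += 1
            elif tag == "a" and self.item_depth:
                self.in_link = True
                self.titles.append("")
        elif dict(attrs).get("id") == self.section_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if tag == "a":
            self.in_link = False
        elif tag == "li" and self.item_depth:
            self.item_depth -= 1

    def handle_data(self, data):
        if self.in_link:
            self.titles[-1] += data


def link_texts(html_text, section_id):
    """Return the link texts found in the list items of the given section."""
    parser = SectionLinkCollector(section_id)
    parser.feed(html_text)
    parser.close()
    return parser.titles
```

With the page source decoded to a string, link_texts(page_source, "mw-pages") and link_texts(page_source, "mw-subcategories") would play the role of the two XPath queries. Links outside list items, such as pagination links, are skipped.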

For more information see here and here



Source: https://stackoverflow.com/questions/42495405/how-to-scrape-subcategories-and-pages-in-categories-of-a-category-wikipedia-page
