Question
I want to get the names of all articles under a category and its subcategories.
Options I'm aware of:
- Using the Wikipedia API. Does it have such an option?
- Downloading a dump. Which format would be better for my use case?
- Searching Wikipedia with something like incategory:"music", but I didn't see an option to view the results as XML.
Please share your thoughts.
Answer 1:
The following resource will help you to download all pages from the category and all its subcategories:
http://en.wikipedia.org/wiki/Wikipedia:CatScan
There is also an API available here:
https://www.mediawiki.org/wiki/API:Categorymembers
Answer 2:
You can do this through the following two API methods:
To get the members of this category (add cmtype=page to restrict the results to pages):
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtitle=Category:Music
To get the subcategories:
YOUR_URL/api.php?action=query&format=json&list=categorymembers&cmtype=subcat&cmtitle=Category:Music
You can find more information in the MediaWiki API documentation.
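As a rough illustration of the two queries above, here is a minimal Python sketch that issues a list=categorymembers request and follows the API's continuation values until every member has been returned. The endpoint URL (English Wikipedia), the User-Agent string, and the helper names `http_fetch` and `category_members` are assumptions made for this example, not part of the original answer.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the English Wikipedia endpoint; replace with YOUR_URL/api.php.
API_URL = "https://en.wikipedia.org/w/api.php"


def http_fetch(params):
    """Send one GET request to the API and return the decoded JSON."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    request = urllib.request.Request(
        url, headers={"User-Agent": "category-lister-example/0.1"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def category_members(category, member_type="page", fetch=http_fetch):
    """Yield the titles of all members of `category`.

    `member_type` is passed as cmtype ("page", "subcat" or "file").
    `fetch` is injectable so the paging logic can be tried offline.
    """
    params = {
        "action": "query",
        "format": "json",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": member_type,
        "cmlimit": "max",
    }
    while True:
        data = fetch(params)
        for member in data["query"]["categorymembers"]:
            yield member["title"]
        if "continue" not in data:
            break
        # Merge the continuation values (e.g. cmcontinue) into the next request.
        params = {**params, **data["continue"]}
```

Calling `list(category_members("Category:Music"))` would return the article titles, and `list(category_members("Category:Music", member_type="subcat"))` the subcategory titles.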
Answer 3:
Note that Wikipedia's categorization system is not a tree, or even an acyclic graph. It is quite possible that by continually following subcategory links you will eventually wind up back where you started.
If you are going to be making many such queries, you would be best served by downloading a database dump. If you will do this only occasionally and only for small categories, you can probably get away with making repeated queries to list=categorymembers.
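Because the category graph can contain cycles, a traversal built on repeated queries needs to remember which categories it has already visited. The following is a minimal sketch of one way to do that, not a definitive implementation. `collect_articles` and the `list_members` callable are hypothetical names: `list_members(category)` is assumed to return a pair of (article titles, subcategory titles) for one category, for example by issuing two list=categorymembers requests.

```python
from collections import deque


def collect_articles(root, list_members, max_depth=None):
    """Breadth-first walk of a category graph that may contain cycles.

    `list_members(category)` is a hypothetical callable returning
    (article_titles, subcategory_titles) for a single category.
    `max_depth` optionally limits how many subcategory levels are followed.
    """
    seen = {root}          # categories already queued, to break cycles
    articles = set()       # a set also removes articles listed twice
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        pages, subcats = list_members(category)
        articles.update(pages)
        if max_depth is not None and depth >= max_depth:
            continue       # do not descend any further from here
        for sub in subcats:
            if sub not in seen:
                seen.add(sub)
                queue.append((sub, depth + 1))
    return articles
```

Each category is requested at most once, so the walk terminates even when a subcategory links back to one of its ancestors. A depth limit is worth considering in practice, since following subcategories far enough can pull in articles only loosely related to the starting category.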
incategory:"music" does not appear to search subcategories.
Source: https://stackoverflow.com/questions/5771745/how-to-get-all-article-pages-under-a-wikipedia-category-and-its-sub-categories