Question
Using Wikipedia's dumps I want to build a hierarchy for its categories. I have downloaded the main dump (enwiki-latest-pages-articles) and the category SQL dump (enwiki-latest-category). But I can't find the hierarchy information.
For example, the category SQL dump has an entry for each category, but I can't find anything about how the categories relate to each other.
The other dump (latest-pages-articles) lists the parent categories for each page, but in an unordered way: it just states all the parents.
I have seen wikiprep's category hierarchy (http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)... How is that one constructed? Wikiprep lists the category ID, not its name. Is there a way to get the name for each ID?
Answer 1:
The category hierarchy information in MediaWiki is stored in the categorylinks table, so you're going to need the categorylinks dump.
You're also going to need the page (not pages-articles) dump for the page id to title mapping.
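As a rough illustration of how the two dumps fit together, here is a minimal Python sketch. It assumes the rows have already been extracted from the SQL dumps into tuples, and that the tables use the classic MediaWiki column layout (page_id, page_namespace, page_title for page; cl_from, cl_to, cl_type for categorylinks); newer dumps may differ. The function name build_category_hierarchy and the sample rows are made up for the example.

```python
from collections import defaultdict

CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for "Category:" pages


def build_category_hierarchy(page_rows, categorylinks_rows):
    """Return {parent_category_title: set of child category titles}.

    page_rows:          iterable of (page_id, page_namespace, page_title)
    categorylinks_rows: iterable of (cl_from, cl_to, cl_type)

    Assumed layout: cl_from is the page id of the member, cl_to is the
    title of the containing category, and cl_type is 'page', 'subcat'
    or 'file'.
    """
    # Map page id -> title, keeping category pages only.
    id_to_title = {
        page_id: title
        for page_id, namespace, title in page_rows
        if namespace == CATEGORY_NAMESPACE
    }

    children = defaultdict(set)
    for cl_from, cl_to, cl_type in categorylinks_rows:
        if cl_type != "subcat":
            continue  # skip ordinary articles and files
        child = id_to_title.get(cl_from)
        if child is not None:
            children[cl_to].add(child)
    return dict(children)


# Hypothetical sample rows, only to show the shape of the data.
pages = [
    (1, 14, "Science"),
    (2, 14, "Physics"),
    (3, 14, "Optics"),
    (4, 0, "Photon"),
]
links = [
    (2, "Science", "subcat"),
    (3, "Physics", "subcat"),
    (4, "Physics", "page"),
]
hierarchy = build_category_hierarchy(pages, links)
```

The same id_to_title mapping also answers the ID-to-name part of the question. Note that the result is a graph rather than a strict tree: a category can have several parents, and cycles are possible, so any traversal should track visited categories.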
Answer 2:
Loading the category links dump and the related tables to build a Wikipedia hierarchy takes a very long time (even if it is interesting).
I found a faster path that gives good results: I rely on Wikipedia's vital articles hierarchy. See, for instance, sensimark for an example use.
Source: https://stackoverflow.com/questions/17432254/wikipedia-category-hierarchy-from-dumps