How to obtain a list of titles of all Wikipedia articles

Submitted by 元气小坏坏 on 2020-05-09 19:31:56

Question


I'd like to obtain a list of the titles of all Wikipedia articles. I know there are two possible ways to get content from a Wikimedia-powered wiki: the API, or a database dump.

I'd prefer not to download the wiki dump. First, it's huge, and second, I'm not really experienced with querying databases. The problem with the API, on the other hand, is that I couldn't figure out a way to retrieve only a list of the article titles, and even if I could, it would need more than 4 million requests, which would probably get me blocked from any further requests anyway.

So my questions are:

  1. Is there a way to obtain only the titles of Wikipedia articles via the API?
  2. Is there a way to combine multiple requests/queries into one? Or do I actually have to download a Wikipedia dump?

Answer 1:


The allpages API module allows you to do just that. Its limit (when you set aplimit=max) is 500, so to query all 4.5M articles, you would need about 9000 requests.
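For illustration, here is a minimal sketch of paging through that module with the plain requests library. The action=query / list=allpages parameters and the apcontinue continuation mechanism are standard MediaWiki API features; the helper function name is just an example.

# Sketch: iterate over all article titles via list=allpages.
# Uses the requests package; parameters are the standard MediaWiki API ones.
import requests

API_URL = 'https://en.wikipedia.org/w/api.php'

def iter_article_titles():
    params = {
        'action': 'query',
        'format': 'json',
        'list': 'allpages',
        'apnamespace': 0,                  # main (article) namespace only
        'apfilterredir': 'nonredirects',   # optionally skip redirect pages
        'aplimit': 'max',                  # 500 titles per request for normal users
    }
    while True:
        data = requests.get(API_URL, params=params).json()
        for page in data['query']['allpages']:
            yield page['title']
        # The API omits the 'continue' block after the last batch.
        if 'continue' not in data:
            break
        params['apcontinue'] = data['continue']['apcontinue']

Each call returns up to 500 titles, and the apcontinue value from the response is fed back into the next request to fetch the following batch.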

But a dump is a better choice, because there are many different dumps, including all-titles-in-ns0 which, as its name suggests, contains exactly what you want (59 MB of gzipped text).
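If you go the dump route instead, the file is simply one title per line, gzipped, so it can be processed with a few lines of Python. The exact file name and URL below are an assumption based on the usual layout of dumps.wikimedia.org; check the current dump index for your language and date.

# Sketch: download and read the all-titles-in-ns0 dump.
# The URL is an assumption based on the usual dumps.wikimedia.org layout;
# verify it against the current dump index before running.
import gzip
import urllib.request

DUMP_URL = ('https://dumps.wikimedia.org/enwiki/latest/'
            'enwiki-latest-all-titles-in-ns0.gz')

urllib.request.urlretrieve(DUMP_URL, 'all-titles-in-ns0.gz')

with gzip.open('all-titles-in-ns0.gz', 'rt', encoding='utf-8') as titles:
    for line in titles:
        title = line.rstrip('\n')
        # process each title here, e.g. write it to another file or a database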




Answer 2:


Right now, according to the current statistics, the number of articles is around 5.8M. To get the list of pages I used the AllPages API, but the number of pages I get is around 14.5M, roughly three times what I was expecting, even though I restricted the query to namespace 0. Here is the sample code I am using:

# Get the list of all English Wikipedia pages (articles) in namespace 0.
from simplemediawiki import MediaWiki

wiki = MediaWiki('https://en.wikipedia.org/w/api.php')

requestObj = {
    'action': 'query',
    'list': 'allpages',
    'aplimit': 'max',      # 500 titles per request for normal users
    'apnamespace': '0',    # main (article) namespace only
}

numQueries = 0

with open("wikiListOfArticles_nonredirects.txt", "w", encoding="utf-8") as listOfPagesFile:
    while True:
        pagelist = wiki.call(requestObj)
        numQueries += 1

        # Write "pageid; title" for every page in this batch.
        for eachPage in pagelist['query']['allpages']:
            writestr = str(eachPage['pageid']) + "; " + eachPage['title'] + "\n"
            listOfPagesFile.write(writestr)

        if numQueries % 100 == 0:
            print("Done with queries --", numQueries)

        # The API omits the 'continue' block once the last batch has been returned.
        if 'continue' not in pagelist:
            break
        requestObj['apcontinue'] = pagelist['continue']['apcontinue']

The number of queries fired is around 28,900, which yields approximately 14.5M page names.

I also tried the all-titles dump mentioned in the answer above; there as well I get around 14.5M pages.

I thought this overestimate relative to the actual number of pages was due to redirects, so I added the 'nonredirects' option to the request object:

requestObj['apfilterredir'] = 'nonredirects'

After doing that I get only 112,340 pages, which is far too small compared to 5.8M.

With the above code I was expecting roughly 5.8M pages, but that doesn't seem to be the case.

Is there any other option that I should be trying to get the actual (~5.8M) set of page names?



Source: https://stackoverflow.com/questions/24474288/how-to-obtain-a-list-of-titles-of-all-wikipedia-articles
