How to get page id from wikipedia page title

Submitted by 这一生的挚爱 on 2020-03-23 13:03:20

Question


I am trying to find the Wikipedia page IDs for a list of pages. The format is:

input: list of Wikipedia page titles

output: list of Wikipedia page IDs

So far, I've gone through the MediaWiki API documentation to understand how to proceed, but couldn't find a correct way to implement the function. Can anyone suggest how to get the list of page IDs?


Answer 1:


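Query basic page information with action=query and prop=info. The response groups the pages under query.pages, keyed by page ID:

import requests

page_titles = ['A', 'B', 'C', 'D']

# Multiple titles are passed in a single 'titles' parameter, separated by '|'.
url = (
    'https://en.wikipedia.org/w/api.php'
    '?action=query'
    '&prop=info'
    '&inprop=subjectid'
    '&titles=' + '|'.join(page_titles) +
    '&format=json')
json_response = requests.get(url).json()

# Each key of 'pages' is a page ID (as a string); each value holds the page info.
title_to_page_id = {
    page_info['title']: page_id
    for page_id, page_info in json_response['query']['pages'].items()}

print(title_to_page_id)
# Look the IDs up in the same order as the input titles.
print([title_to_page_id[title] for title in page_titles])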

This will print:

{'A': '290', 'B': '34635826', 'C': '5200013', 'D': '8123'}
['290', '34635826', '5200013', '8123']

If you have many titles, you have to split them across multiple requests, because the API accepts at most 50 titles per query (500 for clients with higher limits, such as bots).
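The batching described above could be sketched roughly as follows. The helper names chunk_titles and build_query_params are made up for this illustration, and the batch size of 50 assumes an ordinary (non-bot) client. The sketch only prepares the batches and request parameters; the commented line shows where each request would be sent.

```python
def chunk_titles(page_titles, batch_size=50):
    """Yield successive lists of at most batch_size titles (hypothetical helper)."""
    for start in range(0, len(page_titles), batch_size):
        yield page_titles[start:start + batch_size]


def build_query_params(batch):
    """Build the query parameters for one batch of titles (hypothetical helper)."""
    return {
        'action': 'query',
        'prop': 'info',
        'titles': '|'.join(batch),
        'format': 'json',
    }


# Placeholder titles, only to show how the list is split.
titles = ['Title %d' % i for i in range(120)]
batches = list(chunk_titles(titles))
batch_sizes = [len(batch) for batch in batches]

# Each batch would then be fetched separately, for example:
# requests.get('https://en.wikipedia.org/w/api.php', params=build_query_params(batch))
```

The per-batch results can be merged into a single dictionary with dict.update().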




Answer 2:


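The answer provided by AXO works as long as you don't have unnormalized titles, such as the category page "Category:Computer_storage_devices", or special characters like &. The API reports such pages under their normalized titles (for example, with underscores replaced by spaces), so looking up the original title in the result fails, and an unencoded & breaks the query string.

In that case you need to URL-encode the titles and map the response back to the original titles through the normalized titles, as follows: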

import requests


def get_page_ids(page_titles):
    # Percent-encode each title so that characters such as '&' survive in the URL.
    page_titles_encoded = [requests.utils.quote(x) for x in page_titles]

    url = (
        'https://en.wikipedia.org/w/api.php'
        '?action=query'
        '&prop=info'
        '&inprop=subjectid'
        '&titles=' + '|'.join(page_titles_encoded) +
        '&format=json')
    json_response = requests.get(url).json()

    # Map each title as it appears in the response back to the requested title.
    # Titles that did not need normalization map to themselves.
    page_normalized_titles = {x: x for x in page_titles}
    if 'normalized' in json_response['query']:
        for mapping in json_response['query']['normalized']:
            page_normalized_titles[mapping['to']] = mapping['from']

    result = {}
    for page_id, page_info in json_response['query']['pages'].items():
        normalized_title = page_info['title']
        page_title = page_normalized_titles[normalized_title]
        result[page_title] = page_id

    return result


print(get_page_ids(page_titles=['Category:R&J_Records_artists', 'Category:Computer_storage_devices', 'Category:Main_topic_classifications']))

This will print:

{'Category:R&J_Records_artists': '33352333', 'Category:Computer_storage_devices': '895945', 'Category:Main_topic_classifications': '7345184'}
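The mapping step can also be separated from the HTTP request, which makes it easy to try on a saved response. Below is a minimal sketch of that idea; the function name map_titles_to_ids is hypothetical, and sample_response is a hand-written dictionary that imitates the shape of the API's default JSON output rather than a captured response. It also assumes that pages which do not exist are reported with a 'missing' key (and a negative key in 'pages'), and records None for them.

```python
def map_titles_to_ids(page_titles, json_response):
    """Map requested titles to page IDs, or None for missing pages (hypothetical helper)."""
    query = json_response.get('query', {})

    # Title as reported in the response -> title as originally requested.
    requested = {title: title for title in page_titles}
    for mapping in query.get('normalized', []):
        requested[mapping['to']] = mapping['from']

    result = {}
    for page_id, page_info in query.get('pages', {}).items():
        title = requested.get(page_info.get('title'))
        if title is None:
            continue  # not one of the titles we asked for
        if 'missing' in page_info or 'invalid' in page_info:
            result[title] = None  # the page does not exist or the title is invalid
        else:
            result[title] = page_id
    return result


# Hand-written stand-in for an API response (an assumption, not real output).
sample_response = {
    'query': {
        'normalized': [
            {'from': 'Category:Computer_storage_devices',
             'to': 'Category:Computer storage devices'},
        ],
        'pages': {
            '895945': {'pageid': 895945, 'ns': 14,
                       'title': 'Category:Computer storage devices'},
            '-1': {'ns': 0, 'title': 'No such page xyz', 'missing': ''},
        },
    },
}

ids = map_titles_to_ids(
    ['Category:Computer_storage_devices', 'No such page xyz'],
    sample_response)
```

If the request itself is sent with requests.get(url, params={...}) instead of a hand-built URL, requests takes care of encoding the titles, so the manual quoting step is no longer needed.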



Source: https://stackoverflow.com/questions/52787504/how-to-get-page-id-from-wikipedia-page-title
