Wikipedia revision history using pywikibot

Submitted by 孤者浪人 on 2021-02-19 11:56:17

Question


I want to collect the full revision history of a page at once. Pywikibot's page.revisions() has no parameter for fetching the number of bytes changed; it returns all the data I need except that.

How do I get the number of bytes changed?

For example, the revision history of the article Main Page is shown in the history screenshot.

My current code:

import pywikibot

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "Main_Page")
revs = page.revisions()

Only the first entry of the output is shown:

first entry:  {'revid': 969106986, '_text': None, 'timestamp': Timestamp(2020, 7, 23, 12, 44, 21), 'user': 'The Blade of the Northern Lights', 'anon': False, 'comment': 'OK, there we go.', 'minor': False, 'rollbacktoken': None, '_parent_id': 969106918, '_content_model': None, '_sha1': 'eb9e0167aabe4145be44305b3775837a37683119', 'slots': {'main': {'contentmodel': 'wikitext'}}}

I need the number of bytes changed, which is shown as {+1, -1, +1, -2} on the revision history page and can also be seen in the history screenshot above.


Answer 1:


Pywikibot uses the MediaWiki API to fetch revisions.

The API does not provide the size change of a revision.

Instead, the API offers a size option for the rvprop parameter, which returns the size of each revision in bytes. Size changes can easily be calculated from consecutive sizes.

Unfortunately, Pywikibot does not fetch size for revisions.

You can file a bug report with the Pywikibot team.

You can use the PropertyGenerator class directly to get revisions with the desired properties:

from pywikibot import Site
from pywikibot.data.api import PropertyGenerator
site = Site("en", "wikipedia")
revs = next(iter(PropertyGenerator('revisions', site=site, parameters={
    'titles': 'Main Page',
    'rvprop': 'timestamp|size',
})))['revisions']

print(len(revs))
for rev in revs[:5]:
    print(rev)

The above code will print:

4239
{'timestamp': '2020-07-23T12:44:21Z', 'size': 3500}
{'timestamp': '2020-07-23T12:43:46Z', 'size': 3499}
{'timestamp': '2020-07-23T12:43:31Z', 'size': 3500}
{'timestamp': '2020-06-30T07:05:28Z', 'size': 3499}
{'timestamp': '2020-06-22T13:37:29Z', 'size': 3501}
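The sizes printed above can be turned into per-revision byte changes by subtracting each revision's size from the size of the next older one. The following is a minimal sketch, not part of Pywikibot: the helper name size_deltas is hypothetical, and it assumes the list is ordered newest first (as in the output above) with adjacent entries being consecutive revisions.

```python
def size_deltas(revs):
    """Yield (timestamp, size, delta) for revisions ordered newest first.

    delta is the size minus the size of the next older revision in the
    list. It is None for the last entry, because its parent revision is
    not part of the list.
    """
    revs = list(revs)
    for i, rev in enumerate(revs):
        if i + 1 < len(revs):
            delta = rev['size'] - revs[i + 1]['size']
        else:
            delta = None
        yield rev['timestamp'], rev['size'], delta


# Sample data copied from the output above.
sample = [
    {'timestamp': '2020-07-23T12:44:21Z', 'size': 3500},
    {'timestamp': '2020-07-23T12:43:46Z', 'size': 3499},
    {'timestamp': '2020-07-23T12:43:31Z', 'size': 3500},
    {'timestamp': '2020-06-30T07:05:28Z', 'size': 3499},
    {'timestamp': '2020-06-22T13:37:29Z', 'size': 3501},
]

for timestamp, size, delta in size_deltas(sample):
    change = 'n/a' if delta is None else f'{delta:+d}'
    print(timestamp, size, change)
```

For this sample the changes come out as +1, -1, +1, -2, which matches the values quoted in the question. If the list holds the complete history, the oldest revision has no parent, so its change is simply its own size.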

Old answer: as noted in the comments, this method does not handle API continuations and therefore is not recommended if you need all revisions of a page.

import pywikibot
from pywikibot.data.api import Request
site = pywikibot.Site("en", "wikipedia")
r = Request(site, parameters={
    'action': 'query',
    'titles': 'Main Page',
    'prop': 'revisions',
    'rvprop': 'timestamp|size',
    'rvlimit': 5,
}).submit()
pages = r['query']['pages']
for page_id, page_info in pages.items():
    for rev in page_info['revisions']:
        print(rev)

The above code will print:

{'timestamp': '2020-07-23T12:44:21Z', 'size': 3500}
{'timestamp': '2020-07-23T12:43:46Z', 'size': 3499}
{'timestamp': '2020-07-23T12:43:31Z', 'size': 3500}
{'timestamp': '2020-06-30T07:05:28Z', 'size': 3499}
{'timestamp': '2020-06-22T13:37:29Z', 'size': 3501}
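The continuation issue mentioned for this older approach arises because the MediaWiki API returns results in batches: a response that is not the last one carries a 'continue' object whose keys must be sent back with the next request. A rough sketch of such a loop follows. The names iter_all_revisions, fetch and fake_fetch are hypothetical; fetch stands for any callable that takes a parameter dict and returns the decoded response, and fake_fetch only simulates two batches so the loop can be run without network access.

```python
def iter_all_revisions(fetch, params):
    """Yield revision dicts from every batch of a prop=revisions query.

    fetch is a hypothetical callable: it takes the request parameters
    as a dict and returns the decoded API response as a dict.
    """
    request = dict(params)
    while True:
        response = fetch(request)
        pages = response.get('query', {}).get('pages', {})
        for page_info in pages.values():
            yield from page_info.get('revisions', [])
        if 'continue' not in response:
            break  # no more batches
        # Resend the original parameters plus the continuation values.
        request = {**params, **response['continue']}


def fake_fetch(request):
    """Stand-in for a real API call; serves two fixed batches."""
    if 'rvcontinue' not in request:
        return {
            'continue': {'rvcontinue': 'token-1', 'continue': '||'},
            'query': {'pages': {'1': {'revisions': [
                {'size': 3500}, {'size': 3499},
            ]}}},
        }
    return {'query': {'pages': {'1': {'revisions': [{'size': 3500}]}}}}


sizes = [rev['size']
         for rev in iter_all_revisions(fake_fetch, {'action': 'query'})]
print(sizes)
```

With Pywikibot, fetch could presumably be something like lambda p: Request(site, parameters=p).submit(), mirroring the call above, though that combination is untested here.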



Answer 2:


See https://phabricator.wikimedia.org/T259428.

The patch was merged into the master branch and will be deployed in release 5.2.0 via PyPI.




Answer 3:


There is a better way than AXO's proposal:

import pywikibot
site = pywikibot.Site('wikipedia:en')
page = pywikibot.Page(site, 'Main Page')
for rev in page.revisions(total=5):
    # do whatever you want with the Revision object rev
    print(dict(timestamp=str(rev.timestamp), size=rev.size))

The code will print as expected:

{'timestamp': '2021-02-03T11:11:30Z', 'size': 3508}
{'timestamp': '2021-02-03T11:03:39Z', 'size': 3480}
{'timestamp': '2020-11-10T08:18:07Z', 'size': 3508}
{'timestamp': '2020-11-10T02:32:23Z', 'size': 4890}
{'timestamp': '2020-11-10T00:46:58Z', 'size': 4880}


Source: https://stackoverflow.com/questions/63213660/wikipedia-revision-history-using-pywikibot
