Parsing a Wikipedia dump

Asked 2020-12-03 05:33

For example, using the content returned by this Wikipedia API query:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=t
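That URL hits the MediaWiki API rather than a full dump file. A minimal stdlib-only sketch of building the same query (with `format=json` added, which the real API accepts) and pulling the wikitext out of the response might look like this; the sample response below is a trimmed, hypothetical illustration of the JSON shape, not a real API reply:

```python
import json
from urllib.parse import urlencode

# Rebuild the query string from the URL above, adding format=json.
params = urlencode({
    "action": "query",
    "prop": "revisions",
    "titles": "lebron james",
    "rvprop": "content",
    "redirects": "t",
    "format": "json",
})
url = "http://en.wikipedia.org/w/api.php?" + params

# Trimmed, hypothetical example of the response structure:
sample = json.loads("""
{"query": {"pages": {"20396": {"title": "LeBron James",
  "revisions": [{"*": "{{Infobox basketball biography}} LeBron..."}]}}}}
""")

def revision_text(response):
    """Map each page title to the wikitext of its returned revision."""
    pages = response["query"]["pages"]
    return {p["title"]: p["revisions"][0]["*"] for p in pages.values()}

print(revision_text(sample)["LeBron James"])
```

Fetching `url` with `urllib.request.urlopen` and feeding the body to `json.load` would give a real response of roughly this shape.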

9 Answers
  • 2020-12-03 06:04

    I would suggest using Beautiful Soup and just fetching the Wikipedia page as HTML instead of using the API.

    I'll try and post an example.
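A short sketch of that approach, assuming Beautiful Soup 4 is installed and run here on an inline HTML snippet rather than a live download (fetching the article URL with `urllib.request` would supply the real page):

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Stand-in for the HTML you would download from the article page.
html = """
<html><body><div id="bodyContent">
  <p>LeBron Raymone James Sr. is an American basketball player.</p>
  <p>He plays in the NBA.</p>
</div></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Wikipedia article text lives inside the bodyContent div.
content = soup.find("div", id="bodyContent")
paragraphs = [p.get_text(strip=True) for p in content.find_all("p")]
print(paragraphs[0])
```

The `id="bodyContent"` selector matches Wikipedia's page layout at the time; scraping HTML is more fragile than the API, since the markup can change.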

  • 2020-12-03 06:05

    I know the question is old, but I was searching for a library that parses the Wikipedia XML dump. The suggested libraries, wikidump and mwlib, don't offer much code documentation. I then found MediaWiki-utilities, which has some documentation at http://pythonhosted.org/mediawiki-utilities/.
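Whatever library you pick, the underlying task is streaming through the dump's `<page>`/`<revision>` elements. A library-free sketch using the stdlib's `iterparse`, shown on an inline fragment (real dumps wrap everything in a MediaWiki XML namespace, omitted here for brevity):

```python
import io
import xml.etree.ElementTree as ET

# Trimmed, namespace-free stand-in for a pages-articles dump file.
dump = io.BytesIO(b"""
<mediawiki>
  <page>
    <title>LeBron James</title>
    <revision><text>'''LeBron James''' is a basketball player.</text></revision>
  </page>
</mediawiki>
""")

pages = {}
# iterparse streams the input, so a multi-GB dump never sits in memory at once.
for event, elem in ET.iterparse(dump, events=("end",)):
    if elem.tag == "page":
        title = elem.findtext("title")
        text = elem.findtext("./revision/text")
        pages[title] = text
        elem.clear()  # free the subtree we just processed

print(pages["LeBron James"])
```

With a real dump you would open the file (or the bz2 stream) instead of the `BytesIO`, and match tags with the dump's namespace prefix.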

  • 2020-12-03 06:08

    Just stumbled over a library on PyPI, wikidump, that claims to provide

    Tools to manipulate and extract data from wikipedia dumps

    I haven't used it yet, so you're on your own trying it...
