Parsing a Wikipedia dump

生来不讨喜 2020-12-03 05:33

For example, using this Wikipedia dump:

http://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=lebron%20james&rvprop=content&redirects=t

9 Answers
  • 2020-12-03 05:51

    WikiExtractor appears to be a clean, simple, and efficient way to do this in Python today: https://github.com/attardi/wikiextractor

    It provides an easy way to parse a Wikipedia dump into a simple file structure like so:

    <doc>...</doc>
    <doc>...</doc>
    ...
    <doc>...</doc>
    

    ...where each doc looks like:

    <doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
    Harmonium.
    L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
    Sono stati costruiti anche alcuni harmonium con due manuali.
    ...
    </doc>
    
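As a rough sketch of how that output could be consumed afterwards: the file has no single root element, so a regular expression is used here instead of an XML parser. The names `DOC_RE` and `iter_docs` are made up for this example, and the attribute layout is assumed to match the sample above.

```python
import re

# Matches one <doc ...>...</doc> block. The id/url attribute order is an
# assumption taken from the sample output shown above.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)"[^>]*>(?P<body>.*?)</doc>',
    re.DOTALL,
)


def iter_docs(text):
    """Yield one dict per <doc> block found in the extractor output."""
    for match in DOC_RE.finditer(text):
        yield {
            "id": match.group("id"),
            "url": match.group("url"),
            "text": match.group("body").strip(),
        }


# Hand-written sample in the same shape as the output above.
sample = '''<doc id="2" url="http://it.wikipedia.org/wiki/Harmonium">
Harmonium.
L'harmonium è uno strumento musicale azionato con una tastiera, detta manuale.
</doc>
'''

docs = list(iter_docs(sample))
print(docs[0]["id"], docs[0]["url"])
```

For real output you would read each extracted file and pass its contents to `iter_docs`.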
  • 2020-12-03 05:51

    There's some information on Python and XML libraries here.

    If you're asking whether there is an existing library designed specifically to parse Wiki(pedia) XML and match your requirements, that is doubtful. However, you can use one of the existing XML libraries to traverse the DOM and pull out the data you need.

    Another option is to write an XSLT stylesheet that does something similar and call it using lxml. This also lets you call Python functions from inside the XSLT, so you get the best of both worlds.
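A minimal sketch of the first option (traversing the tree with an existing XML library), using the standard library's `xml.etree.ElementTree` rather than lxml or XSLT. The helper names `local_name` and `iter_pages` are made up for this example, and the sample document is hand-written; its `<page>`/`<title>`/`<revision>`/`<text>` layout and namespace URI are assumptions about the dump format, so check them against your actual file.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Stream (title, wikitext) pairs from a dump file path or file object."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = text = None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the finished page to keep memory use low


# Hand-written sample; element names and namespace are assumptions.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Harmonium</title>
    <revision>
      <text>A '''harmonium''' is a [[keyboard instrument]].</text>
    </revision>
  </page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
print(pages)
```

`iterparse` is used instead of loading the whole tree because full dumps are usually far too large to hold in memory at once.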

  • 2020-12-03 05:52

    I know this is an old question, but here is a great script that reads the wiki dump XML and outputs a very nice CSV:

    PyPI: https://pypi.org/project/wiki-dump-parser/

    GitHub: https://github.com/Grasia/wiki-scripts/tree/master/wiki_dump_parser

  • 2020-12-03 05:55

    You're probably looking for Pywikipediabot, which is built for working with the Wikipedia API.

  • 2020-12-03 06:01

    It looks like you really want to be able to parse MediaWiki markup. There is a Python library designed for this purpose called mwlib. You can use Python's built-in XML packages to extract the page content from the API's response, then pass that content into mwlib's parser to produce an object representation that you can browse and analyse in code to extract the information you want. mwlib is BSD licensed.
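The first half of that approach, pulling the wikitext out of the API response with the standard library, might look roughly like this. The helper names `build_query_url` and `extract_wikitext` are made up for this example. The sample response is hand-written, and its `<page>`/`<rev>` nesting is an assumption about the API's XML output, so verify it against a real response. The returned string is what you would then hand to a markup parser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def build_query_url(title):
    """Build a revisions query like the one in the question, asking for XML."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "content",
        "redirects": "1",
        "format": "xml",
    }
    return API_URL + "?" + urlencode(params)


def extract_wikitext(xml_text):
    """Map each page title to the wikitext of its returned revision."""
    root = ET.fromstring(xml_text)
    return {
        page.get("title"): (rev.text or "")
        for page in root.iter("page")
        for rev in page.iter("rev")
    }


# Hand-written sample; the element layout is an assumption.
SAMPLE = (
    '<api><query><pages><page pageid="1" ns="0" title="LeBron James">'
    "<revisions><rev>'''LeBron James''' is a [[basketball]] player.</rev>"
    "</revisions></page></pages></query></api>"
)

url = build_query_url("lebron james")
content = extract_wikitext(SAMPLE)
print(url)
print(content["LeBron James"])
```

Fetching `url` (for example with `urllib.request.urlopen`) is left out so the sketch runs without network access.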

  • 2020-12-03 06:01

    I described how to do this using a combination of pywikibot and mwparserfromhell in this post (don't have enough reputation yet to flag as a duplicate).

    In [1]: import mwparserfromhell
    
    In [2]: import pywikibot
    
    In [3]: enwp = pywikibot.Site('en','wikipedia')
    
    In [4]: page = pywikibot.Page(enwp, 'Waking Life')            
    
    In [5]: wikitext = page.get()               
    
    In [6]: wikicode = mwparserfromhell.parse(wikitext)
    
    In [7]: templates = wikicode.filter_templates()
    
    In [8]: templates?
    Type:       list
    String Form:[u'{{Use mdy dates|date=September 2012}}', u"{{Infobox film\n| name           = Waking Life\n| im <...> critic film|waking-life|Waking Life}}', u'{{Richard Linklater}}', u'{{DEFAULTSORT:Waking Life}}']
    Length:     31
    Docstring:
    list() -> new empty list
    list(iterable) -> new list initialized from iterable's items
    
    In [10]: templates[:2]
    Out[10]: 
    [u'{{Use mdy dates|date=September 2012}}',
     u"{{Infobox film\n| name           = Waking Life\n| image          = Waking-Life-Poster.jpg\n| image_size     = 220px\n| alt            =\n| caption        = Theatrical release poster\n| director       = [[Richard Linklater]]\n| producer       = [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West\n| writer         = Richard Linklater\n| starring       = [[Wiley Wiggins]]\n| music          = Glover Gill\n| cinematography = Richard Linklater<br />[[Tommy Pallotta]]\n| editing        = Sandra Adair\n| studio         = [[Thousand Words]]\n| distributor    = [[Fox Searchlight Pictures]]\n| released       = {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}\n| runtime        = 101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>\n| country        = United States\n| language       = English\n| budget         =\n| gross          = $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>\n}}"]
    
    In [11]: infobox_film = templates[1]
    
    In [12]: for param in infobox_film.params:
                 print param.name, param.value
    
     name             Waking Life
    
     image            Waking-Life-Poster.jpg
    
     image_size       220px
    
     alt             
    
     caption          Theatrical release poster
    
     director         [[Richard Linklater]]
    
     producer         [[Tommy Pallotta]]<br />[[Jonah Smith]]<br />Anne Walker-McBay<br />Palmer West
    
     writer           Richard Linklater
    
     starring         [[Wiley Wiggins]]
    
     music            Glover Gill
    
     cinematography   Richard Linklater<br />[[Tommy Pallotta]]
    
     editing          Sandra Adair
    
     studio           [[Thousand Words]]
    
     distributor      [[Fox Searchlight Pictures]]
    
     released         {{Film date|2001|01|23|[[Sundance Film Festival|Sundance]]|2001|10|19|United States}}
    
     runtime          101 minutes<!--Theatrical runtime: 100:40--><ref>{{cite web |title=''WAKING LIFE'' (15) |url=http://www.bbfc.co.uk/releases/waking-life-2002-3|work=[[British Board of Film Classification]]|date=September 19, 2001|accessdate=May 6, 2013}}</ref>
    
     country          United States
    
     language         English
    
     budget          
    
     gross            $3,176,880<ref>{{cite web|title=''Waking Life'' (2001)|work=[[Box Office Mojo]] |url=http://www.boxofficemojo.com/movies/?id=wakinglife.htm|accessdate=March 20, 2010}}</ref>
    

    Don't forget that params are mwparserfromhell objects too!
