How do I grab just the parsed Infobox of a Wikipedia article?


I'm still stuck on my problem of trying to parse articles from Wikipedia. Actually I wish to parse the infobox section of articles from Wikipedia, i.e. my application has re…

8 Answers
  • 2020-12-16 04:07

    I'd use the Wikipedia (MediaWiki) API. You can get data back in JSON, XML, PHP-native format, and others. You'll then still need to parse the returned information to extract and format the info you want, but the infobox start, stop, and information types are clear.

    Run your query with just rvsection=0, as this first section gets you the material before the first section break, including the infobox. Then you'll need to parse the infobox content, which shouldn't be too hard. See en.wikipedia.org/w/api.php for the formal Wikipedia API documentation, and www.mediawiki.org/wiki/API for the manual.

    Run, for example, the query: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0
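
    Building on that, here is a minimal sketch in PHP (the function name is my own invention; I use format=json rather than xmlfm, since xmlfm is just the pretty-printed debugging output) that requests section 0 and pulls out the {{Infobox ...}} template by counting brace pairs:

        // Fetch the raw wikitext of section 0 and extract the infobox template.
        // Sketch only: triple braces and stray template syntax can confuse the
        // simple brace counter below.
        function fetchInfoboxWikitext($title) {
            $url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions'
                 . '&rvprop=content&format=json&rvsection=0&titles=' . urlencode($title);
            $data = json_decode(file_get_contents($url), true);
            $page = current($data['query']['pages']);
            $wikitext = $page['revisions'][0]['*'];

            // Locate the start of the infobox, then walk forward matching {{ }} pairs.
            $start = stripos($wikitext, '{{Infobox');
            if ($start === false) return null;
            $depth = 0;
            for ($i = $start; $i < strlen($wikitext) - 1; $i++) {
                if (substr($wikitext, $i, 2) === '{{') { $depth++; $i++; }
                elseif (substr($wikitext, $i, 2) === '}}') {
                    $depth--; $i++;
                    if ($depth === 0) return substr($wikitext, $start, $i - $start + 1);
                }
            }
            return null; // unbalanced braces
        }

        echo fetchInfoboxWikitext('Fortran');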

  • 2020-12-16 04:07

    If you want to parse all the articles in one pass, Wikipedia makes complete database dumps available in XML format:

    http://en.wikipedia.org/wiki/Wikipedia_database

    Otherwise you can screen-scrape individual articles.
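
    If you go the dump route, PHP's built-in XMLReader can stream a multi-gigabyte dump without loading it into memory. A rough sketch (the dump filename is an assumption; use whichever dump you download) that prints every page title:

        // Stream the dump and print each <title> element as it goes by.
        $reader = new XMLReader();
        $reader->open('enwiki-latest-pages-articles.xml'); // assumed local filename
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'title') {
                echo $reader->readString(), "\n";
            }
        }
        $reader->close();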

  • 2020-12-16 04:14

    I suggest performing a WebRequest against Wikipedia. From there you will have the page, and you can simply parse or query out the data that you need using a regex, a character crawl, or some other form that you are familiar with. Essentially a screen scrape!

    EDIT - I would add to this answer that you can use HtmlAgilityPack for those in C# land. For PHP it looks like SimpleHtmlDom. Having said that, it looks like Wikipedia has a more-than-adequate API. This question probably answers your needs best:

    Is there a Wikipedia API?
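
    Since the PHP code elsewhere in this thread uses a regex, here is a sketch of the same screen scrape with PHP's built-in DOMDocument instead, which is less fragile. It assumes the infobox is rendered as a table with class "infobox", which is the current Wikipedia convention:

        // Grab the rendered article and pull the first table whose class
        // contains "infobox".
        $html = file_get_contents('https://en.wikipedia.org/wiki/Fortran');
        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings about Wikipedia's HTML5 tags
        $xpath = new DOMXPath($doc);
        $infobox = $xpath->query('//table[contains(@class, "infobox")]')->item(0);
        echo $infobox ? $doc->saveHTML($infobox) : "No infobox found\n";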

  • 2020-12-16 04:15

    It depends what route you want to go. Here are some possibilities:

    1. Install MediaWiki with appropriate modifications. It is, after all, a PHP app designed precisely to parse wikitext...
    2. Download the static HTML version, and parse out the parts you want.
    3. Use the Wikipedia API with appropriate caching (see the sketch below).

    DO NOT just hit the latest version of the live page and redo the parsing every time your app wants the box. This is a huge waste of resources for both you and Wikimedia.
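
    For option 3, a minimal caching sketch (the cache location and one-day TTL are arbitrary assumptions) that keeps API responses on disk so repeated requests don't hit Wikimedia:

        // Return the body for $url, re-fetching at most once per $ttl seconds.
        function cachedApiFetch($url, $ttl = 86400) {
            $cacheFile = sys_get_temp_dir() . '/wiki_' . md5($url) . '.json';
            if (file_exists($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
                return file_get_contents($cacheFile);
            }
            $body = file_get_contents($url);
            file_put_contents($cacheFile, $body);
            return $body;
        }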

  • 2020-12-16 04:18

    I suggest you use DBpedia instead, which has already done the work of turning the data in Wikipedia into usable, linkable, open forms.
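
    DBpedia serves each article's infobox-derived facts as Linked Data; for instance, the resource behind en.wikipedia.org/wiki/Fortran is available as JSON. A quick sketch (the property listing is illustrative, and the response layout described here is an assumption worth verifying against the live endpoint):

        // Fetch the DBpedia resource for Fortran and list one value per property.
        $json = file_get_contents('https://dbpedia.org/data/Fortran.json');
        $data = json_decode($json, true);
        // Top-level keys are subject URIs; each maps property URIs to value lists.
        $resource = $data['http://dbpedia.org/resource/Fortran'];
        foreach ($resource as $property => $values) {
            echo $property, ' => ', $values[0]['value'], "\n";
        }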

  • 2020-12-16 04:23

    To load the parsed first section, simply add this parameter to the end of the API URL:

        rvparse
    

    Like this: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0&rvparse

    Then parse the HTML to get the infobox table (using a regex):

        // Request section 0 of the article, rendered to HTML by rvparse.
        $url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Niger&rvsection=0&rvparse";
        $data = json_decode(file_get_contents($url), true);
        // The pages array is keyed by page ID; current() grabs its single entry.
        $data = current($data['query']['pages']);
        // Greedy match from the first <table> to the last </table>, so tables
        // nested inside the infobox do not end the match early.
        $regex = '#<\s*?table\b[^>]*>(.*)</table\b[^>]*>#s';
        preg_match($regex, $data["revisions"][0]['*'], $matches);
        echo($matches[0]);
    