Fetch excerpt from Wikipedia article?

刺人心 2020-12-08 17:37

I've been up and down the Wikipedia API, but I can't figure out if there's a nice way to fetch the excerpt of an article (usually the first paragraph). It would

4 answers
  • 2020-12-08 17:40

    As ARAVIND VR notes, on wikis running the TextExtracts extension (which includes Wikipedia; the feature was originally part of MobileFrontend), you can easily get an excerpt of an article via the MediaWiki API by using the prop=extracts query.

    For example, this link will give you a short excerpt of the Stack Overflow article on Wikipedia in a JSON wrapper.

    The various options to the query can be used to control the excerpt format (HTML or plain text), its maximum length (in characters and/or sentences, and optionally restricting it to the intro section of the article) and the formatting of section headings in the output. It's also possible to obtain intro extracts from more than one article in a single query.
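    The options described above can be sketched in PHP roughly as follows. The helper names buildExtractUrl and parseExtracts are hypothetical, the parameters shown (exintro, explaintext, exsentences, exlimit) are the ones I believe the extracts module accepts, and the response layout assumed is the default JSON format, so treat this as an illustration rather than a reference implementation:

```php
<?php
// Hypothetical helper: builds a prop=extracts query URL for one or more titles.
function buildExtractUrl(array $titles, bool $plainText = true, ?int $sentences = null): string
{
    $params = [
        'action'  => 'query',
        'prop'    => 'extracts',
        'format'  => 'json',
        'exintro' => 1,                     // restrict to the intro section
        'exlimit' => 'max',                 // allow extracts for several titles
        'titles'  => implode('|', $titles), // titles are pipe-separated
    ];
    if ($plainText) {
        $params['explaintext'] = 1;         // plain text instead of limited HTML
    }
    if ($sentences !== null) {
        $params['exsentences'] = $sentences; // cap the length in sentences
    }
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query($params);
}

// Hypothetical helper: maps title => extract from a JSON response body.
// Assumes the default layout: query.pages.<pageid>.extract
function parseExtracts(string $json): array
{
    $data = json_decode($json, true);
    $result = [];
    foreach ($data['query']['pages'] ?? [] as $page) {
        if (isset($page['extract'])) {
            $result[$page['title']] = $page['extract'];
        }
    }
    return $result;
}
```

    The URL from buildExtractUrl(['Stack Overflow']) could then be fetched with any HTTP client and the body passed to parseExtracts.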

  • 2020-12-08 17:42

    It's possible to get only the "introduction" of the article using the API, with the parameter rvsection=0 as explained here.
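    A rough sketch of such a request, for illustration only: the helper names are hypothetical, and the response parsing assumes the legacy JSON layout in which the wikitext sits under the "*" key of the first revision.

```php
<?php
// Hypothetical helper: URL for the raw wikitext of section 0 (the intro) of one page.
function buildIntroWikitextUrl(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action'    => 'query',
        'prop'      => 'revisions',
        'rvprop'    => 'content',
        'rvsection' => 0,        // section 0 is the text before the first heading
        'format'    => 'json',
        'titles'    => $title,
    ]);
}

// Hypothetical helper: returns the wikitext of the first page in the response,
// or null if none is present. Assumes the legacy layout described above.
function extractWikitext(string $json): ?string
{
    $data = json_decode($json, true);
    foreach ($data['query']['pages'] ?? [] as $page) {
        return $page['revisions'][0]['*'] ?? null;
    }
    return null;
}
```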

    Converting Wiki-text to HTML is a bit more difficult; I guess there are more complete/official methods, but this is what I ended up doing:

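    // $c is the wikitext; $html selects HTML output instead of plain text
    // (the function name is only illustrative)
    function wikitextToExcerpt(string $c, bool $html = false): string
    {
        // remove templates (even nested): strip innermost ones until none remain
        do {
            $c = preg_replace('/[{][{][^{}]+[}][}]\n?/', '', $c, -1, $count);
        } while ($count > 0);
        // remove HTML comments
        $c = preg_replace('/<!--.*?-->\n?/s', '', $c);
        // remove links, keeping the label: [[target|label]], [[target]], [http://... label]
        $c = preg_replace('/[[][[](?:[^]|]+[|])?([^]]+)[]][]]/', '$1', $c);
        $c = preg_replace('/[[]http[^ ]+ ([^]]+)[]]/', '$1', $c);
        // remove footnotes; self-closing <ref ... /> first so the paired pattern cannot span them
        $c = preg_replace('#<ref[^>]*/>#', '', $c);
        $c = preg_replace('#<ref(?:[^<]|<[^/])+</ref>#', '', $c);
        // remove leading and trailing spaces
        $c = trim($c);
        // convert bold and italic
        $c = preg_replace("/'''((?:[^']|'[^']|''[^'])+)'''/", $html ? '<b>$1</b>' : '$1', $c);
        $c = preg_replace("/''((?:[^']|'[^'])+)''/", $html ? '<i>$1</i>' : '$1', $c);
        // add newlines
        if ($html) $c = preg_replace('/(\n)/', '<br/>$1', $c);
        return $c;
    }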
    
  • 2020-12-08 17:57

    Use this link to get the unparsed intro in XML form: "http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exsentences=10&titles=Aati kalenja"

    Earlier I could get the introductions of a list of topics/articles from a category on a single page by adding iframes with a src like the link above. But now Chrome throws this error: "Refused to display document because display forbidden by X-Frame-Options." Is there any way around this? Please help.
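    One possible way around the iframe restriction, assuming the pages can be fetched server-side instead of embedded, is to request the XML and pull the extract out with SimpleXML. The helper name parseXmlExtracts is hypothetical, and the element layout assumed here (page elements carrying a title attribute and an extract child) is what I understand format=xml to return for this query:

```php
<?php
// Hypothetical helper: returns title => extract from a format=xml response body.
function parseXmlExtracts(string $xml): array
{
    libxml_use_internal_errors(true); // collect parse errors instead of emitting warnings
    $result = [];
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return $result;               // not well-formed XML
    }
    // Assumed layout: <api><query><pages><page title="..."><extract>...</extract></page>...
    foreach ($doc->xpath('//page[extract]') as $page) {
        $result[(string) $page['title']] = (string) $page->extract;
    }
    return $result;
}
```

    The extracts for several articles could then be printed into one page directly, with no iframes involved.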

  • 2020-12-08 18:04

    I found no way of doing this through the API, so I resorted to parsing HTML, using PHP's DOM functions. This was pretty easy, something along the lines of:

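    // $wikiPage is assumed to hold the article's HTML, fetched beforehand
    libxml_use_internal_errors(true); // real-world HTML triggers parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($wikiPage);
    $xpath = new DOMXPath($doc);
    // "//p" rather than "/p": the paragraphs may be nested inside wrapper divs
    $nlPNodes = $xpath->query('//div[@id="bodyContent"]//p');
    $nFirstP = $nlPNodes->item(0);
    if ($nFirstP !== null) {
        $sFirstP = $doc->saveXML($nFirstP);
        echo $sFirstP; // echo the first paragraph of the wiki article, including <p></p>
    }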
    