Fetch excerpt from Wikipedia article?

孤人 提交于 2019-11-28 07:01:14

Use this link to get the unparsed intro in xml form "http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=extracts&exsentences=10&titles=Aati kalenja"

Earlier I could get the introduction of a list of topics/articles from a category in a single page by adding iframes with src like the above link.. But now chrome is throwing this error - "Refused to display document because display forbidden by X-Frame-Options." Any way through? Pls help..

I found no way of doing this through the API, so I resorted to parsing HTML, using PHP's DOM functions. This was pretty easy, something among the lines of:

$doc = new DOMDocument();
$doc->loadHTML($wikiPage);
$xpath = new DOMXpath($doc);
$nlPNodes = $xpath->query('//div[@id="bodyContent"]/p');
$nFirstP = $nlPNodes->item(0);
$sFirstP = $doc->saveXML($nFirstP);
echo $sFirstP; // echo the first paragraph of the wiki article, including <p></p>

As ARAVIND VR notes, on wikis running the MobileFrontend extension — which includes Wikipedia — you can easily get an excerpt of an article via the MediaWiki API by using the prop=extracts API query.

For example, this link will give you a short excerpt of the Stack Overflow article on Wikipedia in a JSON wrapper.

The various options to the query can be used to control the excerpt format (HTML or plain text), its maximum length (in characters and/or sentences, and optionally restricting it to the intro section of the article) and the formatting of section headings in the output. It's also possible to obtain intro extracts from more than one article in a single query.

It's possible to get only the "introduction" of the article using the API, with the parameter rvsection=0 as explained here.

Converting Wiki-text to HTML is a bit more difficult; I guess there are more complete/official methods, but this is what I ended up doing:

// remove templates (even nested)
do {
    $c = preg_replace('/[{][{][^{}]+[}][}]\n?/', '', $c, -1, $count);
} while ($count > 0);
// remove HTML comments
$c = preg_replace('/<!--(?:[^-]|-[^-]|[[[^>])+-->\n?/', '', $c);
// remove links
$c = preg_replace('/[[][[](?:[^]|]+[|])?([^]]+)[]][]]/', '$1', $c);
$c = preg_replace('/[[]http[^ ]+ ([^]]+)[]]/', '$1', $c);
// remove footnotes
$c = preg_replace('#<ref(?:[^<]|<[^/])+</ref>#', '', $c);
// remove leading and trailing spaces
$c = trim($c);
// convert bold and italic
$c = preg_replace("/'''((?:[^']|'[^']|''[^'])+)'''/", $html ? '<b>$1</b>' : '$1', $c);
$c = preg_replace("/''((?:[^']|'[^'])+)''/", $html ? '<i>$1</i>' : '$1', $c);
// add newlines
if ($html) $c = preg_replace('/(\n)/', '<br/>$1', $c);
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!