How do I grab just the parsed Infobox of a Wikipedia article?


I'm still stuck on my problem of trying to parse articles from Wikipedia. Actually I wish to parse the infobox section of articles from Wikipedia, i.e. my application has re…

8 Answers
  • 2020-12-16 04:07

    I'd use the Wikipedia (MediaWiki) API. You can get data back in JSON, XML, PHP-native format, and others. You'll then still need to parse the returned information to extract and format the info you want, but the infobox start, stop, and information types are clear.

    Run your query with just rvsection=0, as this first section gets you the material before the first section break, including the infobox. Then you'll need to parse the infobox content, which shouldn't be too hard. See en.wikipedia.org/w/api.php for the formal Wikipedia API documentation, and www.mediawiki.org/wiki/API for the manual.

    Run, for example, the query: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0
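
    Building on that, here is a minimal sketch in PHP (the function name is my own invention; I use format=json rather than xmlfm, since xmlfm is just the pretty-printed debugging output) that requests section 0 and pulls out the {{Infobox ...}} template by counting brace pairs:

        // Fetch the raw wikitext of section 0 and extract the infobox template.
        // Sketch only: triple braces and stray template syntax can confuse the
        // simple brace counter below.
        function fetchInfoboxWikitext($title) {
            $url = 'https://en.wikipedia.org/w/api.php?action=query&prop=revisions'
                 . '&rvprop=content&format=json&rvsection=0&titles=' . urlencode($title);
            $data = json_decode(file_get_contents($url), true);
            $page = current($data['query']['pages']);
            $wikitext = $page['revisions'][0]['*'];

            // Locate the start of the infobox, then walk forward matching {{ }} pairs.
            $start = stripos($wikitext, '{{Infobox');
            if ($start === false) return null;
            $depth = 0;
            for ($i = $start; $i < strlen($wikitext) - 1; $i++) {
                if (substr($wikitext, $i, 2) === '{{') { $depth++; $i++; }
                elseif (substr($wikitext, $i, 2) === '}}') {
                    $depth--; $i++;
                    if ($depth === 0) return substr($wikitext, $start, $i - $start + 1);
                }
            }
            return null; // unbalanced braces
        }

        echo fetchInfoboxWikitext('Fortran');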

  • 2020-12-16 04:07

    If you want to parse all the articles in one pass, Wikipedia makes complete database dumps available in XML format:

    http://en.wikipedia.org/wiki/Wikipedia_database

    Otherwise you can screen-scrape individual articles.
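
    If you go the dump route, PHP's built-in XMLReader can stream a multi-gigabyte dump without loading it into memory. A rough sketch (the dump filename is an assumption; use whichever dump you download) that prints every page title:

        // Stream the dump and print each <title> element as it goes by.
        $reader = new XMLReader();
        $reader->open('enwiki-latest-pages-articles.xml'); // assumed local filename
        while ($reader->read()) {
            if ($reader->nodeType === XMLReader::ELEMENT && $reader->localName === 'title') {
                echo $reader->readString(), "\n";
            }
        }
        $reader->close();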

  • 2020-12-16 04:14

    I suggest performing a WebRequest against Wikipedia. From there you will have the page, and you can simply parse or query out the data that you need using a regex, a character crawl, or some other form that you are familiar with. Essentially a screen scrape!

    EDIT - I would add to this answer that you can use HtmlAgilityPack for those in C# land. For PHP it looks like SimpleHtmlDom. Having said that, it looks like Wikipedia has a more-than-adequate API. This question probably answers your needs best:

    Is there a Wikipedia API?
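
    Since the PHP code elsewhere in this thread uses a regex, here is a sketch of the same screen scrape with PHP's built-in DOMDocument instead, which is less fragile. It assumes the infobox is rendered as a table with class "infobox", which is the current Wikipedia convention:

        // Grab the rendered article and pull the first table whose class
        // contains "infobox".
        $html = file_get_contents('https://en.wikipedia.org/wiki/Fortran');
        $doc = new DOMDocument();
        @$doc->loadHTML($html); // suppress warnings about Wikipedia's HTML5 tags
        $xpath = new DOMXPath($doc);
        $infobox = $xpath->query('//table[contains(@class, "infobox")]')->item(0);
        echo $infobox ? $doc->saveHTML($infobox) : "No infobox found\n";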

  • 2020-12-16 04:15

    It depends what route you want to go. Here are some possibilities:

    1. Install MediaWiki with appropriate modifications. It is, after all, a PHP app designed precisely to parse wikitext...
    2. Download the static HTML version, and parse out the parts you want.
    3. Use the Wikipedia API with appropriate caching (see the sketch below).

    DO NOT just hit the latest version of the live page and redo the parsing every time your app wants the box. This is a huge waste of resources for both you and Wikimedia.
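
    For option 3, a minimal caching sketch (the cache location and one-day TTL are arbitrary assumptions) that keeps API responses on disk so repeated requests don't hit Wikimedia:

        // Return the body for $url, re-fetching at most once per $ttl seconds.
        function cachedApiFetch($url, $ttl = 86400) {
            $cacheFile = sys_get_temp_dir() . '/wiki_' . md5($url) . '.json';
            if (file_exists($cacheFile) && time() - filemtime($cacheFile) < $ttl) {
                return file_get_contents($cacheFile);
            }
            $body = file_get_contents($url);
            file_put_contents($cacheFile, $body);
            return $body;
        }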

  • 2020-12-16 04:18

    I suggest you use DBpedia instead, which has already done the work of turning the data in Wikipedia into usable, linkable, open forms.
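
    DBpedia serves each article's infobox-derived facts as Linked Data; for instance, the resource behind en.wikipedia.org/wiki/Fortran is available as JSON. A quick sketch (the property listing is illustrative, and the response layout described here is an assumption worth verifying against the live endpoint):

        // Fetch the DBpedia resource for Fortran and list one value per property.
        $json = file_get_contents('https://dbpedia.org/data/Fortran.json');
        $data = json_decode($json, true);
        // Top-level keys are subject URIs; each maps property URIs to value lists.
        $resource = $data['http://dbpedia.org/resource/Fortran'];
        foreach ($resource as $property => $values) {
            echo $property, ' => ', $values[0]['value'], "\n";
        }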

  • 2020-12-16 04:23

    To load the parsed first section, simply add this parameter to the end of the API URL:

        rvparse
    

    Like this: http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=xmlfm&titles=fortran&rvsection=0&rvparse

    Then parse the HTML to get the infobox table (using a regex):

        // Request section 0 of the article, rendered to HTML by rvparse.
        $url = "http://en.wikipedia.org/w/api.php?action=query&prop=revisions&rvprop=content&format=json&titles=Niger&rvsection=0&rvparse";
        $data = json_decode(file_get_contents($url), true);
        // The pages array is keyed by page ID; current() grabs its single entry.
        $data = current($data['query']['pages']);
        // Greedy match from the first <table> to the last </table>, so tables
        // nested inside the infobox do not end the match early.
        $regex = '#<\s*?table\b[^>]*>(.*)</table\b[^>]*>#s';
        preg_match($regex, $data["revisions"][0]['*'], $matches);
        echo($matches[0]);
    