Read Wikipedia piped links


Question


I'm using Java and I want to read piped links from Wikipedia that have a specific surface form. For example, in the form [[America|US]] the surface form is "US" and the internal link is "America".

The straightforward solution is to read the XML dump of Wikipedia and find the strings that match the regular expression for a piped link. However, I am afraid I wouldn't cover all the possible forms a piped link can take. I searched and couldn't find any library that specifically gives me the piped links.
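For reference, a naive version of that regex approach in Java could look like the following sketch (the class name and pattern are illustrative only; as noted above, such a pattern misses links produced by templates and mishandles image or interwiki links):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PipedLinkScanner {
    // Naive pattern for [[target|surface]]; it does not handle nested
    // brackets, image links, interwiki prefixes, or templates.
    private static final Pattern PIPED_LINK =
            Pattern.compile("\\[\\[([^\\[\\]|]+)\\|([^\\[\\]|]+)\\]\\]");

    public static void main(String[] args) {
        String wikitext = "The [[United States|US]] is a country.";
        Matcher m = PIPED_LINK.matcher(wikitext);
        while (m.find()) {
            System.out.println("target = " + m.group(1)
                    + ", surface = " + m.group(2));
        }
    }
}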

Any suggestions?


Answer 1:


Edit

Now that I understand the question: I don't think there is a way to get all internal links with their printout value. This is simply not stored in the database (only links are), because the actual output is only created when the page is rendered.

You would have to parse the pages yourself to be sure to get all links. Of course, if you can accept getting only the subset of links available in the wikitext of each page, parsing the XML dump as you suggest would work. Note that a single regex will most likely not distinguish between piped internal links and piped interwiki links. Also beware of image links, which use pipes to separate parameters (e.g. [[Image:MyImage.jpeg|thumb|left|A caption!]]).

Here is the regex used by the MediaWiki parser:

$tc = Title::legalChars() . '#%';
# Match a link having the form [[namespace:link|alternate]]trail
$e1 = "/^([{$tc}]+)(?:\\|(.+?))?]](.*)\$/sD";
# Match cases where there is no "]]", which might still be images
$e1_img = "/^([{$tc}]+)\\|(.*)\$/sD";

However, this code is applied only after a lot of preprocessing has happened.
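To give a rough idea of the extra filtering this implies, a small Java helper could discard piped links whose target starts with an image, file or category prefix. This is only a sketch: the prefix set below is a hypothetical stand-in, and a real implementation would have to consult the namespace and interwiki tables of the wiki being parsed.

import java.util.Set;

public class LinkFilter {
    // Hypothetical prefix list; a complete implementation would read the
    // namespace and interwiki configuration of the target wiki instead.
    private static final Set<String> EXCLUDED_PREFIXES =
            Set.of("image", "file", "category");

    /** Returns true if the link target looks like a plain article link. */
    static boolean isArticleLink(String target) {
        int colon = target.indexOf(':');
        if (colon < 0) {
            return true; // no namespace or interwiki prefix at all
        }
        String prefix = target.substring(0, colon).trim().toLowerCase();
        return !EXCLUDED_PREFIXES.contains(prefix);
    }
}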

Old answer

Using an XML dump will not give you all links, as many links are produced by templates or, in some cases, even parser functions. A simpler way is to use the API:

https://en.wikipedia.org/w/api.php?action=query&titles=Stack_Overflow&prop=links&redirects

I am assuming the English Wikipedia here, but this works on any wiki; just substitute the en in the URL with your language code. The redirects directive will, quite obviously, make sure redirects are followed. In the same way, use prop=extlinks to get external links:

https://en.wikipedia.org/w/api.php?action=query&titles=Stack_Overflow&prop=extlinks&redirects
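A minimal sketch of issuing the links query from Java, using the standard HttpClient available since Java 11 (format=json and pllimit=500 are added on top of the query above so that the response is machine-readable and not cut off at the default limit of 10 links):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LinksQuery {
    public static void main(String[] args) throws Exception {
        String url = "https://en.wikipedia.org/w/api.php"
                + "?action=query&titles=Stack_Overflow&prop=links"
                + "&redirects&format=json&pllimit=500";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The JSON body contains query.pages.<pageid>.links; parse it with
        // any JSON library to extract the link titles.
        System.out.println(response.body());
    }
}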

You can grab links for multiple pages at once, either by separating their names with a pipe character, like this: Stack_Overflow|Chicago, or by using a generator, e.g. allpages (to run the query against every single page in the wiki), like this:

https://en.wikipedia.org/w/api.php?action=query&generator=allpages&prop=links

The number of results returned by the allpages generator can be raised by setting the gaplimit parameter, e.g. &gaplimit=50 to get the links for the first 50 pages. If you request bot status on the Wikipedia edition you are working with, you can get up to 5000 results per request; otherwise the maximum is 500 for most (probably all) Wikipedias.
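As an illustration, here is a sketch of driving the allpages generator from Java; fetching batches after the first one requires reading the continue object from the JSON response, which is only indicated in a comment here:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class AllPagesLinks {
    private static final String BASE =
            "https://en.wikipedia.org/w/api.php?action=query"
            + "&generator=allpages&prop=links&gaplimit=50&format=json";

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // First batch: the links of the first 50 pages of the wiki.
        HttpRequest request = HttpRequest.newBuilder(URI.create(BASE)).build();
        String body = client.send(request,
                HttpResponse.BodyHandlers.ofString()).body();
        System.out.println(body);

        // The response carries a "continue" object (gapcontinue, plcontinue,
        // ...); appending those values as parameters to the next request URL
        // fetches the following batch. Extracting them needs a JSON parser,
        // which is omitted from this sketch.
    }
}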



Source: https://stackoverflow.com/questions/27178468/read-wikipedia-piped-links
