Python regex for finding contents of MediaWiki markup links

Posted by 爷,独闯天下 on 2019-12-04 11:42:24

Here is an example that captures the whole link body and then splits off the target:

import re

pattern = re.compile(r"\[\[([\w \|]+)\]\]")
text = "blah blah [[Alexander of Paris|poet named Alexander]] bldfkas"
results = pattern.findall(text)

# Keep only the part before the pipe, i.e. the link target.
output = []
for link in results:
    output.append(link.split("|")[0])

# output is now ['Alexander of Paris']

Version 2 puts more into the regex. As a result, findall() now returns one tuple per link:

import re

pattern = re.compile(r"\[\[([\w ]+)(\|[\w ]+)?\]\]")
text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
results = pattern.findall(text)

# results is [('a', '|b'), ('c', '|d'), ('efg', '')]

print([link[0] for link in results])

# prints ['a', 'c', 'efg']

Version 3, if you only want the link target without the displayed title. The second group is non-capturing, so findall() returns plain strings:

import re

pattern = re.compile(r"\[\[([\w ]+)(?:\|[\w ]+)?\]\]")
text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
results = pattern.findall(text)

# results is ['a', 'c', 'efg']
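
The character class [\w ] in the versions above only accepts word characters and spaces, so a target such as "Stack Overflow (website)" would not match. One possible variation, offered as a sketch rather than a complete MediaWiki parser, is to accept any character except brackets and the pipe. The names LINK_RE and link_targets are made up for this illustration.

```python
import re

# Assumption: a link target is any run of characters other than '[', ']' and '|'.
# The optional label after the pipe is matched but not captured.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")

def link_targets(text):
    """Return the target of every [[target]] or [[target|label]] link in text."""
    return [target.strip() for target in LINK_RE.findall(text)]

print(link_targets("[[a|b]] x [[Stack Overflow (website)|SO]] y [[C++]]"))
# prints ['a', 'Stack Overflow (website)', 'C++']
```

Nested constructs (for example image links whose caption contains another link) are still not handled by this pattern.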

RegExp: \w+( \w+)+(?=]])

input

[[Alexander of Paris|poet named Alexander]]

output

poet named Alexander

input

[[Alexander of Paris]]

output

Alexander of Paris
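
For reference, the lookahead regex above could be used from Python roughly as follows. This is a sketch: the group is made non-capturing here so that findall() returns whole matches, and the brackets in the lookahead are escaped for clarity. Note that this pattern returns the displayed text when a pipe is present, and that it needs at least two space-separated words to match.

```python
import re

# Same idea as \w+( \w+)+(?=]]), with a non-capturing group so that
# findall() returns the full match instead of only the last " word".
label_re = re.compile(r"\w+(?: \w+)+(?=\]\])")

print(label_re.findall("[[Alexander of Paris|poet named Alexander]]"))
# prints ['poet named Alexander']

print(label_re.findall("[[Alexander of Paris]]"))
# prints ['Alexander of Paris']

print(label_re.findall("[[efg]]"))
# prints [] because a single word has no " word" repetition to match
```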

import re
pattern = re.compile(r"\[\[([\w ]+)(?:\||\]\])")
text = "of which [[Alexander the Great]] was somewhat like [[King Arthur|Arthur]]"
results = pattern.findall(text)
print(results)

This gives the output

['Alexander the Great', 'King Arthur']

If you are trying to get all the links from a page, of course it is much easier to use the MediaWiki API if at all possible, e.g. http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Stack_Overflow_(website).
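
As a rough sketch of the API route, the query URL above can be built with the standard library, and the link titles pulled out of the decoded JSON. The helper names links_query_url and titles_from_response are invented for this example, the format=json parameter is added here, and the response shape described in the comment is an assumption to check against the API documentation. No network request is made in this snippet.

```python
from urllib.parse import urlencode

API_ENDPOINT = "http://en.wikipedia.org/w/api.php"

def links_query_url(title, endpoint=API_ENDPOINT):
    """Build a prop=links query URL for one page title."""
    params = {"action": "query", "prop": "links", "titles": title, "format": "json"}
    return endpoint + "?" + urlencode(params)

def titles_from_response(data):
    """Collect link titles from an already decoded JSON response.

    Assumed shape: {"query": {"pages": {page_id: {"links": [{"title": ...}]}}}}
    """
    pages = data.get("query", {}).get("pages", {})
    return [link["title"] for page in pages.values() for link in page.get("links", [])]

print(links_query_url("Stack_Overflow_(website)"))
```

Long pages may return their links in several batches, so a real client would also need to follow the API's continuation mechanism.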

Note that both these methods miss links embedded in templates.
