Removing markup links in text

Posted by 怎甘沉沦 on 2021-02-19 05:34:41

Question


I'm cleaning some text from Reddit. When you include a link in a Reddit self-text, you do so like this: [the text you read](https://website.com/to/go/to). I'd like to use regex to remove the hyperlink (e.g. https://website.com/to/go/to) but keep the text you read.

Here is another example:

[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)

I'd like to keep: the podcast list.

How can I do this with Python's re library? What is the appropriate regex?


Answer 1:


I have created an initial attempt at your requested regex:

(?<=\[.+\])\(.+\)

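The first part, (?<=...), is a look-behind: it requires the bracketed text to come immediately before the match but does not include it in the match, so only the parenthesised part would be removed by a substitution. Be aware that Python's built-in re module only accepts fixed-width look-behinds, so re.sub rejects this pattern because of the .+ inside the look-behind; it needs a regex engine that supports variable-width look-behind. The documentation for the re module explains the meaning of each regex symbol.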
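As a quick check of how Python's re treats look-behinds, the sketch below tries to compile the pattern above and a fixed-width variant. The helper name compiles_in_re is hypothetical and only used for illustration.

```python
import re

def compiles_in_re(pattern):
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# Look-behind containing .+ (variable width), as in the pattern above.
print(compiles_in_re(r"(?<=\[.+\])\(.+\)"))  # False
# Look-behind that only checks for a single "]" (fixed width).
print(compiles_in_re(r"(?<=\])\(.+\)"))      # True
```

The fixed-width variant is accepted, but it no longer verifies that an opening bracket precedes the closing one.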

You can extend the above regex to look for only things that have weblinks in the brackets, like so:

(?<=\[.+\])\(https?:\/\/.+\)

The drawback is that a link that does not start with http or https will not be matched.

After this you still need to remove the square brackets. Simply removing every square bracket in the text may be good enough for you.
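The two steps described here (drop the parenthesised URL, then strip the brackets) can be sketched in a form that Python's re accepts by matching the closing bracket and putting it back, instead of using a look-behind. The function name strip_links_two_step and the sample sentence are assumptions made for illustration, and the URL part is restricted to characters other than ")" and whitespace.

```python
import re

def strip_links_two_step(text):
    # Step 1: drop "(http://...)" or "(https://...)" when it directly
    # follows "]", re-inserting the "]" consumed by the match.
    without_urls = re.sub(r"\]\(https?://[^)\s]*\)", "]", text)
    # Step 2: remove every remaining square bracket. This also strips
    # brackets that were never part of a link.
    return re.sub(r"[\[\]]", "", without_urls)

sample = "See [the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts) for more."
print(strip_links_two_step(sample))
# See the podcast list for more.
```

A link with another scheme, such as ftp://, keeps its parenthesised target, which is the limitation noted for the http-only pattern.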


Edit 1:

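Valentino pointed out that re.sub accepts capturing groups in the replacement, so you can capture the link text and substitute it back in with a single regex:

\[([^\]]+)\]\([^)]+\)

The negated character classes [^\]]+ and [^)]+ are used instead of the greedy .+ so that each match stays within one link when a line contains several. You can then substitute the first captured group (the text inside the square brackets) back in using:

re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", original_text)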
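For a runnable version of this substitution, here is a minimal sketch applied to the two example links from the question. The function name remove_markup_links and the sample sentence are assumptions, and the pattern uses negated character classes rather than .+ so that two links on one line are handled separately.

```python
import re

# "[text](target)": capture the text, discard the target.
# [^\]]+ stops at the first "]", [^)]+ stops at the first ")".
LINK_PATTERN = re.compile(r"\[([^\]]+)\]\([^)]+\)")

def remove_markup_links(text):
    """Replace each [text](target) link with just its text."""
    return LINK_PATTERN.sub(r"\1", text)

original_text = (
    "Check [the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts) "
    "and [the text you read](https://website.com/to/go/to)."
)
print(remove_markup_links(original_text))
# Check the podcast list and the text you read.
```

Because the target is not required to start with http or https, this also handles relative or other-scheme links, and text without any link is returned unchanged.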

If you are new to regex or want to examine this pattern in more detail, I recommend an online regex interpreter. These tools explain what each symbol does, which makes a pattern much easier to read, especially when it contains as many escaped symbols as this one.



Source: https://stackoverflow.com/questions/53980097/removing-markup-links-in-text
