Regex: separate URLs in text that has no separators

Submitted by 。_饼干妹妹 on 2020-01-30 11:28:30

Question


Apologies for yet another regex question!

I have some input text which, rather unhelpfully, contains multiple URLs (and only URLs) on a single line with no separators:

https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n

This example contains just two URLs, but there could be more.

I'm trying to separate the URLs into a list using Python.

I've searched for solutions and tried a few (for example https://stackoverflow.com/a/6883094/659346), but can't get them to work, because they greedily consume all the following URLs.

I realise that's probably because https://... could legally appear in the query part of a URL, but in my case I'm willing to assume it can't, and to treat every occurrence as the start of the next URL.

I also tried (http[s]://.*?), but with or without the ? it matches either the whole text or just the https://


Answer 1:


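You need to use a positive lookahead assertion: match lazily and stop just before the next http:// or https://, before whitespace, or at the end of the string.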

>>> import re
>>> s = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"
>>> re.findall(r'https?://.*?(?=https?://|$|\s)', s)
['https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZg', 'https://console.developers.google.com/project/reducted/?authuser=1']
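
To reuse this on inputs with more than two URLs, the pattern can be wrapped in a small function. This is only a minimal sketch under the question's own assumption that http:// or https:// always starts the next URL; split_urls, URL_PATTERN and the example hosts are illustrative names, not part of the original answer.

```python
import re

# Lazy match that stops just before the next scheme, before whitespace,
# or at the end of the string (same pattern as in the answer above).
URL_PATTERN = re.compile(r'https?://.*?(?=https?://|\s|$)')


def split_urls(text):
    """Return the concatenated URLs found in text as a list (hypothetical helper)."""
    return URL_PATTERN.findall(text)


# Three made-up URLs glued together, mixing http and https.
sample = "http://a.example/x?y=1https://b.example/http://c.example/z\n"
print(split_urls(sample))
```

Because the lookahead does not consume anything, each following URL keeps its own scheme and is picked up by the next match.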



Answer 2:


(https?:\/\/(?:(?!https?:\/\/).)*)

Try this. See the demo:

https://regex101.com/r/tX2bH4/15

import re
p = re.compile(r'(https?:\/\/(?:(?!https?:\/\/).)*)')
test_str = "https://00e9e64bac25fa94607-apidata.googleusercontent.com/download/redacted?qk=AD5uMEnaGx-JIkLyJmEF7IjjU8bQfv_hZTkH_KOeaGZySsQCmdSPZEPHHAzUaUkcDAOZghttps://console.developers.google.com/project/reducted/?authuser=1\n"

# findall returns the text captured by the single group, one entry per URL
print(p.findall(test_str))
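
The pattern in this answer may be easier to follow when written out with re.VERBOSE. The sketch below is the same idea without the redundant \/ escapes and without the capturing group; TEMPERED_URL and the example hosts are illustrative names, not from the original answer.

```python
import re

# At every position, first check that a new URL does not start here,
# and only then consume one character.
TEMPERED_URL = re.compile(r'''
    https?://            # scheme that starts a URL
    (?:                  # then, repeatedly:
        (?!https?://)    #   make sure another URL does not begin here
        .                #   and consume one character (never a newline)
    )*
''', re.VERBOSE)

# Three made-up URLs glued together, mixing http and https.
sample = "http://a.example/x?y=1https://b.example/http://c.example/z\n"
print(TEMPERED_URL.findall(sample))
```

Unlike the lookahead pattern in the first answer, this one does not stop at spaces or tabs, since . matches everything except a newline; that only matters if the input can contain whitespace other than the trailing newline.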


Source: https://stackoverflow.com/questions/27966726/regex-separate-urls-in-text-that-has-no-separators
