Regex to parse out a part of URL

假装没事ソ submitted on 2021-02-05 08:01:24

Question


I have the following data:

data
http://hsotname.com/2016/08/a-b-n-r-y-u
https://www.hostname.com/best-food-for-humans
http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg
http://www.hostname.com/a/geniusbar/
http://www.hsotname.com/m/
http://www.hsotname.com/

I want to skip the leading http:// or https://, find the last '/', and parse out whatever follows it. The challenge is that a few URLs also end with a '/'. The output I want is:

parsed
a-b-n-r-y-u
best-food-for-humans
a-w-w-2.jpg
NULL
NULL
NULL

Can anybody help me find the last / and parse out the remaining part of the URL? I am new to regex, and any help would be appreciated.

Thanks


Answer 1:


Another option is to simply split on "/" and take the last element:

"http://hsotname.com/2016/08/a-b-n-r-y-u".split("/")[-1]
# 'a-b-n-r-y-u'

"http://www.hostname.com/a/geniusbar/".split("/")[-1]
# ''
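
To get the NULL values from the question's expected output, the empty string can be mapped to None. A minimal sketch, assuming a hypothetical helper named last_segment and that every URL has at least one slash after the host (as in the question's data):

```python
def last_segment(url):
    """Return the text after the final '/', or None if nothing follows it."""
    tail = url.split("/")[-1]
    # An empty tail means the URL ended with a slash.
    return tail or None


data = [
    "http://hsotname.com/2016/08/a-b-n-r-y-u",
    "https://www.hostname.com/best-food-for-humans",
    "http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg",
    "http://www.hostname.com/a/geniusbar/",
    "http://www.hsotname.com/m/",
    "http://www.hsotname.com/",
]
parsed = [last_segment(url) for url in data]
# ['a-b-n-r-y-u', 'best-food-for-humans', 'a-w-w-2.jpg', None, None, None]
```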



Answer 2:


Regexes are probably not the way you should do this. Since you tagged the question python, try the following (assuming the URL is stored in a variable named url):

last_part = url.split('/')[-1]

This splits the URL into a list of substrings between slashes and stores the last one in last_part.

If you insist on using regexes, though, matching on the end of the string is helpful here. Try /[^/]*$, which matches a slash, followed by any number of non-slashes, followed by the end of the string.

If you want to match the last non-empty part following a slash (so that the last three examples do not return ""), you could use /[^/]*/?$, which allows but does not require a single slash at the very end.
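
As a rough illustration of how the two patterns behave with Python's re module, here is a small sketch. The names LAST_PART and LAST_NONEMPTY are made up for this example:

```python
import re

LAST_PART = re.compile(r"/[^/]*$")        # slash, then non-slashes up to the end
LAST_NONEMPTY = re.compile(r"/[^/]*/?$")  # same, but tolerates one trailing slash

url = "http://www.hostname.com/a/geniusbar/"

first = LAST_PART.search(url).group(0)       # '/'
second = LAST_NONEMPTY.search(url).group(0)  # '/geniusbar/'

# Both matches include the slashes, so strip them to get the bare segment.
stripped = second.strip("/")                 # 'geniusbar'
```

Note that for a URL such as http://www.hsotname.com/, the second pattern matches /www.hsotname.com/, so it returns the host name rather than the NULL the question asks for.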




Answer 3:


I'd go with something like this:

\/([^/]*)$

It'll match the last slash, then grab anything after it (if anything) that isn't a slash.
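
In Python, the capture group means the slash itself is left out of the result. A minimal sketch, assuming a hypothetical helper named after_last_slash:

```python
import re

# Group 1 holds whatever follows the last slash (possibly nothing).
pattern = re.compile(r"/([^/]*)$")


def after_last_slash(url):
    match = pattern.search(url)
    # group(1) excludes the slash; it is '' when the URL ends in '/'.
    return match.group(1) if match else None


with_file = after_last_slash(
    "http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg"
)  # 'a-w-w-2.jpg'
with_slash = after_last_slash("http://www.hsotname.com/m/")  # ''
```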




Answer 4:


Regex isn't the best tool in this case. Just use str.rfind:

[url[url.rfind('/') + 1:] for url in data]

The + 1 skips the slash itself, so this gives you the part after the last slash (an empty string when the URL ends with one).




Answer 5:


Possibly overkill for this example, but if you need to deal with fragments or with URLs that are just a host name (where the last forward slash is part of the http:// prefix), you're probably safer using urlsplit. Splitting http://hostname.com on / and taking the last element gives you hostname.com, whereas urlsplit gives a path of '' instead:

>>> from urllib.parse import urlsplit
>>> urls = ['http://hsotname.com/2016/08/a-b-n-r-y-u', 'https://www.hostname.com/best-food-for-humans', 'http://www.hostname.com/wp-content/uploads/2014/07/a-w-w-2.jpg', 'http://www.hostname.com/a/geniusbar/', 'http://www.hsotname.com/m/', 'http://www.hsotname.com/']
>>> [urlsplit(url).path.rpartition('/')[2] for url in urls]
['a-b-n-r-y-u', 'best-food-for-humans', 'a-w-w-2.jpg', '', '', '']
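
To make the difference concrete, here is a small comparison on two URLs that are not in the question's data (both are made-up examples): one with no path at all and one with a fragment.

```python
from urllib.parse import urlsplit

url = "http://hostname.com"  # hypothetical host-only URL, no path

naive = url.split("/")[-1]                    # 'hostname.com'
safe = urlsplit(url).path.rpartition("/")[2]  # ''

# urlsplit also separates the fragment from the path.
fragment_url = "http://hostname.com/page#section"  # hypothetical
naive_fragment = fragment_url.split("/")[-1]                    # 'page#section'
safe_fragment = urlsplit(fragment_url).path.rpartition("/")[2]  # 'page'
```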



Answer 6:


Check from the end of the URL and match everything except /:

[^/]+?$

or

\b[^/]+?\b$
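
Because this pattern requires at least one non-slash character before the end, it finds no match at all when the URL ends with a slash, which lines up with the NULL rows in the question. A minimal sketch using the first pattern, assuming a hypothetical helper named parse:

```python
import re

TAIL = re.compile(r"[^/]+?$")  # one or more non-slashes, anchored at the end


def parse(url):
    match = TAIL.search(url)
    # No match means the URL ended with '/', so report None (NULL).
    return match.group(0) if match else None


results = [
    parse("https://www.hostname.com/best-food-for-humans"),
    parse("http://www.hostname.com/a/geniusbar/"),
    parse("http://www.hsotname.com/"),
]
# ['best-food-for-humans', None, None]
```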


Source: https://stackoverflow.com/questions/39233526/regex-to-parse-out-a-part-of-url