Extracting URL link using regular expression re - string matching - Python

Posted by 旧城冷巷雨未停 on 2019-11-30 19:46:04

Question


I've been trying to extract URLs from a text file using the re module: any link that starts with http://, https://, or www.

The file contains plain text as well as HTML source code. The HTML part is easy because I can extract the links with BeautifulSoup, but the plain text is more challenging. I found the patterns below online, and they seem to be the best implementation of URL extraction I could find. However, they fail when a URL is followed by markup: they can't handle tags and include them in the URL. Any help is appreciated, because I'm not familiar with string matching myself.

Here are the patterns I am using:

sp1=re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", str(STRING))
sp2=re.findall('www.(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', str(STRING))

Examples of what they return:

http://www.website.com/science/</span></a><o:p></o:p></span></div><div
www.website.com/library/</span></a></span></i><span
http://awebsite.com/Groups</a><div>

Answer 1:


re.findall(r'https?://[^\s<>"]+|www\.[^\s<>"]+', str(STRING))

The [^\s<>"]+ part matches any run of characters that are not whitespace, double quotes, or angle brackets, so the match stops before the surrounding markup in strings like:

<a href="http://www.example.com/stuff">
http://www.example.com/stuff</br>
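
As a rough check, the pattern above can be run against the strings from the question. This is only a sketch: the `sample` text and the `URL_PATTERN` name are made up for illustration, and the last line shows one likely reason the original patterns swallow tags, namely that `$-_` inside a character class is a range that happens to include `<`, `>`, and `/`.

```python
import re

# Pattern from the answer: a match ends at whitespace, an angle bracket,
# or a double quote.
URL_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')

# Hypothetical input assembled from the examples in the question and answer.
sample = (
    'http://www.website.com/science/</span></a><o:p></o:p></span></div><div\n'
    'www.website.com/library/</span></a></span></i><span\n'
    'http://awebsite.com/Groups</a><div>\n'
    '<a href="http://www.example.com/stuff">'
)

urls = URL_PATTERN.findall(sample)
for url in urls:
    print(url)

# In the question's class [$-_@.&+], "$-_" is a range from "$" (0x24) to
# "_" (0x5F), not three literal characters, so "<" is accepted by it.
original_class_accepts_lt = re.fullmatch(r'[$-_@.&+]', '<') is not None
print(original_class_accepts_lt)
```

Note that this pattern still keeps trailing punctuation, so a URL at the end of a sentence would be returned with its final period or comma; stripping those afterwards may be needed depending on the input.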


Source: https://stackoverflow.com/questions/10475027/extracting-url-link-using-regular-expression-re-string-matching-python
