Extract all URLs in a string with Python 3

Backend · Unresolved · 5 answers · 1865 views
执念已碎 2020-12-22 08:46

I am trying to find a clean way to extract all urls in a text string.

After an extensive search, I have found many posts suggesting using regular expressions to do this, but I was hoping for something that already exists.
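For context, the regex-based approach those posts describe can be sketched as follows (a minimal illustration, not from the original question; the pattern only catches URLs that start with an explicit `http`/`https` scheme):

```python
import re

# A simple, deliberately imperfect pattern: a scheme followed by
# non-whitespace characters. It will miss scheme-less URLs and may
# swallow trailing punctuation in some inputs.
url_pattern = re.compile(r'https?://[^\s]+')

text = "Visit https://example.com/page and http://test.org today."
print(url_pattern.findall(text))
# → ['https://example.com/page', 'http://test.org']
```

Patterns like this break down on bare domains (e.g. `example.com` with no scheme), which is why a dedicated library is often the cleaner choice.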

5 Answers
  •  小蘑菇 (OP)
     2020-12-22 09:15

    Apart from what others mentioned, since you've asked for something that already exists, you might want to try URLExtract.

    Apparently it tries to find any occurrence of a TLD in the given text. If a TLD is found, it starts from that position and expands the boundaries in both directions, searching for a "stop character" (usually white space, a comma, or a single or double quote).

    You have a couple of examples here.

    from urlextract import URLExtract
    
    extractor = URLExtract()
    urls = extractor.find_urls("Let's have URL youfellasleepwhilewritingyourtitle.com as an example.")
    print(urls) # prints: ['youfellasleepwhilewritingyourtitle.com']
    

    It seems that this module also has an update() method which lets you update the TLD list cache file.

    However, if that doesn't fit your specific requirements, you can manually do some checks after you've processed the URLs using the above module (or any other way of parsing the URLs). For example, say you get a list of the URLs:

    result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk'] 
    

    You can then build other lists which hold the allowed protocols / TLDs / domains / etc:

    allowed_protocols = ['protocol_1', 'protocol_2']
    allowed_tlds = ['tld_1', 'tld_2', 'tld_3']
    allowed_domains = ['domain_1']
    
    for each_url in result:
        # here, check each URL against your rules
    
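    One way to flesh out that loop (a sketch with illustrative rule values, using the standard library's urllib.parse to split each URL into scheme and host):

    ```python
    from urllib.parse import urlparse

    # Hypothetical rules for illustration; substitute your own.
    allowed_protocols = ['http', 'https']
    allowed_domains = ['www.lorem.com', 'news.bbc.co.uk']

    result = ['https://www.lorem.com/ipsum.php?q=suas',
              'https://www.lorem.org',
              'http://news.bbc.co.uk']

    filtered = []
    for each_url in result:
        parsed = urlparse(each_url)
        # keep the URL only if both its scheme and its host are allowed
        if parsed.scheme in allowed_protocols and parsed.netloc in allowed_domains:
            filtered.append(each_url)

    print(filtered)
    # → ['https://www.lorem.com/ipsum.php?q=suas', 'http://news.bbc.co.uk']
    ```

    urlparse only splits the string; it does not validate that the host exists, so combine it with whatever reachability or TLD checks your use case needs.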
