Extract all URLs in a string with Python 3

执念已碎 2020-12-22 08:46

I am trying to find a clean way to extract all urls in a text string.

After an extensive search, I have found many posts suggesting using regular expressions to do t

5 answers
  • 2020-12-22 09:12

    If you want a regex, you can use this:

    import re
    
    
    string = "Lorem ipsum dolor sit amet https://www.lorem.com/ipsum.php?q=suas, nusquam tincidunt ex per, ius modus integre no, quando utroque placerat qui no. Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. Elit pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org                                                                    
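    A minimal sketch of the regex approach, assuming a deliberately simple pattern. `URL_PATTERN` and `extract_urls` are illustrative names, and the pattern is not the one from this answer; it only catches links with an explicit http, https or ftp scheme:

```python
import re

# Assumption: a URL starts with an explicit scheme and runs until whitespace,
# a quote character, or an angle bracket.
URL_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>\"']+")


def extract_urls(text):
    """Return every scheme-prefixed URL found in text."""
    # Drop punctuation that usually belongs to the sentence, not to the URL.
    return [match.rstrip(".,;:!?)") for match in URL_PATTERN.findall(text)]


sample = "Lorem ipsum https://www.lorem.com/ipsum.php?q=suas, nusquam https://www.lorem.org."
print(extract_urls(sample))
```

    Links written without a scheme (for example a bare www.lorem.org) are not matched by this pattern.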
  • 2020-12-22 09:15

    Apart from what others mentioned, since you've asked for something that already exists, you might want to try URLExtract.

    Apparently it tries to find any occurrence of a TLD in the given text. If a TLD is found, it starts from that position and expands the boundaries to both sides, searching for a "stop character" (usually white space, a comma, or a single or double quote).
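    To illustrate the idea, here is a toy sketch of that "find a TLD, then expand to the stop characters" strategy. It is not URLExtract's actual code: `TLDS`, `STOP_CHARS` and `find_urls_by_tld` are made-up names, and the three-entry TLD list is an assumption that stands in for a full TLD list.

```python
import re

# Hypothetical, tiny TLD list used only for this sketch.
TLDS = ("com", "org", "uk")
# Characters that end a URL candidate on either side.
STOP_CHARS = set(" \t\n,'\"")

# A dot followed by a known TLD that is not continued by another label character.
TLD_PATTERN = re.compile(r"\.(?:%s)(?![A-Za-z0-9-])" % "|".join(TLDS))


def find_urls_by_tld(text):
    urls = []
    pos = 0
    while True:
        match = TLD_PATTERN.search(text, pos)
        if match is None:
            break
        # Expand to the left until a stop character or the start of the text.
        start = match.start()
        while start > 0 and text[start - 1] not in STOP_CHARS:
            start -= 1
        # Expand to the right until a stop character or the end of the text.
        end = match.end()
        while end < len(text) and text[end] not in STOP_CHARS:
            end += 1
        # Require at least one character before the dot, and drop a sentence-final dot.
        if start < match.start():
            urls.append(text[start:end].rstrip("."))
        pos = end
    return urls


print(find_urls_by_tld("See https://www.lorem.com/ipsum.php?q=suas, and news.bbc.co.uk."))
```

    Because the search is driven by the TLD rather than by a scheme, this kind of approach also picks up links written without http:// in front.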

    You have a couple of examples here.

    from urlextract import URLExtract
    
    extractor = URLExtract()
    urls = extractor.find_urls("Let's have URL youfellasleepwhilewritingyourtitle.com as an example.")
    print(urls) # prints: ['youfellasleepwhilewritingyourtitle.com']
    

    It seems that this module also has an update() method which lets you update the TLD list cache file.

    However, if that doesn't fit your specific requirements, you can manually do some checks after you've processed the URLs using the above module (or any other way of parsing the URLs). For example, say you get a list of the URLs:

    result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk'] 
    

    You can then build further lists which hold the allowed protocols / TLDs / domains / etc.:

    allowed_protocols = ['protocol_1', 'protocol_2']
    allowed_tlds = ['tld_1', 'tld_2', 'tld_3']
    allowed_domains = ['domain_1']
    
    for each_url in result:
        # here, check each url against your rules
        pass
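    One possible way to write the check inside that loop is with `urllib.parse` from the standard library. This is only a sketch: the helper name `is_allowed` and the concrete allow-list values are assumptions, and treating the last dot-separated label of the host as the TLD is a simplification (it sees "uk", not "co.uk").

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_protocols, allowed_tlds, allowed_domains):
    """Return True if the URL's scheme, TLD and domain are all on the allow-lists."""
    parts = urlparse(url)
    host = parts.hostname or ""
    # Simplification: the TLD is taken to be the last dot-separated label.
    tld = host.rsplit(".", 1)[-1]
    # A domain matches if the host equals it or is one of its subdomains.
    domain_ok = any(host == d or host.endswith("." + d) for d in allowed_domains)
    return parts.scheme in allowed_protocols and tld in allowed_tlds and domain_ok


result = ['https://www.lorem.com/ipsum.php?q=suas', 'https://www.lorem.org', 'http://news.bbc.co.uk']

# Example allow-lists (assumed values, adjust to your own rules).
filtered = [
    url for url in result
    if is_allowed(url, {"https"}, {"com", "org"}, {"lorem.com", "lorem.org"})
]
print(filtered)
```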
    
  • 2020-12-22 09:19

    Using an existing library is probably the best solution.

    But it was too much for my tiny script, and, inspired by @piotr-wasilewicz's answer, I came up with:

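    from string import ascii_letters
    # `line` is the text to scan. ''.join(...) builds the strip set from the word's non-letter
    # characters (str() of an empty set would give 'set()' and strip real letters).
    links = [x for x in line.split() if x.strip(''.join(set(x) - set(ascii_letters))).startswith(('http', 'https', 'www'))]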
    
    • for each word in the line,
    • strip (from the beginning and the end) the characters found in the word itself that are not ASCII letters,
    • and keep only the words starting with one of https, http, www.
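    The three steps above can be sketched as a plain loop, which may be easier to read than the one-liner. The function name `find_links` is made up for this sketch; like the one-liner, it returns the original word, so surrounding punctuation is kept in the result.

```python
from string import ascii_letters


def find_links(line):
    """Return the whitespace-separated words of line that look like links."""
    links = []
    for word in line.split():
        # Characters of this word that are not ASCII letters.
        non_letters = "".join(set(word) - set(ascii_letters))
        # Remove them from both ends only; the middle of the word is untouched.
        trimmed = word.strip(non_letters)
        # 'http' also covers 'https'.
        if trimmed.startswith(("http", "www")):
            links.append(word)
    return links


print(find_links("see (https://www.lorem.org), or www.example.com now"))
```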

    A bit too dense for my taste and I have no clue how fast it is, but it should detect most "sane" urls in a string.

  • 2020-12-22 09:24
    output = [x for x in input().split() if x.startswith('http://') or x.startswith('https://') or x.startswith('ftp://')]
    print(output)
    

    Your example: http://ideone.com/wys57x

    On top of that, you can also cut off the last character of each element of the list while it is not a letter.

    EDIT:

    output = [x for x in input().split() if x.startswith('http://') or x.startswith('https://') or x.startswith('ftp://')]
    newOutput = []
    for link in output:
        copy = link
        while not copy[-1].isalpha():
            copy = copy[:-1]
        newOutput.append(copy)
    print(newOutput)
    

    Your example: http://ideone.com/gHRQ8w
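    Note that cutting every trailing non-letter also removes digits and slashes that belong to the URL. A gentler variant, sketched here under the assumption that only sentence punctuation should go (`TRAILING_PUNCTUATION` and `trim_trailing_punctuation` are illustrative names), could look like this:

```python
# Assumption: these are the characters that typically follow a URL in running text.
TRAILING_PUNCTUATION = ".,;:!?)]}'\""


def trim_trailing_punctuation(url):
    """Strip sentence punctuation from the end of a URL, keeping digits and slashes."""
    return url.rstrip(TRAILING_PUNCTUATION)


words = "go to http://example.com/page2, or ftp://link.com/ today.".split()
links = [
    trim_trailing_punctuation(w)
    for w in words
    if w.startswith(("http://", "https://", "ftp://"))
]
print(links)
```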

  • 2020-12-22 09:27
    import re
    import string
    text = """
    Lorem ipsum dolor sit amet https://www.lore-m.com/ipsum.php?q=suas, 
    nusquam tincidunt ex per, ftp://link.com ius modus integre no, quando utroque placerat qui no. 
    Mea conclusionemque vituperatoribus et, omnes malorum est id, pri omnes atomorum expetenda ex. 
    Elit ftp://link.work.in pertinacia no eos, nonumy comprehensam id mei. Ei eum maiestatis quaerendum https://www.lorem.org                                                                    