replace URLs in text with links to URLs

逝去的感伤 2020-12-16 21:40

Using Python, I want to replace all URLs in a body of text with links to those URLs, like what Gmail does. Can this be done with a one-liner regular expression?

Edit: b

5 Answers
  • 2020-12-16 22:02

    I hunted around a lot, tried these solutions and was not happy with their readability or features, so I rolled the following:

    import re

    _urlfinderregex = re.compile(r'http([^\.\s]+\.[^\.\s]*)+[^\.\s]{2,}')

    def linkify(text, maxlinklength):
        def replacewithlink(matchobj):
            url = matchobj.group(0)
            text = str(url)
            # Strip the scheme and any leading "www." from the visible link text.
            if text.startswith('http://'):
                text = text.replace('http://', '', 1)
            elif text.startswith('https://'):
                text = text.replace('https://', '', 1)

            if text.startswith('www.'):
                text = text.replace('www.', '', 1)

            # Shorten very long URLs in the middle so they don't break the layout.
            if len(text) > maxlinklength:
                halflength = maxlinklength // 2
                text = text[0:halflength] + '...' + text[len(text) - halflength:]

            return '<a class="comurl" href="' + url + '" target="_blank" rel="nofollow">' + text + '<img class="imglink" src="/images/linkout.png"></a>'

        if text:
            return _urlfinderregex.sub(replacewithlink, text)
        else:
            return ''

    You'll need to get a link-out image, but that's pretty easy. This is specifically for user-submitted text like comments, which I assume is usually what people are dealing with.
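
    A quick usage sketch (the 30-character display limit is arbitrary):

    print(linkify('See https://www.example.com/some/very/long/path for details', 30))
    # The URL is wrapped in an <a class="comurl"> link and its visible text is
    # shortened to about 30 characters: "example.com/som.../very/long/path"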

  • 2020-12-16 22:08
    /\w+:\/\/[^\s]+/
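
    In Python, that pattern might be applied like this (a minimal sketch; the function and variable names are illustrative):

    import re

    # \w+:// matches any scheme followed by "://", then a run of non-whitespace.
    link_pattern = re.compile(r'\w+://[^\s]+')

    def add_links(text):
        # Wrap every matched URL in an anchor tag pointing at itself.
        return link_pattern.sub(lambda m: '<a href="%s">%s</a>' % (m.group(0), m.group(0)), text)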
    
  • 2020-12-16 22:10

    When you say "body of text", do you mean a plain text file or the body text of an HTML document? If it's an HTML document, you will want to use Beautiful Soup to parse it; then search through the body text and insert the tags.

    Matching the actual URLs is probably best done with the urlparse module. Full discussion here: How do you validate a URL with a regular expression in Python?
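
    A rough sketch of that approach with Beautiful Soup (the simple scheme-based pattern here is illustrative; stricter matching could validate candidates with urlparse, as the linked question discusses):

    import re
    from bs4 import BeautifulSoup

    url_re = re.compile(r'https?://[^\s<>"]+')

    def linkify_html(html):
        soup = BeautifulSoup(html, 'html.parser')
        for node in soup.find_all(string=True):
            # Skip text that is already inside an anchor or contains no URL.
            if node.parent.name == 'a' or not url_re.search(node):
                continue
            linked = url_re.sub(lambda m: '<a href="%s">%s</a>' % (m.group(0), m.group(0)), str(node))
            node.replace_with(BeautifulSoup(linked, 'html.parser'))
        return str(soup)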

  • 2020-12-16 22:19

    Gmail is a lot more liberal when it comes to URLs, but it is not always right either. For example, it will turn www.a.b into a hyperlink, as well as http://a.b, but it often fails because of wrapped text and uncommon (but valid) URL characters.

    See Appendix A, Collected BNF for URI, for the syntax, and use that to build a reasonable regular expression that also considers what surrounds the URL. You'd be well advised to consider a couple of scenarios for where URLs might end up.
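
    As an illustration of that kind of context-sensitivity, here is a sketch that keeps trailing sentence punctuation out of the link (a simplified pattern only, not the full BNF):

    import re

    # Capture the URL, then peel off trailing punctuation that likely belongs to the sentence.
    url_re = re.compile(r'''(https?://[^\s<>"]+?)([.,;:!?)\]]*)(?=\s|$)''')

    def urlize(text):
        return url_re.sub(lambda m: '<a href="%s">%s</a>%s' % (m.group(1), m.group(1), m.group(2)), text)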

  • 2020-12-16 22:23

    You can load the document up with a DOM/HTML parsing library (see html5lib), grab all the text nodes, match them against a regular expression, and replace each text node with a regex substitution that wraps the URI in an anchor, using a PCRE such as:

    /(https?:[;\/?\\@&=+$,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%][\;\/\?\:\@\&\=\+\$\,\[\]A-Za-z0-9\-_\.\!\~\*\'\(\)%#]*|[KZ]:\\*.*\w+)/g
    

    I'm quite sure you can scour around and find some sort of utility that does this; I can't think of any off the top of my head, though.

    Edit: Try using the answers here: How do I get python-markdown to additionally "urlify" links when formatting plain text?

    import re
    
    # Matches an IPv4 address or a scheme:// / www. / ftp. host, an optional port, and a path.
    urlfinder = re.compile("(([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}|(((news|telnet|nttp|file|http|ftp|https)://)|(www|ftp)[-A-Za-z0-9]*\\.)[-A-Za-z0-9\\.]+)(:[0-9]*)?/[-A-Za-z0-9_\\$\\.\\+\\!\\*\\(\\),;:@&=\\?/~\\#\\%]*[^]'\\.}>\\),\\\"])")
    
    def urlify2(value):
        return urlfinder.sub(r'<a href="\1">\1</a>', value)
    

    Call urlify2 on a string, and I think that's it if you aren't dealing with a DOM object.
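
    For example (a hypothetical input; note that the pattern requires a path after the host, and trailing sentence punctuation stays outside the link):

    print(urlify2('See the docs at http://example.com/docs/intro.html.'))
    # -> See the docs at <a href="http://example.com/docs/intro.html">http://example.com/docs/intro.html</a>.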
