Question
I'm trying to replace URLs contained in a block of HTML that users post into an old web app with proper anchors (<A>) for those URLs.
The problem is that a URL may already be 'anchored', that is, contained in an <A> element. Those URLs should not be replaced.
Example:
<a href="http://noreplace.com">http://noreplace.com</a> <- do not replace
<a href="http://noreplace.com"><u>http://noreplace.com</u></a> <- do not replace
<a href="...">...</a>http://replace.com <- replace
What would a regex that matches only the 'not anchored' URLs look like?
I use the following function to replace with RegEx:
Function ReplaceRegExp(strString, strPattern, strReplace)
    Dim RE: Set RE = New RegExp
    With RE
        .Pattern = strPattern
        .IgnoreCase = True
        .Global = True
        ReplaceRegExp = .Replace(strString, strReplace)
    End With
End Function
The following non-greedy regex is used to format UBB URLs. Can it be adapted to match only the URLs I need?
' the double doublequote in the brackets is because
' double doublequoting is ASP escaping for doublequotes
strString = ReplaceRegExp(strString, "\[URL=[""]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?[""]?\](.*?)\[/URL\]", "<a href=""$1$2$3$5"" target=""_blank"">$6</a>")
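For anyone who wants to try the UBB pattern outside ASP, here is a minimal sketch of the same replacement in Python, chosen only because it runs standalone. The helper name `ubb_to_anchor` is made up for illustration. The pattern is copied from the call above, with the doubled double quotes un-escaped and `$n` written as `\n` in the replacement string.

```python
import re

# Same pattern as the VBScript call above; only the string escaping differs.
UBB_URL = re.compile(
    r'\[URL=["]?(http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)'
    r'([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?["]?\](.*?)\[/URL\]',
    re.IGNORECASE,
)


def ubb_to_anchor(text):
    """Turn [URL=...]label[/URL] tags into HTML anchors (hypothetical helper)."""
    # Group 5 (the path) is optional; Python 3.5+ substitutes an empty
    # string for a group that did not take part in the match.
    return UBB_URL.sub(r'<a href="\1\2\3\5" target="_blank">\6</a>', text)


if __name__ == "__main__":
    print(ubb_to_anchor('[URL=http://example.com/page]Example[/URL]'))
```

Note that this only handles URLs wrapped in [URL=...] tags; bare URLs in the text are left alone, which is exactly the gap the question is about.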
If this really cannot be done with a regex, what would the solution be in Classic ASP? Some code or pseudocode would be appreciated. That said, I would much rather keep the code simple by adding one more regex line than add new functions to this old code.
Thanks for your effort!
Answer 1:
Regular expressions alone seemed too complex for this kind of job, so I dusted off my rusty VBScript skills and wrote a function that first swaps the existing anchors for placeholders, then replaces the remaining URLs, and finally puts the original anchors back.
Here it is in case somebody needs it:
Function Linkify(ByVal Text)
    ' ByVal so the caller's string is not modified in place
    Dim regEx, Match, Matches, patternURLs, patternAnchors
    Dim lCount, anchorCount, replacements, key
    patternURLs = "((http|ftp|https)(:\/\/[\w\-_]+)((\.[\w\-_]+)+)([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?)"
    patternAnchors = "<a[^>]*?>.*?</a>"
    Set replacements = Server.CreateObject("Scripting.Dictionary")

    ' Create the regular expression.
    Set regEx = New RegExp
    regEx.Pattern = patternAnchors
    regEx.IgnoreCase = True
    regEx.Global = True

    ' Search for the existing anchors.
    Set Matches = regEx.Execute(Text)
    lCount = 0
    ' Replace each existing anchor with a numbered placeholder.
    For Each Match In Matches
        key = "<#" & lCount & "#>"
        replacements.Add key, Match.Value
        Text = Replace(Text, CStr(Match.Value), key)
        lCount = lCount + 1
    Next
    anchorCount = lCount

    ' Now search for the remaining (unanchored) URLs
    regEx.Pattern = patternURLs
    ' and wrap each of them in an anchor.
    Text = regEx.Replace(Text, "<a href=""$1"">$1</a>")

    ' Put back the originally existing anchors.
    For lCount = 0 To anchorCount - 1
        key = "<#" & lCount & "#>"
        Text = Replace(Text, key, replacements.Item(key))
    Next
    Linkify = Text
End Function
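The same placeholder idea can be sketched in Python, which makes it easy to run outside IIS. Treat this as an illustration under assumptions rather than a port of the function above: `linkify`, `ANCHOR_RE`, `URL_RE` and `PLACEHOLDER_RE` are made-up names, the anchor pattern additionally uses `\b` and lets `.` match newlines, and placeholders such as `<#0#>` are assumed not to occur in what users post.

```python
import re

# Existing anchors, including any markup nested inside them.
ANCHOR_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)
# URL pattern modelled on the one used in the VBScript function.
URL_RE = re.compile(
    r"(?:https?|ftp)://[\w\-]+(?:\.[\w\-]+)+"
    r"(?:[\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?",
    re.IGNORECASE,
)
PLACEHOLDER_RE = re.compile(r"<#(\d+)#>")


def linkify(text):
    """Wrap bare URLs in anchors while leaving existing anchors untouched."""
    saved = []

    def stash(match):
        # Remember the anchor and leave a numbered placeholder behind.
        saved.append(match.group(0))
        return "<#%d#>" % (len(saved) - 1)

    def restore(match):
        # Only restore placeholders this call created.
        index = int(match.group(1))
        return saved[index] if index < len(saved) else match.group(0)

    text = ANCHOR_RE.sub(stash, text)
    text = URL_RE.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)
    return PLACEHOLDER_RE.sub(restore, text)


if __name__ == "__main__":
    print(linkify('<a href="...">...</a>http://replace.com'))
```

Because the anchors are taken out of the text before the URL pattern runs, a URL that directly follows a closing `</a>` (the third example in the question) is still replaced.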
Answer 2:
The answer you're looking for lies in negative and positive lookaheads and lookbehinds.
This article gives a pretty good overview: http://www.regular-expressions.info/lookaround.html
Here's the Regular Expression I've formulated for your case:
(?<!"|>)(ht|f)tps?://.*?(?=\s|$)
Here's some sample data I matched against:
#Matches
http://www.website.com
https://www.website.com
This is a link http://www.website.com that is not linked
This is a long link http://www.website.com/index.htm?foo=bar
ftp://www.website.com
#No Matches
<u>http://www.website.com</u>
<a href="http://www.website.com">http://website.com</a>
<a href="https://www.website.com">http://website.com</a>
<a href="http://www.website.com"><u>http://www.website.com</u></a>
<a href="ftp://www.website.com">ftp://www.website.com</a>
Here's a breakdown of what the regular expression is doing:
(?<!"|>)
A negative look behind, making sure what matches next isn't preceded by a " or >
(ht|f)tps?://.*?
This looks for http, https, or ftp and anything following it. It'll also match ftps! If you want to avoid this, you could use (https?|ftp)://.*? instead
(?=\s|$)
This is a positive lookahead: it requires the match to be followed by whitespace or the end of the line, without making that character part of the match.
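The sample data above can be checked with a few lines of Python. One caveat: as far as I know, the VBScript RegExp object supports lookahead but not lookbehind, so the pattern may be rejected in Classic ASP. For that reason this sketch uses Python's `re` module, which accepts it. The names `UNANCHORED` and `find_unanchored` are invented for the example, and each sample is tested as a single line.

```python
import re

# Pattern from the answer, unchanged.
UNANCHORED = re.compile(r'(?<!"|>)(ht|f)tps?://.*?(?=\s|$)')


def find_unanchored(line):
    """Return the first bare URL found in a single line, or None."""
    match = UNANCHORED.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    samples = [
        'This is a link http://www.website.com that is not linked',
        '<a href="http://www.website.com"><u>http://www.website.com</u></a>',
    ]
    for line in samples:
        print(repr(line), '->', find_unanchored(line))
```

One thing to keep in mind: a URL that directly follows a closing tag, as in the question's third example (`</a>http://replace.com`), is preceded by `>` and is therefore skipped by this pattern as well.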
EXTRA CREDIT
(ht)?(?(1)tps?|ftp)://
This will match http, https, and ftp, but not ftps. It may be overkill when you can simply use (https?|ftp)://, but it's an awesome example of if/else in a regex.
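Conditionals of this form are engine-specific. Python's `re` module happens to support the `(?(1)yes|no)` syntax, so here is a small sketch to see it in action; whether the VBScript RegExp object accepts it is doubtful, and I would not count on it. `SCHEME` and `has_allowed_scheme` are names made up for the example.

```python
import re

# If group 1 ("ht") matched, continue with "tp" or "tps"; otherwise require "ftp".
SCHEME = re.compile(r"(ht)?(?(1)tps?|ftp)://")


def has_allowed_scheme(url):
    """True when the string starts with http://, https:// or ftp://."""
    return SCHEME.match(url) is not None


if __name__ == "__main__":
    for url in ("http://a", "https://a", "ftp://a", "ftps://a"):
        print(url, has_allowed_scheme(url))
```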
Answer 3:
Some design issues you're going to have to work around:
- Embedded URLs could be absolute or relative and may not include the protocol.
- Your HTML may not have quotes around attribute values.
- The character right after a URL may also be a valid URL character.
- There are lots of valid URL characters these days.
If you can assume (1) absolute URLs with protocols and (2) quoted HTML attributes and (3) people will have whitespace after a URL and (4) you're sticking with supporting only basic URL characters, you can just look for URLs not preceded by a double-quote.
Here's an overly simple example to start with (untested):
(?<!")((http|https|ftp)://[^\s<>]+)(?=\s|$) replaced with <a href="$1">$1</a>
The [^\s<>]+ part above is ridiculously greedy, so all of the fun will be in tweaking it to build a character set that fits the URLs your users are typing in. Your example shows a much more involved character class with \w plus a hodge-podge of other allowed characters, so you could start there if you want.
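If the regex engine at hand has no lookbehind (which, as far as I know, is the case for the VBScript RegExp object), a common workaround is to capture the character before the URL and put it back in the replacement. Below is a minimal sketch of that idea in Python, under the same assumptions as above (absolute URLs, quoted attributes). `BARE_URL` and `linkify_bare` are hypothetical names, and the prefix class also excludes `>` so that the link text of an existing anchor is skipped.

```python
import re

# Group 1: start of string or any character other than " and >.
# Group 2: the URL itself, using the simple character class from the answer.
BARE_URL = re.compile(r'(^|[^">])((?:https?|ftp)://[^\s<>]+)', re.IGNORECASE)


def linkify_bare(text):
    """Wrap URLs that are not preceded by a double quote or > in anchors."""
    # \1 re-emits the captured prefix character so it is not lost.
    return BARE_URL.sub(r'\1<a href="\2">\2</a>', text)


if __name__ == "__main__":
    print(linkify_bare('see http://replace.com now'))
```

The equivalent replacement string in a `$n`-style engine would be `$1<a href="$2">$2</a>`. Like the lookbehind version, this skips a URL that directly follows a closing tag such as `</a>`.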
Source: https://stackoverflow.com/questions/7252866/regex-replace-all-url-s-that-are-not-anchored