I want to write a regular expression that will replace the word Paris with a link, but only where the word is not already part of a link.
Example:
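i'm living <a href="Paris" atl="Paris link">in Paris</a>, near Paris <a href="gare">Gare du Nord</a>, i love Paris.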
$pattern = 'Paris';
$text = 'i\'m living <a href="Paris" atl="Paris link">in Paris</a>, near Paris <a href="gare">Gare du Nord</a>, i love Paris.';
// Escape the keyword so that any regex metacharacters in it match literally
$quoted = preg_quote($pattern, '@');
// 1. Build two arrays:
// $matches[1] - links that already contain our keyword
// $matches[2] - bare occurrences of the keyword
preg_match_all('@(<a[^>]*?>[^<]*?'.$quoted.'[^<]*?</a>)|(?<!\pL)('.$quoted.')(?!\pL)@', $text, $matches);
// Keep only the real links; the entries that belong to bare keywords are empty strings
$links = array_values(array_filter($matches[1], 'strlen'));
// Is there a keyword to replace? Find the first keyword outside an <a> tag
$number = array_search($pattern, $matches[2]);
// A bare keyword exists, so do the replacement
if ($number !== FALSE) {
    // 2. Replace every existing link with a temporary placeholder
    // (the trailing "_" keeps placeholder 1 from matching inside placeholder 10)
    foreach ($links as $k => $tag) {
        $text = preg_replace('@<a[^>]*?>[^<]*?'.$quoted.'[^<]*?</a>@', 'KEYWORD_IS_ALREADY_LINK_'.$k.'_', $text, 1);
    }
    // 3. Wrap the remaining bare keywords in a link
    $text = preg_replace('@(?<!\pL)('.$quoted.')(?!\pL)@', '<a href="">$1</a>', $text);
    // 4. Put the original links back
    foreach ($links as $k => $tag) {
        $text = str_replace('KEYWORD_IS_ALREADY_LINK_'.$k.'_', $tag, $text);
    }
    // It works!
    echo $text;
}
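For comparison, here is a minimal Python sketch of the same placeholder idea: stash the links that already contain the keyword, link the bare keywords, then restore the stashed links. The function name link_keyword, the NUL-delimited placeholder format, and the letter class [^\W\d_] (standing in for \pL) are my own assumptions, and the sketch assumes the input text contains no NUL characters.

```python
import re

def link_keyword(text, keyword, href=""):
    """Wrap bare occurrences of keyword in a link, leaving existing links alone."""
    quoted = re.escape(keyword)
    # A link whose body (with no nested tags) contains the keyword
    link_re = re.compile(r"<a[^>]*?>[^<]*?" + quoted + r"[^<]*?</a>")
    # The keyword when it is not glued to another letter on either side
    word_re = re.compile(r"(?<![^\W\d_])" + quoted + r"(?![^\W\d_])")
    saved = []

    def stash(match):
        # Remember the link and leave a numbered placeholder behind
        saved.append(match.group(0))
        return "\x00%d\x00" % (len(saved) - 1)

    text = link_re.sub(stash, text)
    text = word_re.sub(lambda m: '<a href="%s">%s</a>' % (href, m.group(0)), text)
    # Restore the stashed links in one pass
    return re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], text)
```

Collecting the links inside the substitution callback keeps the placeholder numbering and the saved list in sync, so no separate match pass is needed.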
Regexes don't replace. Languages do.
Languages and libraries would also read from the database or file that holds the list of words you care about, and associate a URL with each name. Here's the simplest substitution I can imagine with a single regex (Perl is used for the replacement syntax):
s/([a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/i
Proper names might work better:
s/([A-Z][a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/g;
Of course "Baton Rouge" would become two links:
<a href="http://en.wikipedia.org/wiki/Baton">Baton</a>
<a href="http://en.wikipedia.org/wiki/Rouge">Rouge</a>
In Perl, you can do this:
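# Longest names first, so "Baton Rouge" wins over a shorter name with the same prefix
my $barred_list_of_cities
    = join( '|'
          , map  { quotemeta }
            sort { ( length($b) <=> length($a) ) || ( $a cmp $b ) } keys %url_for_city_of
          );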
s/($barred_list_of_cities)/<a href="$url_for_city_of{$1}">$1<\/a>/g;
But again, it's the language that implements a set of operations for regexes; regexes on their own don't do a thing. (In reality, this is such a common application that I'd be surprised if there isn't a CPAN module out there somewhere that does this, so that you just need to load the hash.)
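To make the lookup-table idea concrete, here is a rough Python sketch of the same technique: build one alternation from the keys of a dictionary, longest names first, and look up the URL for whatever matched. The table URL_FOR_CITY and the function link_cities are hypothetical names with illustrative entries. Like the Perl above, it makes no attempt to skip names that are already inside a link.

```python
import re

# Hypothetical lookup table; in practice it would be loaded from a file or database.
URL_FOR_CITY = {
    "Baton": "http://en.wikipedia.org/wiki/Baton",
    "Baton Rouge": "http://en.wikipedia.org/wiki/Baton_Rouge",
    "Paris": "http://en.wikipedia.org/wiki/Paris",
}

def link_cities(text, url_for_city=URL_FOR_CITY):
    """Replace every known city name with a link to its URL."""
    # Longest names first, so "Baton Rouge" is tried before "Baton"
    names = sorted(url_for_city, key=lambda name: (-len(name), name))
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b")

    def to_link(match):
        name = match.group(0)
        return '<a href="%s">%s</a>' % (url_for_city[name], name)

    return pattern.sub(to_link, text)
```

Without the longest-first ordering, the alternation would stop at "Baton" and leave "Rouge" unlinked.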
Regular expression:
!(<a.*</a>.*)*Paris!isU
Replacement:
$1<a href="Paris">Paris</a>
$1 refers to the first sub-pattern (at least in PHP). Depending on the language you use, the syntax could be slightly different.
This should replace all occurrences of "Paris" with the link in the replacement. It just checks whether all opening a tags were closed before "Paris".
PHP example:
<?php
$s = 'i\'m living <a href="Paris" atl="Paris link">in Paris</a>, near Paris <a href="gare">Gare du Nord</a>, i love Paris.';
$regex = '!(<a.*</a>.*)*Paris!isU';
$replace = '$1<a href="Paris">Paris</a>';
$result = preg_replace( $regex, $replace, $s);
?>
Addition:
This is not the best solution. One situation where this regex won't work is when you have an img tag that is not within an a element. If the title attribute of that image is set to "Paris", this "Paris" will be replaced too, and that's not what you want. Nevertheless, I see no way to solve your problem completely with a simple regular expression.
This is hard to do in one step. Writing a single regex that does that is virtually impossible.
Try a two-step approach.
1. Put a link around every "Paris" there is, regardless of whether it already sits inside another link.
2. Find all nested links (<a href="..."><a href="...">Paris</a></a>), and eliminate the inner link.
Regex for step one is dead-simple:
\bParis\b
Regex for step two is slightly more complex:
(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>
Use that one on the whole string and replace it with the content of match groups 1 and 2, effectively removing the surplus inner link.
Explanation of regex #2 in plain words:
- Find an opening link tag (<a[^>]+>), optionally followed by any text that does not contain a closing link tag ((?:(?!</a>).)*?). Save it into match group 1.
- Find another opening link tag (<a[^>]+>). Make sure it is there, but do not save it.
- Find the word "Paris". Save it into match group 2.
- Find a closing link tag (</a>). Make sure it is there, but don't save it.
The approach assumes these side conditions:
- Your regex flavor supports negative look-ahead ((?!...)).
- Step one produces exactly this form: "Paris" becomes "<a href="...">Paris</a>", or step two will fail (until you change the second regex).
BTW: regex #2 explicitly allows for constructs like this:
<a href="">in the <b>capital of France</b>, <a href="">Paris</a></a>
The surplus link comes from step one; the replacement result of step two will be:
<a href="">in the <b>capital of France</b>, Paris</a>
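The two steps can be sketched in Python roughly as follows. The function name link_paris_two_step and the empty href are placeholders of mine. The step-two pattern here uses a tempered look-ahead, (?:(?!</a>).)*?, so that group 1 can never run past a closing </a>; treat that as my variation on the idea, not a definitive form. As with the plain \bParis\b of step one, a "Paris" inside an attribute value would also be wrapped, so the sketch assumes the keyword only occurs in text content.

```python
import re

# Placeholder link produced by step one (empty href, as in the example above)
LINK = '<a href="">Paris</a>'

# Outer opening tag plus text without a closing </a> (group 1),
# then the surplus inner link around "Paris" (group 2 is the word itself)
NESTED = re.compile(r"(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>", re.S)

def link_paris_two_step(text):
    # Step 1: wrap every "Paris", even where it is already inside a link
    linked = re.sub(r"\bParis\b", LINK, text)
    # Step 2: where a link ended up nested, keep the outer link and drop the inner one
    return NESTED.sub(r"\1\2", linked)
```

Running it on text that is already fully linked should leave the text unchanged, because step two undoes exactly what step one added inside existing links.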
If you aren't limited to regular expressions in this case, XSLT is a good choice of language in which to define this replacement, because it 'understands' XML.
You define two templates: One template finds links and removes those links that don't have "Paris" as the body text. Another template finds everything else, splits it into words and adds tags.
You could search for this regular expression:
(<a[^>]*>.*?</a>)|Paris
This regex matches a link, which it captures into the first (and only) capturing group, or the word Paris.
Replace the match with your link only if the capturing group did not match anything.
E.g. in C#:
resultString =
    Regex.Replace(
        subjectString,
        "(<a[^>]*>.*?</a>)|Paris",
        new MatchEvaluator(ComputeReplacement));

public String ComputeReplacement(Match m) {
    // Group 1 participates only when an existing link was matched
    if (m.Groups[1].Success) {
        return m.Groups[1].Value;
    } else {
        return "<a href=\"link to paris\">Paris</a>";
    }
}
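The same match-evaluator idea carries over to any language whose replace function accepts a callback. As an illustration only, here is a small Python sketch using re.sub with a function; the names LINK_OR_WORD, compute_replacement and link_paris are mine, and the href is the same placeholder as in the C# snippet.

```python
import re

# Either a whole existing link (captured in group 1) or the bare word Paris
LINK_OR_WORD = re.compile(r"(<a[^>]*>.*?</a>)|Paris", re.S)

def compute_replacement(match):
    # Group 1 is set only when an existing link was matched: return it untouched
    if match.group(1) is not None:
        return match.group(1)
    # Otherwise the bare word matched: wrap it in a link
    return '<a href="link to paris">Paris</a>'

def link_paris(text):
    return LINK_OR_WORD.sub(compute_replacement, text)
```

Because the regex engine scans left to right, an existing link is always consumed as a whole before any "Paris" inside it can match on its own.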