I am trying to write a regex rule to find all <a href> HTML links on my webpage and add rel="nofollow" to them.
However, I have a list of URLs that must be exc
An improvement to James' regex:
(<a\s(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>
This regex matches only links whose host is NOT in the string array $follow_list. The strings don't need a leading 'www'. :)
The advantage is that this regex preserves the other attributes in the tag (like target, style, title...). If a rel
attribute already exists in the tag, the regex will NOT match, so you can force follows on URLs not in $follow_list by giving those links an explicit rel attribute.
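To make the whitelist part easier to see in isolation, here is a minimal sketch of just the negative lookahead, applied to bare URLs instead of whole tags. The helper name is_nofollow_candidate() and the sample URLs are made up for illustration, and the preg_quote() call is my addition (it is not in the regex above); it escapes the dots so that 'google.com' cannot also match something like 'googlexcom'.

```php
<?php
// Hypothetical whitelist in the same style as $follow_list.
$follow_list = array('google.com', 'mypage.com');

// Assumption: escape each host for use inside a %...% delimited pattern.
$hosts = implode('|', array_map(function ($h) {
    return preg_quote($h, '%');
}, $follow_list));

// The negative lookahead fails when the text right after "://"
// starts with a listed host, with or without a leading "www.".
$pattern = '%^https?://(?!(?:www\.)?(?:' . $hosts . '))%';

// Hypothetical helper: true when the URL would receive rel="nofollow".
function is_nofollow_candidate($pattern, $url) {
    return preg_match($pattern, $url) === 1;
}

var_dump(is_nofollow_candidate($pattern, 'http://www.google.com/search')); // listed host
var_dump(is_nofollow_candidate($pattern, 'https://example.org/'));          // unlisted host
```

Note that this is a prefix test on the host, so a URL such as http://google.com.example.org/ would also count as whitelisted; anchoring the host with a following (?=[/"?#:]|$) would be one way to tighten it.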
Replace the match with:
$1$2$3"$4 rel="nofollow">
Full example (PHP):
function dont_follow_links( $html ) {
    // follow these websites only!
    $follow_list = array(
        'google.com',
        'mypage.com',
        'otherpage.com',
    );
    return preg_replace(
        '%(<a\s(?!.*\brel=)[^>]*)(href="https?://)((?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"((?!.*\brel=)[^>]*)(?:[^>]*)>%',
        '$1$2$3"$4 rel="nofollow">',
        $html);
}
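For anyone who wants to try this on sample markup, below is a self-contained sketch of a slightly stricter variant. It is not the exact function above: the name add_nofollow_sketch(), the sample HTML and the second parameter are assumptions for the demo. It also differs in three ways that I would treat as optional tweaks: hosts go through preg_quote(), the rel check uses [^>]* instead of .* so that it only looks inside the current tag, and the i flag is added for uppercase markup.

```php
<?php
// Hypothetical variant of dont_follow_links(); the whitelist is passed in.
function add_nofollow_sketch($html, array $follow_list) {
    // Assumption: escape hosts so "." only matches a literal dot.
    $hosts = implode('|', array_map(function ($h) {
        return preg_quote($h, '%');
    }, $follow_list));

    // $1 = tag start up to href, $2 = href="http(s)://, $3 = rest of URL,
    // $4 = remaining attributes. Tags that already carry rel= are skipped.
    $pattern = '%(<a\s(?![^>]*\brel=)[^>]*)(href="https?://)'
             . '((?!(?:www\.)?(?:' . $hosts . '))[^"]+)"([^>]*)>%i';

    return preg_replace($pattern, '$1$2$3"$4 rel="nofollow">', $html);
}

// Three sample links: whitelisted, plain external, external with its own rel.
$html = '<a href="http://www.google.com/">G</a> '
      . '<a target="_blank" href="http://example.org/x" title="t">E</a> '
      . '<a href="http://example.net/" rel="author">N</a>';

echo add_nofollow_sketch($html, array('google.com', 'mypage.com')), "\n";
```

Only the second link should come back changed: it keeps target and title and gains rel="nofollow", while the whitelisted link and the link that already has a rel attribute are left alone.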
If you want to overwrite rel no matter what, I would use a preg_replace_callback approach where in the callback the rel attribute is replaced separately:
$subject = preg_replace_callback('%(<a\s[^>]*href="https?://(?:(?!(?:(?:www\.)?'.implode('|(?:www\.)?', $follow_list).'))[^"]+)"[^>]*)>%', function($m) {
    return preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1(\s|$)%', ' ', $m[1]).' rel="nofollow">';
}, $subject);
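Here is a self-contained sketch of the callback approach for experimenting. The wrapper name force_nofollow_sketch() and the sample HTML are assumptions, and compared with the snippet above it additionally runs the hosts through preg_quote() and trims trailing whitespace before appending the new attribute.

```php
<?php
// Hypothetical wrapper around the preg_replace_callback idea.
function force_nofollow_sketch($subject, array $follow_list) {
    // Assumption: escape hosts so "." only matches a literal dot.
    $hosts = implode('|', array_map(function ($h) {
        return preg_quote($h, '%');
    }, $follow_list));

    // Capture the whole opening tag (without the closing ">") of every
    // http(s) link whose host is not whitelisted.
    $pattern = '%(<a\s[^>]*href="https?://(?!(?:www\.)?(?:' . $hosts . '))[^"]+"[^>]*)>%i';

    return preg_replace_callback($pattern, function ($m) {
        // Remove an existing rel="..." or rel='...', then append the new one.
        $tag = preg_replace('%\srel\s*=\s*(["\'])(?:(?!\1).)*\1%i', '', $m[1]);
        return rtrim($tag) . ' rel="nofollow">';
    }, $subject);
}

$html = '<a href="http://example.net/" rel="author" title="n">N</a> '
      . '<a href="https://mypage.com/about" rel="me">M</a>';

echo force_nofollow_sketch($html, array('google.com', 'mypage.com')), "\n";
```

With this input the first link loses rel="author" and gets rel="nofollow" instead, while the whitelisted mypage.com link keeps its own rel value.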