How to add rel="nofollow" to links with preg_replace()


The function below is designed to apply rel="nofollow" attributes to all external links and no internal links unless the path matches a predefined root URL de

7 answers
  • 2020-12-09 20:20

    Try this one (PHP 5.3+). It can:

    • skip links that contain a given address
    • leave a manually set rel attribute untouched

    The code:

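    function nofollow($html, $skip = null) {
        return preg_replace_callback(
            // \s after "<a" keeps tags such as <abbr> or <area> from matching
            '#(<a\s[^>]*?)>#is',
            function ($match) use ($skip) {
                // leave the tag alone if it contains the skipped address
                $skipped = $skip && strpos($match[1], $skip) !== false;
                // or if a rel attribute was already set by hand
                $hasRel = stripos($match[1], 'rel=') !== false;
                return ($skipped || $hasRel) ? $match[0] : $match[1] . ' rel="nofollow">';
            },
            $html
        );
    }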
    

    Examples:

    echo nofollow('<a href="link somewhere" rel="something">something</a>');
    // unchanged, because the anchor already contains a rel attribute
    
    echo nofollow('<a href="http://www.cnn.com">something</a>');
    // adds rel="nofollow" to the anchor
    
    echo nofollow('<a href="http://localhost">something</a>', 'localhost');
    // unchanged, because the link is skipped as internal
    
  • 2020-12-09 20:24

    Here is the DOMDocument solution...

    $str = '<a href="http://localhost/mytest/">internal</a>
    
    <a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
    
    <a href="http://cnn.com" rel="me">external</a>
    
    <a href="http://google.com">external</a>
    
    <a href="http://example.com" rel="nofollow">external</a>
    
    <a href="http://stackoverflow.com" rel="junk in the rel">external</a>
    ';
    $dom = new DOMDocument();
    
    $dom->preserveWhiteSpace = FALSE;
    
    $dom->loadHTML($str);
    
    $a = $dom->getElementsByTagName('a');
    
    $host = strtok($_SERVER['HTTP_HOST'], ':');
    
    foreach($a as $anchor) {
            // getAttribute() returns '' instead of failing when an anchor has no href
            $href = $anchor->getAttribute('href');
    
            if (preg_match('/^https?:\/\/' . preg_quote($host, '/') . '/', $href)) {
               continue;
            }
    
            $noFollowRel = 'nofollow';
            $oldRelAtt = $anchor->attributes->getNamedItem('rel');
    
            if ($oldRelAtt == NULL) {
                $newRel = $noFollowRel;
            } else {
                $oldRel = $oldRelAtt->nodeValue;
                $oldRel = explode(' ', $oldRel);
                if (in_array($noFollowRel, $oldRel)) {
                    continue;
                }
                $oldRel[] = $noFollowRel;
                $newRel = implode(' ', $oldRel);
            }
    
            // setAttribute() creates the rel attribute or overwrites the existing one
            $anchor->setAttribute('rel', $newRel);
    
    }
    
    var_dump($dom->saveHTML());
    

    Output

    string(509) "<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body>
    <a href="http://localhost/mytest/">internal</a>
    
    <a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
    
    <a href="http://cnn.com" rel="me nofollow">external</a>
    
    <a href="http://google.com" rel="nofollow">external</a>
    
    <a href="http://example.com" rel="nofollow">external</a>
    
    <a href="http://stackoverflow.com" rel="junk in the rel nofollow">external</a>
    </body></html>
    "
    
  • 2020-12-09 20:26

    Using regular expressions to do this job properly would be quite complicated. It would be easier to use an actual parser, such as the one from the DOM extension. DOM isn't very beginner-friendly, so what you can do is load the HTML with DOM then run the modifications with SimpleXML. They're backed by the same library, so it's easy to use one with the other.

    Here's what it can look like:

    $my_folder = 'http://localhost/mytest/go/';
    $blog_url = 'http://localhost/mytest';
    
    $html = '<html><body>
    <a href="http://localhost/mytest/">internal</a>
    <a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
    <a href="http://cnn.com">external</a>
    </body></html>';
    
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    
    $sxe = simplexml_import_dom($dom);
    
    // grab all <a> nodes with an href attribute
    foreach ($sxe->xpath('//a[@href]') as $a)
    {
        if (substr($a['href'], 0, strlen($blog_url)) === $blog_url
         && substr($a['href'], 0, strlen($my_folder)) !== $my_folder)
        {
            // skip all links that start with the URL in $blog_url, as long as they
            // don't start with the URL from $my_folder;
            continue;
        }
    
        if (empty($a['rel']))
        {
            $a['rel'] = 'nofollow';
        }
        else
        {
            $a['rel'] .= ' nofollow';
        }
    }
    
    $new_html = $dom->saveHTML();
    echo $new_html;
    

    As you can see, it's really short and simple. Depending on your needs, you may want to use preg_match() in place of the substr() comparisons, for example:

        // change the regexp to your own rules, here we match everything under
        // "http://localhost/mytest/" as long as it's not followed by "go"
        if (preg_match('#^http://localhost/mytest/(?!go)#', $a['href']))
        {
            continue;
        }
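
    The lookahead needs some care: (?!go) also rejects any path that merely starts with "go", such as /mytest/gopher. Below is a minimal sketch of a stricter variant. The function name is_plain_internal() and the sample URLs are made up for illustration; the root and folder are the ones used in this answer.

    ```php
    <?php
    // Hypothetical helper: true for links under the blog root that are
    // not inside the "go" folder (the cloaked links).
    function is_plain_internal($url)
    {
        // (?!go(?:/|$)) rejects "go" and "go/..." but still allows "gopher"
        return preg_match('#^http://localhost/mytest/(?!go(?:/|$))#', $url) === 1;
    }

    $urls = array(
        'http://localhost/mytest/',
        'http://localhost/mytest/go/hostgator',
        'http://localhost/mytest/gopher',
        'http://cnn.com',
    );

    foreach ($urls as $url) {
        // links that are not plain internal links would get rel="nofollow"
        echo $url, ' => ', is_plain_internal($url) ? 'leave as is' : 'add nofollow', "\n";
    }
    ```

    Inside the loop from this answer, the check would become if (is_plain_internal((string) $a['href'])) { continue; }.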
    

    Note

    I missed the last code block in the OP when I first read the question. The code I posted (and basically any solution based on DOM) is better suited to processing a whole page than an HTML fragment. Otherwise, DOM will attempt to "fix" your HTML and may add a <body> tag, a DOCTYPE, etc.
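
    If you do need to process a fragment, one possible workaround is to wrap it in a temporary <div>, load it with the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD options so that no <html>, <body> or DOCTYPE is added, and then serialize only the children of that wrapper. This is a sketch under assumptions, not a drop-in replacement: it assumes PHP 5.4+ built against a libxml version that supports both flags, and nofollow_fragment() and its $internalPrefix parameter are hypothetical names.

    ```php
    <?php
    // Hypothetical helper: adds rel="nofollow" to external links in an HTML fragment.
    function nofollow_fragment($html, $internalPrefix)
    {
        $dom = new DOMDocument();

        // Collect parser warnings instead of printing them.
        $previous = libxml_use_internal_errors(true);
        // The temporary <div> gives the parser a single root element.
        // Note: without an encoding hint, loadHTML() may misread non-ASCII text.
        $dom->loadHTML(
            '<div>' . $html . '</div>',
            LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
        );
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($dom->getElementsByTagName('a') as $a) {
            $href = $a->getAttribute('href');
            // Skip anchors without href and links that start with the internal prefix.
            if ($href === '' || strpos($href, $internalPrefix) === 0) {
                continue;
            }
            // Split the existing rel value into tokens and append nofollow once.
            $rel = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
            if (!in_array('nofollow', $rel, true)) {
                $rel[] = 'nofollow';
                $a->setAttribute('rel', implode(' ', $rel));
            }
        }

        // Serialize only the children of the wrapper, dropping the <div> itself.
        $out = '';
        foreach ($dom->documentElement->childNodes as $child) {
            $out .= $dom->saveHTML($child);
        }
        return $out;
    }

    echo nofollow_fragment(
        '<a href="http://localhost/mytest/">internal</a> <a href="http://cnn.com" rel="me">external</a>',
        'http://localhost/mytest'
    );
    ```

    The prefix test here is deliberately simple; the $blog_url / $my_folder rules from the code above could be substituted for it.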

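  • 2020-12-09 20:32
    <?php
    
    $str='<a href="http://localhost/mytest/">internal</a>
    <a href="http://localhost/mytest/go/hostgator">internal cloaked link</a>
    <a href="http://cnn.com">external</a>';
    
    // $x[0] is the matched text: href=" plus the URL, without the closing quote
    function test($x){
      // internal links outside go/ are returned unchanged
      if (preg_match('@localhost/mytest/(?!go/)@i',$x[0])>0) return $x[0];
      // everything else gets rel="nofollow" inserted in front of the href
      return 'rel="nofollow" '.$x[0];
    }
    
    echo preg_replace_callback('/href=[\'"][^\'"]+/i', 'test', $str);
    
    ?>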
    
  • 2020-12-09 20:33

    Try to make it more readable first, and only afterwards make your if rules more complex:

    function save_rseo_nofollow($content) {
        $content["post_content"] =
        preg_replace_callback('~<(a\s[^>]+)>~isU', "cb2", $content["post_content"]);
        return $content;
    }
    
    function cb2($match) { 
        list($original, $tag) = $match;   // regex match groups
    
        $my_folder =  "/hostgator";       // re-add quirky config here
        $blog_url = "http://localhost/";
    
        // strpos() returns false when nothing is found, so compare strictly
        if (strpos($tag, "nofollow") !== false) {
            return $original;
        }
        elseif (strpos($tag, $blog_url) !== false && (!$my_folder || strpos($tag, $my_folder) === false)) {
            return $original;
        }
        else {
            return "<$tag rel='nofollow'>";
        }
    }
    

    Gives following output:

    [post_content] =>
      <a href="http://localhost/mytest/">internal</a>
      <a href="http://localhost/mytest/go/hostgator" rel='nofollow'>internal cloaked link</a>
      <a href="http://cnn.com" rel='nofollow'>external</a>
    

    The problem in your original code might have been $rseo which wasn't declared anywhere.
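
    If $rseo was meant to be a variable defined outside the function, keep in mind that PHP functions do not see variables from the enclosing scope unless they are passed in, imported with use, or declared global. Here is a small sketch of that behaviour, assuming PHP 5.3+ for closures; scope_demo(), make_nofollow_callback() and the sample markup are made up for illustration.

    ```php
    <?php
    $blog_url = 'http://localhost/';

    function scope_demo()
    {
        // $blog_url from the file scope is not visible inside a function
        return isset($blog_url) ? 'visible' : 'not visible';
    }

    // Hypothetical factory: the closure imports its configuration with use
    function make_nofollow_callback($blogUrl)
    {
        return function ($match) use ($blogUrl) {
            list($original, $tag) = $match;
            // leave tags that already have nofollow or point to the blog itself
            if (strpos($tag, 'nofollow') !== false || strpos($tag, $blogUrl) !== false) {
                return $original;
            }
            return "<$tag rel='nofollow'>";
        };
    }

    $html = '<a href="http://localhost/mytest/">internal</a> <a href="http://cnn.com">external</a>';

    echo scope_demo(), "\n";
    echo preg_replace_callback('~<(a\s[^>]+)>~i', make_nofollow_callback($blog_url), $html), "\n";
    ```

    Passing the configuration in this way avoids both global and hard-coded values inside the callback.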

  • 2020-12-09 20:39

    Here is another solution that supports a whitelist and can add a target="_blank" attribute. It also checks whether a rel attribute already exists before adding a new one.

    function Add_Nofollow_Attr($Content, $Whitelist = [], $Add_Target_Blank = true) 
    {
        $Whitelist[] = $_SERVER['HTTP_HOST'];
        foreach ($Whitelist as $Key => $Link) 
        {
            $Host = preg_replace('#^https?://#', '', $Link);
            // quote for the "#" delimiter used by preg_match() below and anchor at the start of the URL
            $Host = "^https?://" . preg_quote($Host, '#');
            $Whitelist[$Key] = $Host;
        }
    
        if(preg_match_all("/<a .*?>/", $Content, $matches, PREG_SET_ORDER)) 
        {
            foreach ($matches as $Anchor_Tag) 
            {
                $IS_Rel_Exist = $IS_Follow_Exist = $IS_Target_Blank_Exist = $Is_Valid_Tag =  false;
                if(preg_match_all("/(\w+)\s*=\s*['\"](.*?)['\"]/",$Anchor_Tag[0],$All_matches2))
                {
                    foreach ($All_matches2[1] as $Key => $Attr_Name)
                    {
                        if($Attr_Name == 'href')
                        {
                            $Is_Valid_Tag = true;
                            $Url = $All_matches2[2][$Key];
                            // bypass #.. or internal links like "/"
                            if(preg_match('/^\s*[#\/].*/', $Url))
                            {
                                continue 2;
                            }
    
                            foreach ($Whitelist as $Link) 
                            {
                                if (preg_match("#$Link#", $Url)) {
                                    continue 3;
                                }
                            }
                        }
                        else if($Attr_Name == 'rel')
                        {
                            $IS_Rel_Exist = true;
                            $Rel = $All_matches2[2][$Key];
                            preg_match("/[nd]ofollow/", $Rel, $match, PREG_OFFSET_CAPTURE);
                            if( count($match) > 0 )
                            {
                                $IS_Follow_Exist = true;
                            }
                            else
                            {
                                $New_Rel = 'rel="'. $Rel . ' nofollow"';
                            }
                        }
                        else if($Attr_Name == 'target')
                        {
                            $IS_Target_Blank_Exist = true;
                        }
                    }
                }
    
                $New_Anchor_Tag = $Anchor_Tag;
                if(!$IS_Rel_Exist)
                {
                    $New_Anchor_Tag = str_replace(">",' rel="nofollow">',$Anchor_Tag);
                }
                else if(!$IS_Follow_Exist)
                {
                    $New_Anchor_Tag = preg_replace("/rel=[\"'].*?[\"']/",$New_Rel,$Anchor_Tag);
                }
    
                if($Add_Target_Blank && !$IS_Target_Blank_Exist)
                {
                    $New_Anchor_Tag = str_replace(">",' target="_blank">',$New_Anchor_Tag);
                }
    
                $Content = str_replace($Anchor_Tag,$New_Anchor_Tag,$Content);
            }
        }
        return $Content;
    }
    

    To use it:

    $Page_Content = '<a href="http://localhost/">internal</a>
                     <a href="http://yoursite.com">internal</a>
                     <a href="http://google.com">google</a>
                     <a href="http://example.com" rel="nofollow">example</a>
                     <a href="http://stackoverflow.com" rel="random">stackoverflow</a>';
    
    $Whitelist = ["http://yoursite.com","http://localhost"];
    
    echo Add_Nofollow_Attr($Page_Content,$Whitelist,true);
    