Regular expression to replace a word with a link

I want to write a regular expression that replaces the word Paris with a link, but only where the word is not already part of a link.

Example:

    i\'m l


                      
7 answers

盖世英雄少女心 (2020-12-18 12:08)

  $pattern = 'Paris';
  $text = 'i\'m living <a href="Paris" atl="Paris link">in Paris</a>,  near Paris <a href="gare">Gare du Nord</a>,  i love Paris.';

  // Escape the keyword so it is safe to embed in a regex.
  $quoted = preg_quote($pattern, '@');
  $linkRegex = '@(<a[^>]*?>[^<]*?'.$quoted.'[^<]*?</a>)@';

  // 1. Collect two arrays:
  //  $matches[1] - existing links that contain the keyword
  //  $matches[2] - standalone occurrences of the keyword
  preg_match_all('@(<a[^>]*?>[^<]*?'.$quoted.'[^<]*?</a>)|(?<!\pL)('.$quoted.')(?!\pL)@', $text, $matches);

  // Each match fills only one of the two groups; drop the empty entries and reindex.
  $links = array_values(array_filter($matches[1]));

  // 2. Is there at least one keyword outside an <a> tag?
  if (in_array($pattern, $matches[2], true)) {

    // 3. Swap every existing link for a temporary placeholder.
    foreach ($links as $k => $tag) {
      $text = preg_replace($linkRegex, 'KEYWORD_IS_ALREADY_LINK_'.$k.'_', $text, 1);
    }

    // 4. Wrap the remaining keywords in a link.
    $text = preg_replace('@(?<!\pL)('.$quoted.')(?!\pL)@', '<a href="">'.$pattern.'</a>', $text);

    // 5. Put the original links back.
    foreach ($links as $k => $tag) {
      $text = str_replace('KEYWORD_IS_ALREADY_LINK_'.$k.'_', $tag, $text);
    }

    // It works!
    echo $text;
  }
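
For comparison, here is a minimal Python sketch of the same mask, replace, restore idea. It is not the author's code: the function name `link_keyword`, the `href` parameter and the NUL-delimited placeholder are assumptions made for illustration, and `\w`-based lookarounds stand in for PHP's `\pL`. Like the PHP version, it only recognises links whose body contains no nested tags, and it does not protect a keyword that appears only in a tag attribute.

```python
import re


def link_keyword(text, keyword, href):
    """Link standalone occurrences of keyword, leaving existing links alone."""
    kw = re.escape(keyword)
    # Same restriction as the PHP pattern: the link body may not contain tags.
    link_re = re.compile(r"<a[^>]*>[^<]*" + kw + r"[^<]*</a>")
    saved = []

    def stash(match):
        # Remember the link and leave a numbered placeholder in its place.
        # Assumption: the input text never contains NUL characters.
        saved.append(match.group(0))
        return "\x00%d\x00" % (len(saved) - 1)

    masked = link_re.sub(stash, text)
    new_link = '<a href="%s">%s</a>' % (href, keyword)
    # Wrap every remaining standalone keyword.
    linked = re.sub(r"(?<!\w)" + kw + r"(?!\w)", lambda m: new_link, masked)
    # Put the original links back.
    return re.sub(r"\x00(\d+)\x00", lambda m: saved[int(m.group(1))], linked)
```

Calling `link_keyword(text, "Paris", "/paris")` on the sample string should leave the existing "in Paris" link untouched and wrap the other two occurrences.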

                                                                        
                                                        
            
            
              
                
太阳男子 (2020-12-18 12:08)

Regexes don't replace. Languages do.

Languages and libraries would also read from the database or file that holds the list of words you care about, and associate a URL with each name. Here's the easiest substitution I can imagine with a single regex (Perl is used for the replacement syntax):

s/([a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/i


Proper names might work better:

s/([A-Z][a-z-']+)/<a href="http:\/\/en.wikipedia.org\/wiki\/$1">$1<\/a>/g;


Of course, "Baton Rouge" would become two links:

<a href="http://en.wikipedia.org/wiki/Baton">Baton</a> 
<a href="http://en.wikipedia.org/wiki/Rouge">Rouge</a>


In Perl, you can do this:

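# Longest names first, so that "Baton Rouge" is tried before "Baton";
# quotemeta keeps any regex metacharacters in the names literal.
my $barred_list_of_cities 
    = join( '|'
    , map  { quotemeta($_) }
      sort { ( length($b) <=> length($a) ) || ( $a cmp $b ) } keys %url_for_city_of
    );
s/($barred_list_of_cities)/<a href="$url_for_city_of{$1}">$1<\/a>/g;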


But again, it's the language that implements a set of operations for regexes; regexes themselves don't do a thing. (In reality, this is such a common application that I'd be surprised if there isn't a CPAN module out there somewhere that does this, so that you just need to load the hash.)
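
As a rough Python analogue of the Perl hash lookup above, the sketch below builds one alternation from a dictionary and resolves the URL in a replacement callback. The dictionary contents, the names `url_for_city` and `link_cities`, and the URLs are illustrative assumptions, not data from this thread. Like the Perl version, it makes no attempt to skip names that already sit inside a link.

```python
import re

# Hypothetical lookup table; the names and URLs are illustrative only.
url_for_city = {
    "Paris": "http://en.wikipedia.org/wiki/Paris",
    "Baton": "http://en.wikipedia.org/wiki/Baton",
    "Baton Rouge": "http://en.wikipedia.org/wiki/Baton_Rouge",
}

# Longest names first, so "Baton Rouge" is tried before "Baton".
names = sorted(url_for_city, key=lambda name: (-len(name), name))
city_re = re.compile("|".join(re.escape(name) for name in names))


def link_cities(text):
    """Wrap every known city name in a link to its URL."""
    def replace(match):
        name = match.group(0)
        return '<a href="%s">%s</a>' % (url_for_city[name], name)
    return city_re.sub(replace, text)
```

Sorting by descending length matters because regex alternation takes the first alternative that matches, not the longest one.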
                                                                        
                                                        
            
            
              
                
耶瑟儿～ (2020-12-18 12:09)

Regular expression:

!(<a.*</a>.*)*Paris!isU


Replacement:

$1<a href="Paris">Paris</a>


$1 refers to the first sub-pattern (at least in PHP). Depending on the language you use, the syntax could be slightly different.

This should replace the occurrences of "Paris" with the link given in the replacement. It just checks whether every opening a tag was closed before "Paris".

PHP example:

<?php
$s = 'i\'m living <a href="Paris" atl="Paris link">in Paris</a>, near Paris <a href="gare">Gare du Nord</a>, i love Paris.'; 
$regex = '!(<a.*</a>.*)*Paris!isU'; 
$replace = '$1<a href="Paris">Paris</a>'; 
$result = preg_replace( $regex, $replace, $s); 
?>


Addition:

This is not the best solution. One situation where this regex won't work is an img tag that is not inside an a element: if the title attribute of that image is set to "Paris", this "Paris" will be replaced too, and that's not what you want. Another is a link containing "Paris" that is not followed by a further "Paris" later in the text: the optional group then cannot match, so the "Paris" inside the link gets wrapped as well.
Nevertheless, I see no way to solve your problem completely with a simple regular expression.
                                                                        
                                                        
            
            
              
                
不知归路 (2020-12-18 12:15)

This is hard to do in one step. Writing a single regex that does that is virtually impossible.

Try a two-step approach.


1. Put a link around every "Paris" there is, regardless of whether a link is already present.
2. Find all incorrectly nested links (<a href="..."><a href="...">Paris</a></a>), and eliminate the inner link.


Regex for step one is dead-simple:

\bParis\b


Regex for step two is slightly more complex:

(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>


Use that one on the whole string and replace each match with the content of match groups 1 and 2, effectively removing the surplus inner link.

Explanation of regex #2 in plain words:


1. Find a link's opening tag (<a[^>]+>), followed by as little text as possible that does not contain a closing link tag ((?:(?!</a>).)*?). Save both into match group 1.
2. Now look for the next opening link tag (<a[^>]+>). Make sure it is there, but do not save it.
3. Now look for the word Paris. Save it into match group 2.
4. Look for a closing link tag (</a>). Make sure it is there, but don't save it.
5. Replace everything with the content of groups 1 and 2, thereby losing everything you did not save.


The approach assumes these side conditions:


- Your input HTML is not horribly broken.
- Your regex flavor supports non-greedy quantifiers (*?) and zero-width negative look-ahead assertions ((?!...)).
- You wrap the word "Paris" only in a link in step 1, with no additional characters. Every "Paris" becomes "<a href="...">Paris</a>", or step two will fail (until you change the second regex).
- The word "Paris" does not occur inside a tag, for example in an attribute value, because step 1 would wrap that occurrence too.

BTW: regex #2 explicitly allows for constructs like this:

<a href="">in the <b>capital of France</b>, <a href="">Paris</a></a>

The surplus inner link comes from step one; the result of the step-two replacement will be:

<a href="">in the <b>capital of France</b>, Paris</a>
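
The two steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `link_paris` and the `href` parameter are made up for the example, the step-two pattern uses a per-character look-ahead so that group 1 can never run past a closing </a>, and the input is assumed to contain "Paris" only in text, never in attribute values.

```python
import re

# Step 1: every standalone "Paris".
STEP1 = re.compile(r"\bParis\b")
# Step 2: an opening <a ...>, text with no </a> in it, then a nested Paris link.
STEP2 = re.compile(r"(<a[^>]+>(?:(?!</a>).)*?)<a[^>]+>(Paris)</a>", re.DOTALL)


def link_paris(html, href):
    """Wrap every Paris in a link, then unwrap the ones that ended up nested."""
    new_link = '<a href="%s">Paris</a>' % href
    wrapped = STEP1.sub(lambda m: new_link, html)
    # Keep group 1 (outer link start and text) and group 2 (the bare word).
    return STEP2.sub(lambda m: m.group(1) + m.group(2), wrapped)
```

On text where a link containing "Paris" is followed by plain occurrences, the outer link should survive unchanged and only the plain occurrences should end up wrapped.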

                                                                        
                                                        
            
            
              
                
旧巷少年郎 (2020-12-18 12:18)

If you aren't limited to regular expressions in this case, XSLT is a good choice of language for defining this replacement, because it 'understands' XML.

You define two templates:

- One template finds links and removes those links that don't have "Paris" as the body text.
- Another template finds everything else, splits it into words and adds tags.
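
The sketch below is not XSLT. It only illustrates the same markup-aware idea with Python's standard `html.parser`: text is rewritten only while the parser is outside any a element, and tags are echoed back verbatim, so attribute values are never touched. The class name `Linker`, the function `link_word` and the `href` value are assumptions for illustration, and comments and doctype declarations are simply dropped in this minimal version.

```python
import re
from html.parser import HTMLParser


class Linker(HTMLParser):
    """Re-emit HTML, linking a word only in text that is outside <a> elements."""

    def __init__(self, word, href):
        super().__init__(convert_charrefs=False)
        self.word_re = re.compile(r"\b%s\b" % re.escape(word))
        self.link = '<a href="%s">%s</a>' % (href, word)
        self.a_depth = 0  # how many <a> elements are currently open
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.a_depth += 1
        self.out.append(self.get_starttag_text())  # raw tag, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.a_depth > 0:
            self.a_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if self.a_depth == 0:
            data = self.word_re.sub(lambda m: self.link, data)
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def link_word(html, word, href):
    parser = Linker(word, href)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Because the decision is made per text node rather than per regex match, a "Paris" inside href or title attributes is left alone without any extra pattern work.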
                                                                        
                                                        
            
            
              
                
小蘑菇 (2020-12-18 12:28)

You could search for this regular expression:

(<a[^>]*>.*?</a>)|Paris


This regex matches a link, which it captures into the first (and only) capturing group, or the word Paris.

Replace the match with your link only if the capturing group did not match anything.

E.g. in C#:

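// Requires: using System.Text.RegularExpressions;
resultString = 
    Regex.Replace(
        subjectString, 
        "(<a[^>]*>.*?</a>)|Paris", 
        new MatchEvaluator(ComputeReplacement));

public String ComputeReplacement(Match m) {
    // Group 1 matched: this is an existing link, so return it unchanged.
    if (m.Groups[1].Success) {
        return m.Groups[1].Value;
    } else {
        // Otherwise the match is a bare "Paris": wrap it in a link.
        return "<a href=\"link to paris\">Paris</a>";
    }
}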
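
The same match-or-skip technique can be sketched in Python, where `re.sub` accepts a function as the replacement. The regex is the one given above; the function name `link_paris` and the default `href` value are assumptions for illustration only.

```python
import re

# Alternative 1 captures a whole existing link; alternative 2 is the bare word.
LINK_OR_WORD = re.compile(r"(<a[^>]*>.*?</a>)|Paris", re.DOTALL)


def link_paris(text, href="link to paris"):
    """Link every Paris that is not already inside an <a>...</a> element."""
    def compute_replacement(match):
        if match.group(1) is not None:
            # An existing link was matched: return it unchanged.
            return match.group(1)
        # Otherwise the bare word was matched: wrap it.
        return '<a href="%s">Paris</a>' % href
    return LINK_OR_WORD.sub(compute_replacement, text)
```

Because the link alternative is tried first and consumes the whole element, any "Paris" inside an existing link, including one in its href, is never seen by the second alternative.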

                                                                        
                                                        
            
            
              
                