Perl regex matching optional phrase in longer sentence

北海茫月 2021-01-01 05:28

I'm trying to match an optional (possibly present) phrase in a sentence:

perl -e '$_="word1 word2 word3"; print "1:$1 2:$2 3:$3\n" if m/(word1).*(word


      
      
        
3 answers

谎友^ (original poster) 2021-01-01 06:14

#BACKGROUND: HOW LAZY AND GREEDY QUANTIFIERS WORK
You need to understand how greedy and lazy quantifiers work. A greedy quantifier first grabs as much text as its subpattern can match. If the rest of the pattern then fails, the engine backtracks: it gives back the greedily matched text one character at a time, checking at each step whether the next subpattern can match.
A lazy quantifier matches as few characters as possible first and then tries the rest of the pattern. With *?, it starts by matching zero characters (an empty string) and checks whether the next subpattern can match. Only if it cannot is the lazy subpattern "expanded" to include one more character, and so on.
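As a small illustration of the difference described above, here is a sketch using Python's re module. Python is an assumption made for this example (the question itself uses Perl); the simplified patterns (.*)word and (.*?)word are made up for the demonstration and are not part of the original question.

```python
import re

text = "word1 word2 word3"

# Greedy: .* first runs to the end of the string, then backtracks
# just far enough for "word" to match, so it stops before the LAST "word".
greedy = re.match(r"(.*)word", text)

# Lazy: .*? starts with zero characters and only expands when the
# rest of the pattern fails, so it stops before the FIRST "word".
lazy = re.match(r"(.*?)word", text)

print(repr(greedy.group(1)))  # 'word1 word2 '
print(repr(lazy.group(1)))    # ''
```

The greedy group ends up holding everything before the last "word", while the lazy group stays empty because "word" already matches at the very start.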
So, (word1).*(word2)?.*(word3) consumes word2 with the first .* (and the second .* matches an empty string, because the first .* is greedy). You might think that (word2)? is greedy too and so the engine must backtrack into it, but it does not: the first .* grabbed the whole rest of the string, and then the engine worked backwards looking for a match. Since (word2)? can match an empty string, it always succeeds, so the first thing found on the way back from the end of the string is word3. See this demo and check the regex debugger section.
You thought: let's use lazy matching with the first .*?. The problem with (word1).*?(word2)?.*(word3) (where word2 is consumed by the second, greedy .*) is a bit different, because this pattern could match the optional group. How? The first .*? matches zero characters, and the engine then tries all the subsequent subpatterns. It has matched word1, then an empty string, and it does not find word2 right after word1, so (word2)? settles for an empty string and the greedy .* takes the rest. If word2 came right after word1, the optional group would capture it. See this demo.
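The behaviour of both patterns can be checked with a short script. This is a sketch in Python's re module, chosen as an assumption for the example (Perl reports the unmatched group as undef where Python reports None); the input "word1word2 word3" is a made-up string used to show the adjacent case.

```python
import re

text = "word1 word2 word3"

# Greedy first .*: it swallows "word2" before the optional group is tried.
greedy = re.search(r"(word1).*(word2)?.*(word3)", text)

# Lazy first .*?: it matches nothing, the optional group fails at the space
# and settles for the empty string, and the second .* swallows "word2".
lazy = re.search(r"(word1).*?(word2)?.*(word3)", text)

print(greedy.groups())  # ('word1', None, 'word3')
print(lazy.groups())    # ('word1', None, 'word3')

# With word2 directly after word1, the lazy version does capture it.
adjacent = re.search(r"(word1).*?(word2)?.*(word3)", "word1word2 word3")
print(adjacent.groups())  # ('word1', 'word2', 'word3')
```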
#SOLUTION
There are two solutions that I see at this moment, and they both consist in making the second optional group "exclusive" for the rest of the pattern, so that the regex engine could not skip it if found.

1. A branch reset solution, provided by Casimir above. Its disadvantage is that it cannot be ported to the many regex flavors that do not support branch reset. See the description in the original answer.
2. Use a tempered greedy token: (word1)(?:(?!word2).)*(word2)?.*?(word3). It is less efficient than the branch reset solution, but it can be ported to JS, Python, and most other regex flavors that support lookaheads. How does it work? (?:(?!word2).)* matches zero or more characters (any character other than a newline; with /s, newlines too), none of which starts the literal sequence word2. In other words, a w can be consumed only if it is not followed by ord2. When the construct reaches word2, it stops and lets the next subpattern, (word2)?, match and capture it. To make this approach more efficient, use the unroll-the-loop technique: (word1)[^w]*(?:w(?!ord2)[^w]*)*(word2)?.*?(word3).
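Since the answer says the tempered greedy token ports to Python, the two patterns can be tried there directly. The following is a minimal sketch, not the original author's code; the names TEMPERED, UNROLLED and samples, and the third sample string, are made up for illustration.

```python
import re

# Tempered greedy token, copied from the answer.
TEMPERED = re.compile(r"(word1)(?:(?!word2).)*(word2)?.*?(word3)")
# Unrolled variant, copied from the answer.
UNROLLED = re.compile(r"(word1)[^w]*(?:w(?!ord2)[^w]*)*(word2)?.*?(word3)")

samples = [
    "word1 word2 word3",               # optional phrase present
    "word1 word3",                     # optional phrase absent
    "word1 and wow word2 then word3",  # other w's before word2
]

for pattern in (TEMPERED, UNROLLED):
    for s in samples:
        print(pattern.search(s).groups())
# Both patterns print, in order:
# ('word1', 'word2', 'word3')
# ('word1', None, 'word3')
# ('word1', 'word2', 'word3')
```

One difference to keep in mind: [^w] in the unrolled version also matches newlines, whereas . in the tempered version does not unless the dot-all flag (/s in Perl, re.DOTALL in Python) is set.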

    
             
                                                        
            
            
              
                