Non-greedy string regular expression matching


独厮守ぢ 2020-11-28 10:21

I'm pretty sure I'm missing something obvious here, but I cannot make R use non-greedy regular expressions:

> library(stringr)
> str_match('xxx a


      
      
        
2 answers

时光说笑 (original poster) 2020-11-28 11:00
              

            
            
                        
The problem is matching the shortest window between two strings. @flodel correctly mentions that a regex engine parses the string from left to right, so every match is the leftmost one available. Greediness and laziness only affect the right-hand boundary: a greedy quantifier extends the match to the rightmost possible boundary, while a lazy one stops at the first occurrence of the subpattern that follows.

See the examples:

> library(stringr)
> str_extract('xxx aaaab yyy', "a[^ab]*b")
[1] "ab"
> str_extract('xxx aaa xxx aaa zzz', "xxx.*?zzz")
[1] "xxx aaa xxx aaa zzz"
> str_extract('xxx aaa xxx aaa zzz', "xxx(?:(?!xxx|zzz).)*zzz")
[1] "xxx aaa zzz"


The first and third examples return the shortest window. The second illustrates the current problem, but with multicharacter boundaries.
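
To make the point about the right-hand boundary concrete, here is a minimal base R sketch. The input string s and the helper first_match are made up for illustration and are not from the question. Both quantifiers start at the same leftmost xxx; only the end of the match differs.

```r
# Hypothetical input with two "xxx" starts and two "zzz" ends.
s <- "xxx aaa xxx aaa zzz bbb zzz"

# Hypothetical helper: first match of a PCRE pattern plus its start position.
first_match <- function(pattern, text) {
  m <- regexpr(pattern, text, perl = TRUE)
  list(start = as.integer(m), text = regmatches(text, m))
}

greedy <- first_match("xxx.*zzz", s)   # runs to the last "zzz"
lazy   <- first_match("xxx.*?zzz", s)  # stops at the first "zzz"

print(greedy$text)  # "xxx aaa xxx aaa zzz bbb zzz"
print(lazy$text)    # "xxx aaa xxx aaa zzz"
print(c(greedy$start, lazy$start))  # both 1: the match is leftmost either way
```

Neither result is the shortest window, because neither quantifier can move the left boundary.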

Scenario 1. Boundaries are single characters

When a and b are single characters, the shortest window can be found with a negated character class: a[^ab]*b grabs the substring from an a to the next b, with no a or b in between.
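
As a rough sketch of this scenario in base R, the snippet below extracts every shortest a...b window from a made-up string and contrasts it with a lazy quantifier. The input s is an assumption for illustration only.

```r
# Hypothetical input containing several a...b candidates.
s <- "xxx aaaab yyy a--b aab"

# Negated class: an "a", then anything except "a" or "b", then the next "b".
shortest <- regmatches(s, gregexpr("a[^ab]*b", s, perl = TRUE))[[1]]

# Lazy dot for comparison: still starts at the leftmost "a" each time.
lazy <- regmatches(s, gregexpr("a.*?b", s, perl = TRUE))[[1]]

print(shortest)  # "ab"    "a--b"  "ab"
print(lazy)      # "aaaab" "a--b"  "aab"
```

The negated class rejects any attempt that would have to cross another a, which is what pushes the match start to the last a before each b.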

Scenario 2. Boundaries are not single characters

In these cases you may use a tempered greedy token, which can be further unrolled. The xxx(?:(?!xxx|zzz).)*zzz pattern matches xxx, then zero or more characters other than line breaks, each of which is not the start of an xxx or zzz sequence (the (?!xxx|zzz) is a negative lookahead that fails the match if the text immediately to the right matches the lookahead pattern), and then zzz.
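
The answer does not spell out the unrolled form, so the pattern below is only one possible unrolling, written here as an assumption. It consumes runs of characters other than x and z without any lookahead and only checks the lookahead at an x or z. Unlike the dot, [^xz] also matches line breaks, so the two patterns can differ on multi-line input. The helper extract_all and the string y are hypothetical.

```r
# Tempered greedy token from the answer.
tempered <- "xxx(?:(?!xxx|zzz).)*zzz"
# One possible unrolling (an assumption, not taken from the original answer).
unrolled <- "xxx[^xz]*(?:(?!xxx|zzz)[xz][^xz]*)*zzz"

# Hypothetical helper: all PCRE matches of a pattern in one string.
extract_all <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

x <- "xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz"
y <- "xxx a x z b zzz"  # lone "x" and "z" inside the window

print(extract_all(tempered, x))  # "xxx aaa zzz" "xxx ccc zzz"
print(extract_all(unrolled, x))  # "xxx aaa zzz" "xxx ccc zzz"
print(identical(extract_all(tempered, y), extract_all(unrolled, y)))  # TRUE
```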

These matching scenarios can be easily used with base R regmatches (using a PCRE regex flavor that supports lookaheads):

> x <- 'xxx aaa xxx aaa zzz xxx bbb xxx ccc zzz'
> unlist(regmatches(x, gregexpr("xxx(?:(?!xxx|zzz).)*zzz", x, perl = TRUE)))
[1] "xxx aaa zzz" "xxx ccc zzz"


One note: when using a PCRE regex in base R, or the ICU regex in str_extract/str_match, the . does not match line break characters. To enable that behavior, add (?s), an inline DOTALL modifier, at the start of the pattern.
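
A small base R sketch of the effect of (?s), using PCRE only and a made-up multi-line string (the input s is an assumption for illustration):

```r
# Hypothetical multi-line input: the shortest window spans a line break.
s <- "xxx aaa\nxxx bbb\nzzz"

plain  <- "xxx(?:(?!xxx|zzz).)*zzz"
dotall <- paste0("(?s)", plain)  # inline DOTALL modifier at the pattern start

# Without (?s) the dot stops at "\n", so no window reaches "zzz".
no_match   <- regmatches(s, regexpr(plain, s, perl = TRUE))
# With (?s) the dot crosses the line break.
with_match <- regmatches(s, regexpr(dotall, s, perl = TRUE))

print(length(no_match))  # 0
print(with_match)        # "xxx bbb\nzzz"
```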
    
             
                                                        
            
            
              
                