Determine regular expression's specificity


傲寒 2020-12-18 01:31

Given the following regular expressions:

 - alice@[a-z]+\\.[a-z]+
 - [a-z]+@[a-z]+\\.[a-z]+
 - .*

The string alice@myprovider.com


      
      
        
4 answers

天命终不由人 (OP) 2020-12-18 02:26

This is a bit of a hack, but it could provide a practical solution to this question asked nearly 10 years ago.
As pointed out by @torak, there are difficulties in defining what it means for one regular expression to be more specific than another.
My suggestion is to look at how stable the regular expression is with respect to a string that matches it. The usual way to investigate stability is to make minor changes to the inputs, and see if you still get the same result.
For example, the string alice@myprovider.com matches the regex /alice@myprovider\.com/, but if you make any change to the string, it will not match. So this regex is very unstable. But the regex /.*/ is very stable, because you can make any change to the string, and it still matches.
So, in looking for the most specific regex, we are looking for the least stable one with respect to a string that matches it.
In order to implement this test for stability, we need to define how we choose a minor change to the string that matches the regex. This is another can of worms. We could, for example, change each character of the string to something random and test that against the regex, or make any number of other possible choices. For simplicity, I suggest deleting one character at a time from the string and testing that.
So, if the string that matches is N characters long, we have N tests to make. Let's look at deleting one character at a time from the string alice@foo.com, which matches all of the regular expressions in the table below. It's 13 characters long, so there are 13 tests. In the table below,

0 means the regex does not match (unstable),
1 means it matches (stable)

              /alice@[a-z]+\.[a-z]+/    /[a-z]+@[a-z]+\.[a-z]+/     /.*/

lice@foo.com           0                           1                  1
aice@foo.com           0                           1                  1
alce@foo.com           0                           1                  1
alie@foo.com           0                           1                  1
alic@foo.com           0                           1                  1
alicefoo.com           0                           0                  1
alice@oo.com           1                           1                  1
alice@fo.com           1                           1                  1
alice@fo.com           1                           1                  1
alice@foocom           0                           0                  1
alice@foo.om           1                           1                  1
alice@foo.cm           1                           1                  1
alice@foo.co           1                           1                  1
                      ---                         ---                ---
total score:           6                          11                 13

The regex with the lowest score is the most specific. Of course, in general, there may be more than one regex with the same score, which reflects the fact that some regular expressions are, by any reasonable measure, as specific as one another. The test may also yield the same score for regular expressions that one can easily argue are not equally specific (if you can think of an example, please comment).
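The deletion test described above is simple to script. Here is a minimal sketch in Python, assuming that "matches" means the regex matches the whole string (`re.fullmatch`); the function names `deletion_score` and `rank_by_specificity` are made up for this illustration and are not part of the original answer.

```python
import re


def deletion_score(pattern: str, text: str) -> int:
    """Count the single-character deletions of `text` that still match `pattern`."""
    regex = re.compile(pattern)
    score = 0
    for i in range(len(text)):
        variant = text[:i] + text[i + 1:]  # drop the character at index i
        if regex.fullmatch(variant):       # assumption: whole-string match
            score += 1
    return score


def rank_by_specificity(patterns, text):
    """Return (score, pattern) pairs, lowest score (most specific) first.

    Patterns that do not match `text` in the first place are skipped.
    """
    matching = [p for p in patterns if re.fullmatch(p, text)]
    return sorted((deletion_score(p, text), p) for p in matching)


patterns = [r"alice@[a-z]+\.[a-z]+", r"[a-z]+@[a-z]+\.[a-z]+", r".*"]
ranking = rank_by_specificity(patterns, "alice@foo.com")
for score, pattern in ranking:
    print(score, pattern)
```

Ties are possible, so a caller that needs a single winner would have to pick its own tie-break rule.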
But coming back to the question asked by @torak, which of these is more specific:
alice@[a-z]+\.[a-z]+ 
[a-z]+@myprovider.com

We could argue that the second is more specific because it constrains more characters, and the above test will agree with that view.
As I said, the way we choose to make minor changes to the string that matches more than one regex is a can of worms, and the answer that the above method yields may depend on that choice. This is an easily implementable hack; it is not rigorous.
And, of course, the method breaks if the string that matches is empty. The usefulness of the test will increase as the length of the string increases. With very short strings, it is more likely to produce equal scores for regular expressions that are clearly different in their specificity.
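To see how much the choice of "minor change" matters, here is a rough sketch that scores the two expressions from @torak's comparison against alice@myprovider.com in two ways: by deleting one character at a time, and by overwriting one character at a time. The helper names, the use of `re.fullmatch`, and the filler character "0" are all assumptions made for this illustration.

```python
import re


def stability(pattern, variants):
    """Number of perturbed strings that still match `pattern` in full."""
    return sum(1 for v in variants if re.fullmatch(pattern, v))


def deletions(text):
    """Every string obtained by deleting one character."""
    return [text[:i] + text[i + 1:] for i in range(len(text))]


def substitutions(text, filler="0"):
    """Every string obtained by overwriting one character with `filler`."""
    return [text[:i] + filler + text[i + 1:] for i in range(len(text))]


text = "alice@myprovider.com"
first = r"alice@[a-z]+\.[a-z]+"
second = r"[a-z]+@myprovider.com"  # dot left unescaped, as written above

for name, perturb in [("deletion", deletions), ("substitution", substitutions)]:
    variants = perturb(text)
    print(name, stability(first, variants), stability(second, variants))
```

Under these assumptions, deletion gives 13 for the first expression and 5 for the second, so the second is the more specific one. Substitution with "0" gives 0 and 1, which flips the ranking; the single surviving match for the second expression comes from its unescaped dot, which accepts any character.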
    
             
                                                        
            
            
              
                