Determine regular expression's specificity

后端未结

关注

 4  1537

傲寒 2020-12-18 01:31

Given the following regular expressions:

 - alice@[a-z]+\\.[a-z]+
 - [a-z]+@[a-z]+\\.[a-z]+
 - .*

The string alice@myprovider.com


      
      
        
          4条回答        

        
                    
            
            
                         
                
              
              
                
                   盖世英雄少女心
                                             
                
                
                (楼主)
            
              
              
                2020-12-18 02:21
              

            
            
                        
The following is the solution to this problem I developed based on Donald Miner's research paper, implemented in Python, for rules applied to MAC addresses.

Basically, the most specific match is from the pattern that is not a superset of any other matching pattern.  For a particular problem domain, you create a series of tests (functions) which compare two REs and return which is the superset, or if they are orthogonal.  This lets you build a tree of matches.  For a particular input string, you go through the root patterns and find any matches.  Then go through their subpatterns.  If at any point, orthogonal patterns match, an error is raised.

Setup

import re

class RegexElement:
    def __init__(self, string,index):
        self.string=string
        self.supersets = []
        self.subsets = []
        self.disjoints = []
        self.intersects = []
        self.maybes = []
        self.precompilation = {}
        self.compiled = re.compile(string,re.IGNORECASE)
        self.index = index

SUPERSET  = object()
SUBSET    = object()
INTERSECT = object()
DISJOINT  = object()
EQUAL     = object()


The Tests

Each test takes 2 strings (a and b) and tries to determine how they are related.  If the test cannot determine the relation, None is returned.

SUPERSET means a is a superset of b.  All matches of b will match a.

SUBSET means b is a superset of a.

INTERSECT means some matches of a will match b, but some won't and some matches of b won't match a.

DISJOINT means no matches of a will match b.

EQUAL means all matches of a will match b and all matches of b will match a.

    def equal_test(a, b):  
        if a == b: return EQUAL


The graph

  class SubsetGraph(object):
    def __init__(self, tests):
        self.regexps = []
        self.tests = tests
        self._dirty = True
        self._roots = None

    @property
    def roots(self):
        if self._dirty:
            r = self._roots = [i for i in self.regexps if not i.supersets]
            return r
        return self._roots

    def add_regex(self, new_regex):
        roots = self.roots
        new_re = RegexElement(new_regex)
        for element in roots:
            self.process(new_re, element)
        self.regexps.append(new_re)

    def process(self, new_re, element):
        relationship = self.compare(new_re, element)
        if relationship:
            getattr(self, 'add_' + relationship)(new_re, element)

    def add_SUPERSET(self, new_re, element):
        for i in element.subsets:
            i.supersets.add(new_re)
            new_re.subsets.add(i)

        element.supersets.add(new_re)
        new_re.subsets.add(element)

    def add_SUBSET(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        element.subsets.add(new_re)
        new_re.supersets.add(element)

    def add_DISJOINT(self, new_re, element):
        for i in element.subsets:
            i.disjoints.add(new_re)
            new_re.disjoints.add(i)

        new_re.disjoints.add(element)
        element.disjoints.add(new_re)

    def add_INTERSECT(self, new_re, element):
        for i in element.subsets:
            self.process(new_re, i)

        new_re.intersects.add(element)
        element.intersects.add(new_re)

    def add_EQUAL(self, new_re, element):
        new_re.supersets = element.supersets.copy()
        new_re.subsets = element.subsets.copy()
        new_re.disjoints = element.disjoints.copy()
        new_re.intersects = element.intersects.copy()

    def compare(self, a, b):
        for test in self.tests:
            result = test(a.string, b.string)
            if result:
                return result

    def match(self, text, strict=True):
        matches = set()
        self._match(text, self.roots, matches)
        out = []
        for e in matches:
            for s in e.subsets:
                if s in matches:
                    break
            else:
                out.append(e)
        if strict and len(out) > 1:
            for i in out:
                print(i.string)
            raise Exception("Multiple equally specific matches found for " + text)
        return out

    def _match(self, text, elements, matches):
        new_elements = []
        for element in elements:
            m = element.compiled.match(text)
            if m:
                matches.add(element)
                new_elements.extend(element.subsets)
        if new_elements:
            self._match(text, new_elements, matches)


Usage

graph = SubsetGraph([equal_test, test_2, test_3, ...])
graph.add_regex("00:11:22:..:..:..")
graph.add_regex("..(:..){5,5}"
graph.match("00:de:ad:be:ef:00")


A complete usable version is here.
    
             
                                                        
            
            
              
                
                0
              
                   
                
               讨论(0)
              
                                                  
              
              
                          
             
       
          
              
                                       
     查看其它4个回答


            
                         
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
                              			
        
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复