Determine regular expression's specificity

后端 未结 4 1537
傲寒
傲寒 2020-12-18 01:31

Given the following regular expressions:

 - alice@[a-z]+\\.[a-z]+
 - [a-z]+@[a-z]+\\.[a-z]+
 - .*

The string alice@myprovider.com

4条回答
  •  盖世英雄少女心
    2020-12-18 02:21

    The following is the solution to this problem I developed based on Donald Miner's research paper, implemented in Python, for rules applied to MAC addresses.

    Basically, the most specific match is from the pattern that is not a superset of any other matching pattern. For a particular problem domain, you create a series of tests (functions) which compare two REs and return which is the superset, or if they are orthogonal. This lets you build a tree of matches. For a particular input string, you go through the root patterns and find any matches. Then go through their subpatterns. If at any point, orthogonal patterns match, an error is raised.

    Setup

    import re
    
    class RegexElement:
        def __init__(self, string,index):
            self.string=string
            self.supersets = []
            self.subsets = []
            self.disjoints = []
            self.intersects = []
            self.maybes = []
            self.precompilation = {}
            self.compiled = re.compile(string,re.IGNORECASE)
            self.index = index
    
    SUPERSET  = object()
    SUBSET    = object()
    INTERSECT = object()
    DISJOINT  = object()
    EQUAL     = object()
    

    The Tests

    Each test takes 2 strings (a and b) and tries to determine how they are related. If the test cannot determine the relation, None is returned.

    SUPERSET means a is a superset of b. All matches of b will match a.

    SUBSET means b is a superset of a.

    INTERSECT means some matches of a will match b, but some won't and some matches of b won't match a.

    DISJOINT means no matches of a will match b.

    EQUAL means all matches of a will match b and all matches of b will match a.

        def equal_test(a, b):  
            if a == b: return EQUAL
    

    The graph

      class SubsetGraph(object):
        def __init__(self, tests):
            self.regexps = []
            self.tests = tests
            self._dirty = True
            self._roots = None
    
        @property
        def roots(self):
            if self._dirty:
                r = self._roots = [i for i in self.regexps if not i.supersets]
                return r
            return self._roots
    
        def add_regex(self, new_regex):
            roots = self.roots
            new_re = RegexElement(new_regex)
            for element in roots:
                self.process(new_re, element)
            self.regexps.append(new_re)
    
        def process(self, new_re, element):
            relationship = self.compare(new_re, element)
            if relationship:
                getattr(self, 'add_' + relationship)(new_re, element)
    
        def add_SUPERSET(self, new_re, element):
            for i in element.subsets:
                i.supersets.add(new_re)
                new_re.subsets.add(i)
    
            element.supersets.add(new_re)
            new_re.subsets.add(element)
    
        def add_SUBSET(self, new_re, element):
            for i in element.subsets:
                self.process(new_re, i)
    
            element.subsets.add(new_re)
            new_re.supersets.add(element)
    
        def add_DISJOINT(self, new_re, element):
            for i in element.subsets:
                i.disjoints.add(new_re)
                new_re.disjoints.add(i)
    
            new_re.disjoints.add(element)
            element.disjoints.add(new_re)
    
        def add_INTERSECT(self, new_re, element):
            for i in element.subsets:
                self.process(new_re, i)
    
            new_re.intersects.add(element)
            element.intersects.add(new_re)
    
        def add_EQUAL(self, new_re, element):
            new_re.supersets = element.supersets.copy()
            new_re.subsets = element.subsets.copy()
            new_re.disjoints = element.disjoints.copy()
            new_re.intersects = element.intersects.copy()
    
        def compare(self, a, b):
            for test in self.tests:
                result = test(a.string, b.string)
                if result:
                    return result
    
        def match(self, text, strict=True):
            matches = set()
            self._match(text, self.roots, matches)
            out = []
            for e in matches:
                for s in e.subsets:
                    if s in matches:
                        break
                else:
                    out.append(e)
            if strict and len(out) > 1:
                for i in out:
                    print(i.string)
                raise Exception("Multiple equally specific matches found for " + text)
            return out
    
        def _match(self, text, elements, matches):
            new_elements = []
            for element in elements:
                m = element.compiled.match(text)
                if m:
                    matches.add(element)
                    new_elements.extend(element.subsets)
            if new_elements:
                self._match(text, new_elements, matches)
    

    Usage

    graph = SubsetGraph([equal_test, test_2, test_3, ...])
    graph.add_regex("00:11:22:..:..:..")
    graph.add_regex("..(:..){5,5}"
    graph.match("00:de:ad:be:ef:00")
    

    A complete usable version is here.

提交回复
热议问题