Generating a list of values a regex COULD match in Python

后端 未结 4 788
借酒劲吻你
借酒劲吻你 2021-01-07 00:04

I\'m trying to use a regex as an input, and from there generate all the possible values that the regex would match.

So, for example, if the regex is \"three-letter w

相关标签:
4条回答
  • 2021-01-07 00:16

    Some regular expressions match a finite number of input strings, but many (most?) match an infinite number of input strings. It's kind of like asking 'given the python language grammar, generate all possible python programs'. You probably could write a program to list them all in sequence if you tried (though it would take infinite time to run), but are you sure you want to? Why would you want to?

    I'm pretty sure the regular expression engine in the standard library does not expose a way to generate the output you want. You'd have to get lower level access to the internal data structures, or implement some DFA engine thing-a-ma-bob yourself.

    0 讨论(0)
  • 2021-01-07 00:18

    You don't want to do this. Most of the result sets will be huge, and some will be infinite. Instead use a sequence of test vectors and apply the regex against each in turn:

    vectors = (
      'foo',
      'bar',
      ...
    )
    
    for result in (re.match(someregex, entry) for entry in vectors):
      ...
    
    0 讨论(0)
  • 2021-01-07 00:19

    Here's a brute force solution that should work. It has a running time of O(L^max_length) (where L is the size of the alphabet), so use it at your own risk.

    def all_matching_strings(alphabet, max_length, regex):
    """Find the list of all strings over 'alphabet' of length up to 'max_length' that match 'regex'"""
    
    if max_length == 0: return 
    
    L = len(alphabet)
    for N in range(1, max_length+1):
        indices = [0]*N
        for z in xrange(L**N):
            r = ''.join(alphabet[i] for i in indices)
            if regex.match(r):                
               yield(r)
    
            i = 0
            indices[i] += 1
            while (i<N) and (indices[i]==L):
                indices[i] = 0
                i += 1
                if i<N: indices[i] += 1
    
    return
    

    example usage:

    alphabet = 'abcdef1234567890'
    import re
    regex = re.compile('f*[1-3]+$')
    for r in all_matching_strings(alphabet, 5, regex): 
        print r
    

    which would output all strings up to length 5, starting with a sequence of f's, and then a non empty sequence of 1-3, then ending:

    1
    2
    3
    f1
    11
    21
    31
    f2
    12
    22
    32
    f3
    13
    23
    33
    ff1
    [more output omitted...]
    
    0 讨论(0)
  • 2021-01-07 00:30

    The set of matching strings is infinite if and only if there is a quantifier (+ or *) in your regexp. Your question doesn't seem to aim at those patterns. I rather believe that the product function from itertools might help here.

    You might for instance introduce a special character indicating an arbitrary letter (e.g. an underscore), then build a pattern like this

    patt = 'a_c'
    

    and define your alphabet

    youralphabet = 'abcde...'
    

    and define a function generating all possible instances like this

    def genInstances(patt):
        elems = [c if c != '_' else youralphabet for c in patt]
        return itertools.product(*elems)
    

    You may then extend this approach to match real regexp by parsing your pattern for \d or [a-zA-Z] or whatever.

    0 讨论(0)
提交回复
热议问题