Efficiently match multiple regexes in Python

You can merge all your regexes into one using the "|" operator and let the regex library do the work of discerning between tokens. Some care should be taken to get token precedence right (for example, to avoid matching a keyword as an identifier).
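
As a minimal sketch of that precedence point (the pattern and token names are mine, purely illustrative): alternatives are tried left to right, so the keyword pattern has to come before the general identifier pattern.

import re

# KEYWORD is listed before ID, so "if" is reported as a keyword, not an identifier.
pattern = re.compile(r'(?P<KEYWORD>\bif\b|\bthen\b)|(?P<ID>[A-Za-z]+)|(?P<NUMBER>[0-9]+)')

print(pattern.match('if').lastgroup)    # KEYWORD
print(pattern.match('iffy').lastgroup)  # ID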

I suggest using the re.Scanner class. It's not documented in the standard library, but it's well worth using. Here's an example:

import re

scanner = re.Scanner([
    (r"-?[0-9]+\.[0-9]+([eE]-?[0-9]+)?", lambda scanner, token: float(token)),  # floats
    (r"-?[0-9]+", lambda scanner, token: int(token)),                           # integers
    (r" +", lambda scanner, token: None),                                       # skip whitespace
])

>>> scanner.scan("0 -1 4.5 7.8e3")[0]
[0, -1, 4.5, 7800.0]
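
Note that scan() actually returns a pair: the list of token results and whatever tail of the input it could not match, so it's worth checking the second element too. For example (the trailing "oops" here is deliberately left unmatched):

>>> tokens, remainder = scanner.scan("0 -1 4.5 oops")
>>> tokens
[0, -1, 4.5]
>>> remainder
'oops'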

I found this in the Python documentation. It's simple and elegant.

import collections
import re

Token = collections.namedtuple('Token', ['typ', 'value', 'line', 'column'])

def tokenize(s):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',  r'\d+(\.\d*)?'), # Integer or decimal number
        ('ASSIGN',  r':='),          # Assignment operator
        ('END',     r';'),           # Statement terminator
        ('ID',      r'[A-Za-z]+'),   # Identifiers
    ('OP',      r'[+\-*/]'),     # Arithmetic operators
    ('NEWLINE', r'\n'),          # Line endings
    ('SKIP',    r'[ \t]+'),      # Skip over spaces and tabs
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    get_token = re.compile(tok_regex).match
    line = 1
    pos = line_start = 0
    mo = get_token(s)
    while mo is not None:
        typ = mo.lastgroup
        if typ == 'NEWLINE':
            line_start = mo.end()  # so columns are 0-based on every line
            line += 1
        elif typ != 'SKIP':
            val = mo.group(typ)
            if typ == 'ID' and val in keywords:
                typ = val
            yield Token(typ, val, line, mo.start()-line_start)
        pos = mo.end()
        mo = get_token(s, pos)
    if pos != len(s):
        raise RuntimeError('Unexpected character %r on line %d' % (s[pos], line))

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

The trick here is the line:

tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)

Here (?P<NAME>PATTERN) tags whatever PATTERN matches with the group name NAME; the match object's lastgroup attribute then tells you which alternative matched.
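
A quick way to see it in action with two of the token patterns from above:

import re

mo = re.match(r'(?P<NUMBER>\d+(\.\d*)?)|(?P<ID>[A-Za-z]+)', 'price')
print(mo.lastgroup)            # ID
print(mo.group(mo.lastgroup))  # price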

A compiled pattern's match() is anchored, and you can give it a starting position:

pos = 0
end = len(text)
while pos < end:
    match = regexp.match(text, pos)
    if match is None:
        break  # or raise an error; otherwise match.end() below would fail on None
    # do something with your match
    pos = match.end()
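
Filled out into something runnable (the pattern and input text here are placeholders of my own):

import re

regexp = re.compile(r'\d+|[A-Za-z]+|[ \t]+')  # hypothetical token pattern
text = 'total 42'

pos = 0
end = len(text)
while pos < end:
    match = regexp.match(text, pos)
    if match is None:
        raise RuntimeError('no token at position %d' % pos)
    print(repr(match.group()))
    pos = match.end()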

Have a look at pygments, which ships a large number of lexers for syntax highlighting purposes, with different implementations, most of them based on regular expressions.
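
To get a feel for how those lexers look, here is a minimal sketch of a pygments RegexLexer (the token rules are invented; see the pygments documentation for the full API):

from pygments.lexer import RegexLexer
from pygments.token import Name, Number, Whitespace

class TinyLexer(RegexLexer):
    name = 'tiny'
    tokens = {
        'root': [
            (r'\d+(\.\d*)?', Number),
            (r'[A-Za-z]+', Name),
            (r'\s+', Whitespace),
        ],
    }

for toktype, value in TinyLexer().get_tokens('total 42'):
    print(toktype, repr(value))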

It's possible that combining the token regexes will work, but you'd have to benchmark it. Something like:

import re

x = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
m = x.match('9999')
if m:
    groups = m.groupdict()  # => {'NUMBER': '9999', 'VAR': None}
    token = [item for item in groups.items() if item[1] is not None][0]

The filter is where you'll have to do some benchmarking...
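
A rough way to run that benchmark with timeit (the patterns and test string below are arbitrary choices of mine):

import re
import timeit

combined = re.compile('(?P<NUMBER>[0-9]+)|(?P<VAR>[a-z]+)')
separate = [('NUMBER', re.compile('[0-9]+')), ('VAR', re.compile('[a-z]+'))]

def scan_combined(s='9999'):
    m = combined.match(s)
    return next(item for item in m.groupdict().items() if item[1] is not None)

def scan_separate(s='9999'):
    for name, regex in separate:
        m = regex.match(s)
        if m:
            return name, m.group()

print('combined:', timeit.timeit(scan_combined, number=100000))
print('separate:', timeit.timeit(scan_separate, number=100000))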

Update: I tested this, and it seems that if you combine all the tokens as stated and write a function like:

def find_token(lst):
    for tok in lst:
        if tok[1] is not None:
            return tok
    raise ValueError('no token matched')

You'll get roughly the same speed (maybe a touch faster). I believe the speedup comes from reducing the number of calls to match, but the loop for token discrimination is still there, which of course kills it.
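
One way to avoid that discrimination loop entirely, assuming named groups as in the examples above, is to ask the match object which group matched instead of scanning the group dict:

m = x.match('9999')  # x is the combined pattern from above
if m:
    token = (m.lastgroup, m.group())  # ('NUMBER', '9999'), no per-group loop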

This isn't exactly a direct answer to your question, but you might want to look at ANTLR. According to this document, the Python code-generation target should be up to date.

As to your regexes, there are really two ways to speed things up if you're sticking with regexes. The first is to order your regexes by how likely they are to occur in typical input; you could add a simple profiler that collects token counts for each token type and run the lexer on a representative body of text. The other is to bucket-sort your regexes (since your key space, being a character, is relatively small) and then use an array or dictionary to pick the regexes to try after a single discrimination on the first character, as sketched below.
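
A minimal sketch of that second idea (the dispatch table and patterns are invented for illustration):

import re
import string

# Map each possible first character to the regexes worth trying there.
by_first_char = {}
for first_chars, pattern in [
    (string.digits, re.compile(r'\d+(\.\d*)?')),       # numbers
    (string.ascii_letters, re.compile(r'[A-Za-z]+')),  # identifiers
    (' \t', re.compile(r'[ \t]+')),                    # whitespace
]:
    for ch in first_chars:
        by_first_char.setdefault(ch, []).append(pattern)

def next_token(text, pos):
    for pattern in by_first_char.get(text[pos], []):
        m = pattern.match(text, pos)
        if m:
            return m
    raise RuntimeError('Unexpected character %r' % text[pos])

print(repr(next_token('total 42', 0).group()))  # 'total'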

However, I think that if you're going to go this route, you should really try something like ANTLR, which will be easier to maintain, faster, and less likely to have bugs.
