How to find all possible regex matches in Python?

Submitted by 吃可爱长大的小学妹 on 2019-11-27 20:14:45
zmo

It's actually not possible to do this using regular expressions, because a regular expression describes a language defined by a regular grammar, which is recognized by a finite automaton (deterministic or non-deterministic) where matching is represented by states. To match nested parentheses, you'd need to be able to match an unbounded number of parentheses, and so you'd need an automaton with an infinite number of states.

To cope with that, we use what's called a pushdown automaton, which recognizes the languages defined by context-free grammars.
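To make the stack idea concrete, here is a minimal Python sketch. It is not a full pushdown automaton, only a hand-rolled stack of opening positions, and the function name balanced_spans and the sample input are made up for this illustration.

```python
def balanced_spans(text):
    """Return every balanced parenthesized substring, in the order they close."""
    stack = []   # positions of '(' that are still open
    spans = []
    for i, ch in enumerate(text):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            start = stack.pop()          # innermost open parenthesis
            spans.append(text[start:i + 1])
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return spans


print(balanced_spans("(a (b) (c (d)))"))
```

The stack can grow as deep as the input requires, which is exactly the unbounded memory a finite automaton lacks.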

So if your regex does not match nested parentheses, it is because it expresses a finite automaton, and that automaton does not match anything on your input.

As a reference, please have a look at MIT's courses on the topic:

So one way to parse your string efficiently is to build a grammar for nested parentheses (pip install pyparsing first):

>>> import pyparsing
>>> strings = pyparsing.Word(pyparsing.alphanums)
>>> parens = pyparsing.nestedExpr('(', ')', content=strings)
>>> parens.parseString('(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))').asList()
[['NP', ['NNP', 'Hoi'], ['NN', 'Hallo'], ['NN', 'Hey'], ['NNP', ['NN', 'Ciao'], ['NN', 'Adios']]]]

N.B.: a few regular expression engines do implement nested parenthesis matching using a pushdown mechanism. The default Python re engine is not one of them, but an alternative engine exists, called regex (pip install regex), that can do recursive matching (which lets that engine handle context-free patterns), cf. this code snippet:

>>> import regex
>>> res = regex.search(r'(?<rec>\((?:[^()]++|(?&rec))*\))', '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))')
>>> res.captures('rec')
['(NNP Hoi)', '(NN Hallo)', '(NN Hey)', '(NN Ciao)', '(NN Adios)', '(NNP (NN Ciao) (NN Adios))', '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))']

Regular expressions used in modern languages DO NOT represent regular languages. zmo is right in saying that regular languages in Language Theory are represented by finite state automata, but the regular expressions used in modern languages, which rely on backtracking and support features such as backreferences to capturing groups, CANNOT in general be represented by the FSAs known in Language Theory. How can you represent a pattern like (\w+)\1 with a DFA or even an NFA?
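To see that pattern in action, here is a small sketch using Python's built-in re module. The helper name is_doubled and the test strings are made up for this illustration; the point is only that the engine has to remember an arbitrarily long first half, which a fixed set of states cannot do.

```python
import re

# (\w+)\1 : some run of word characters, immediately followed by the same run again
DOUBLED = re.compile(r"(\w+)\1")


def is_doubled(s):
    """True if the whole string is some word repeated exactly twice."""
    return DOUBLED.fullmatch(s) is not None


for candidate in ["abcabc", "abcabd", "aa", "aba"]:
    print(candidate, is_doubled(candidate))
```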

The regular expression you are looking for can be something like this (it only matches up to two levels of nesting):

(?=(\((?:[^\)\(]*\([^\)]*\)|[^\)\(])*?\))) 

I tested this on http://regexhero.net/tester/

The matches are in the captured groups:

1: (NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios))

1: (NNP Hoi)

1: (NN Hallo)

1: (NN Hey)

1: (NNP (NN Ciao) (NN Adios))

1: (NN Ciao)

1: (NN Adios)
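For what it's worth, the same lookahead pattern can also be tried with Python's built-in re module instead of the online tester. This is a small sketch; the names PATTERN, TEXT and overlapping_matches are only illustrative. Because the whole pattern sits inside a lookahead, each match is zero-width, so the engine retries at every position and overlapping groups are all reported.

```python
import re

# The two-level pattern from above, wrapped in a lookahead so matches may overlap
PATTERN = re.compile(r"(?=(\((?:[^\)\(]*\([^\)]*\)|[^\)\(])*?\)))")

TEXT = "(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))"


def overlapping_matches(text):
    """Collect capture group 1 from every zero-width match."""
    return [m.group(1) for m in PATTERN.finditer(text)]


for found in overlapping_matches(TEXT):
    print(found)
```

Note that the first result stops one closing parenthesis short of the full input, which is a side effect of the pattern only handling two levels of nesting.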
