How to find all possible regex matches in python?

后端 未结 2 656
醉梦人生
醉梦人生 2020-12-05 11:47

I am trying to find all possible word/tag pairs or other nested combinations with python and its regular expressions.

sent = \'(NP (NNP Hoi) (NN Hallo) (NN H         


        
2条回答
  •  旧时难觅i
    2020-12-05 12:26

    it's actually not possible to do this by using regular expressions, because regular expressions express a language defined by a regular grammar that can be solved by a non finite deterministic automaton, where matching is represented by states ; then to match nested parenthesis, you'd need to be able to match an infinite number of parenthesis and then have an automaton with an infinite number of states.

    To be able to cope with that, we use what's called a push-down automaton, that is used to define the context free grammar.

    So if your regex does not match nested parenthesis, it's because it's expressing the following automaton and does not match anything on your input:

    Regular expression visualization

    Play with it

    As a reference, please have a look at MIT's courses on the topic:

    • http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-045j-automata-computability-and-complexity-spring-2011/lecture-notes/MIT6_045JS11_lec04.pdf
    • http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-005-elements-of-software-construction-fall-2011/lecture-notes/MIT6_005F11_lec05.pdf
    • http://www.saylor.org/site/wp-content/uploads/2012/01/CS304-2.1-MIT.pdf

    So one of the ways to parse your string efficiently, is to build a grammar for nested parenthesis (pip install pyparsing first):

    >>> import pyparsing
    >>> strings = pyparsing.Word(pyparsing.alphanums)
    >>> parens  = pyparsing.nestedExpr( '(', ')', content=strings)
    >>> parens.parseString('(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))').asList()
    [['NP', ['NNP', 'Hoi'], ['NN', 'Hallo'], ['NN', 'Hey'], ['NNP', ['NN', 'Ciao'], ['NN', 'Adios']]]]
    

    N.B.: there exists a few regular expressions engines that do implement nested parenthesis matching using the push down. The default python re engine is not one of them, but an alternative engine exists, called regex (pip install regex) that can do recursive matching (which makes the re engine context free), cf this code snippet:

    >>> import regex
    >>> res = regex.search(r'(?\((?:[^()]++|(?&rec))*\))', '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))')
    >>> res.captures('rec')
    ['(NNP Hoi)', '(NN Hallo)', '(NN Hey)', '(NN Ciao)', '(NN Adios)', '(NNP (NN Ciao) (NN Adios))', '(NP (NNP Hoi) (NN Hallo) (NN Hey) (NNP (NN Ciao) (NN Adios)))']
    

提交回复
热议问题