How to use nltk regex pattern to extract a specific phrase chunk?

走了就别回头了 2020-12-08 17:32

I have written the following regex to tag certain phrase patterns:

pattern = """
        P2: {+ ? * + * &         


        
1 Answer
  • 2020-12-08 17:54

    Firstly, let's take a look at the POS tags that NLTK gives:

    >>> from nltk import pos_tag
    >>> sent = 'The pizza was awesome and brilliant'.split()
    >>> pos_tag(sent)
    [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
    >>> sent = 'The pizza was good but pasta was bad'.split()
    >>> pos_tag(sent)
    [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'), ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]
    

    (Note: the above is the output of pos_tag in NLTK v3.1; older versions might tag these sentences differently.)

    What you want to capture is essentially:

    • NN VBD JJ CC JJ
    • NN VBD JJ
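
    To see these sequences at a glance, it can help to drop the words and keep only the tags. A minimal sketch, assuming the tagged output shown above is hard-coded (so it does not depend on which tagger version is installed); the helper name tag_sequence is made up for illustration:

```python
def tag_sequence(tagged):
    """Return the POS tags of a tagged sentence as one space-separated string."""
    return ' '.join(tag for _word, tag in tagged)

# Tagged output copied from the NLTK v3.1 session above.
tagged1 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'),
           ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')]
tagged2 = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
           ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

print(tag_sequence(tagged1))  # DT NN VBD JJ CC JJ
print(tag_sequence(tagged2))  # DT NN VBD JJ CC NN VBD JJ
```

    The second sentence contains the shorter sequence NN VBD JJ twice, separated by a CC.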

    So let's catch them with these patterns:

    >>> from nltk import RegexpParser
    >>> sent1 = ['The', 'pizza', 'was', 'awesome', 'and', 'brilliant']
    >>> sent2 = ['The', 'pizza', 'was', 'good', 'but', 'pasta', 'was', 'bad']
    >>> patterns = """
    ... P: {<NN><VBD><JJ><CC><JJ>}
    ... {<NN><VBD><JJ>}
    ... """
    >>> PChunker = RegexpParser(patterns)
    >>> PChunker.parse(pos_tag(sent1))
    Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
    >>> PChunker.parse(pos_tag(sent2))
    Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
    

    But that's "cheating" by hardcoding each pattern!

    Let's go back to the POS patterns:

    • NN VBD JJ CC JJ
    • NN VBD JJ

    These can be simplified to a single pattern in which the trailing CC JJ is optional:

    • NN VBD JJ (CC JJ)?
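
    A quick way to sanity-check a simplification like this is to confirm that it still covers every sequence you observed. The sketch below uses Python's built-in re module on space-separated tag strings; that is an assumption made for illustration and is not NLTK's chunk grammar syntax:

```python
import re

# The two tag sequences observed above.
observed = ['NN VBD JJ CC JJ', 'NN VBD JJ']

# Simplified form: the trailing " CC JJ" is optional.
simplified = re.compile(r'NN VBD JJ( CC JJ)?')

# fullmatch requires the whole sequence to fit the pattern.
covers_all = all(simplified.fullmatch(seq) for seq in observed)
print(covers_all)  # True
```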

    So you can use the optional operator (?) in the regex, e.g.:

    >>> patterns = """
    ... P: {<NN><VBD><JJ>(<CC><JJ>)?}
    ... """
    >>> PChunker = RegexpParser(patterns)
    >>> PChunker.parse(pos_tag(sent1))
    Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('awesome', 'JJ'), ('and', 'CC'), ('brilliant', 'JJ')])])
    >>> PChunker.parse(pos_tag(sent2))
    Tree('S', [('The', 'DT'), Tree('P', [('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ')]), ('but', 'CC'), Tree('P', [('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')])])
    
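
    To get the matched phrases out of the tree as plain strings, one option is to walk the subtrees labelled P. A minimal sketch, assuming NLTK 3 (where Tree.label() is available) and using hand-tagged input copied from the pos_tag output above, so the result does not depend on the installed tagger:

```python
from nltk import RegexpParser

# Hand-tagged input copied from the pos_tag output shown earlier.
tagged = [('The', 'DT'), ('pizza', 'NN'), ('was', 'VBD'), ('good', 'JJ'),
          ('but', 'CC'), ('pasta', 'NN'), ('was', 'VBD'), ('bad', 'JJ')]

chunker = RegexpParser("P: {<NN><VBD><JJ>(<CC><JJ>)?}")
tree = chunker.parse(tagged)

# Keep only the subtrees labelled 'P' and join their words back together.
phrases = [' '.join(word for word, _tag in subtree.leaves())
           for subtree in tree.subtrees(filter=lambda t: t.label() == 'P')]
print(phrases)  # ['pizza was good', 'pasta was bad']
```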

    Most probably you're using the old tagger, which is why your tags (and therefore your patterns) are different. Still, the example above should show how to capture the phrases you need.

    The steps are:

    • First, check what the POS patterns are using pos_tag
    • Then generalize patterns and simplify them
    • Then put them into the RegexpParser