Extract all noun phrases from Stanford Parser output text file using bash

Question

After running the Stanford Parser, I have output files in Penn Treebank bracketed format. Each file contains a tree like the following.

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))

Now I want to extract all noun phrases from these files with a bash script. I know this can be done in Java, but I am not sure how to go about reading the text file into a tree.

Answer 1:

Here's a quick and dirty Awk script to extract the outermost NP subtrees. If you want the internals as well, you'll need a properly recursive solution.

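awk -v RS='[ \t\n]+' '
    # RS as a regex works in GNU awk; POSIX leaves multi-character RS unspecified
    !np && /^\(NP$/ { np = paren }      # outermost NP opens: remember the depth outside it
    /^\(/ { ++paren }                   # token opens a bracket
    /\)/ {                              # token closes one or more brackets
        b = $0; c = ""
        while (sub(/\)$/, "", b)) {
            paren--; c = c ")"
            if (np && paren == np) {    # the NP just closed: print it and reset
                d = b; gsub(/\)+$/, "", d)
                print a " " d c; np = 0; a = c = "" } } }
    np { a = a (a ? " " : "") $0 }'     # collect tokens while inside an NP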
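
If the nested NPs are wanted as well, one possible approach is to track the position of every opening bracket on a stack and print a subtree whenever the bracket that closes is an NP. The following is only a sketch under some assumptions: the function name extract_all_nps is made up for illustration, the input is assumed to be well-formed Penn Treebank bracketing (literal parentheses encoded as -LRB-/-RRB-), and only the exact label NP is matched, so NP-TMP is skipped. It avoids the regex RS, so it should not depend on GNU awk.

```shell
#!/bin/sh
# Sketch: print every NP subtree, nested ones included, one per line.
# Reads the files given as arguments, or standard input if there are none.
extract_all_nps() {
    awk '
        { text = text " " $0 }                 # slurp the whole input into one string
        END {
            gsub(/[ \t]+/, " ", text)          # collapse indentation and line joins
            n = length(text); depth = 0
            for (i = 1; i <= n; i++) {
                ch = substr(text, i, 1)
                if (ch == "(") {
                    start[++depth] = i         # push the position of this opening bracket
                } else if (ch == ")" && depth > 0) {
                    tree = substr(text, start[depth], i - start[depth] + 1)
                    depth--                    # pop the matching opening bracket
                    if (tree ~ /^\(NP /) print tree
                }
            }
        }
    ' "$@"
}
```

Calling it as extract_all_nps parser-output.txt (a hypothetical file name) prints inner NPs before the NPs that contain them, because a subtree is emitted at the moment its closing bracket is reached. Changing the test to /^\(NP[ -]/ would also pick up labels such as NP-TMP.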

Answer 2:

Stanford provides another tool called Tregex, which operates on parse trees and extracts subtrees that match a pattern written in a query language similar to regular expressions.

http://nlp.stanford.edu/software/tregex.shtml
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/TregexPattern.html

This tool can be run from the command line.

Source: https://stackoverflow.com/questions/27291367/extract-all-noun-phrases-from-stanford-parser-output-textfile-using-bash

Tags

bash

stanford-nlp