Extract all noun phrases from Stanford Parser output text file using bash

Submitted by 大兔子大兔子 on 2020-02-04 05:41:28

Question


After running the Stanford Parser, I have output files in Penn Treebank bracketed format. Each file contains a tree like the following.

(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))

Now I want to write a bash script that extracts all the noun phrases. I know there is a way to do this in Java, but I am not sure how to go about reading the text file into a tree.


Answer 1:


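Here's a quick and dirty Awk script to extract the outermost NP subtrees. As written, it reads the parse from standard input and relies on an Awk that accepts a regular expression as RS (GNU Awk does). If you want the internals as well, you'll need a properly recursive solution.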

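awk -v RS='[ \t\n]+' '
    # Each record is one whitespace-delimited token, e.g. "(NP" or "India)))".
    # Entering an outermost NP: remember the depth just outside it.
    !np && /^\(NP$/ { np=paren }
    # Every token that starts with "(" opens one bracket.
    /^\(/ { ++paren }
    # Strip trailing ")" one at a time; c collects the ones seen so far.
    /\)/ { b=$0; c=""; while (sub(/\)$/, "", b)) {paren--; c=c ")"
        # Back at the remembered depth: the NP is complete, so print it.
        if (np && paren == np) {
            d=b; gsub(/\)+$/, "", d); print a " " d c; np=0; a=c="" } } }
    # Inside an NP, append the token to the accumulated subtree.
    np { a=a (a ? " " : "") $0 }'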
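
If you do want the nested NPs as well, one possible approach is to keep a stack of the positions where brackets open and print a subtree whenever a closing bracket ends one labelled NP. The following is only a sketch under some assumptions: all_nps is a made-up function name, the input is well-formed bracketed text with literal parentheses already escaped (as -LRB- and -RRB-), and only the exact label NP is wanted, so NP-TMP is skipped. Subtrees are printed in the order they close, so inner NPs come before the NP that contains them.

```shell
#!/usr/bin/env bash
# Sketch: print every NP subtree, nested ones included, one per line.
# all_nps is a hypothetical helper; it reads the named files, or stdin if none.
all_nps() {
    awk '
        { text = text " " $0 }              # slurp the whole parse into one string
        END {
            gsub(/[ \t]+/, " ", text)       # collapse indentation to single spaces
            n = length(text)
            for (i = 1; i <= n; i++) {
                ch = substr(text, i, 1)
                if (ch == "(") {
                    start[++depth] = i      # remember where this bracket opened
                } else if (ch == ")" && depth > 0) {
                    tree = substr(text, start[depth], i - start[depth] + 1)
                    depth--
                    # keep subtrees whose label is exactly NP (not NP-TMP)
                    if (tree ~ /^\(NP /) print tree
                }
            }
        }' "$@"
}
```

Calling all_nps with a parser output file as its argument (or piping the parse into it) should list each NP on its own line, flattened to single spaces.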



Answer 2:


Stanford provides another tool called Tregex, which operates on parse trees and extracts subtrees that match a query written in a language similar to regular expressions.

http://nlp.stanford.edu/software/tregex.shtml
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/TregexPattern.html

This tool can be run from the command line.



Source: https://stackoverflow.com/questions/27291367/extract-all-noun-phrases-from-stanford-parser-output-textfile-using-bash
