Question
After running the Stanford Parser, I have output files in Penn Treebank bracketed format. Each file contains something like the following.
(ROOT
  (S
    (S
      (NP
        (NP (DT The) (JJS strongest) (NN rain))
        (VP
          (ADVP (RB ever))
          (VBN recorded)
          (PP (IN in)
            (NP (NNP India)))))
      (VP
        (VP (VBD shut)
          (PRT (RP down))
          (NP
            (NP (DT the) (JJ financial) (NN hub))
            (PP (IN of)
              (NP (NNP Mumbai)))))
        (, ,)
        (VP (VBD snapped)
          (NP (NN communication) (NNS lines)))
        (, ,)
        (VP (VBD closed)
          (NP (NNS airports)))
        (CC and)
        (VP (VBD forced)
          (NP
            (NP (NNS thousands))
            (PP (IN of)
              (NP (NNS people))))
          (S
            (VP (TO to)
              (VP
                (VP (VB sleep)
                  (PP (IN in)
                    (NP (PRP$ their) (NNS offices))))
                (CC or)
                (VP (VB walk)
                  (NP (NN home))
                  (PP (IN during)
                    (NP (DT the) (NN night))))))))))
    (, ,)
    (NP (NNS officials))
    (VP (VBD said)
      (NP-TMP (NN today)))
    (. .)))
Now I want to extract all noun phrases with a bash script. I know this can be done in Java, but I am not sure how to read a text file into a tree and work with it from a script.
Answer 1:
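Here's a quick and dirty Awk script to extract the outermost NP subtrees. It reads the parser output on standard input and splits it into whitespace-separated tokens with a regular-expression RS, which POSIX does not guarantee but gawk and mawk support. If you want the internal NPs as well, you'll need a properly recursive solution.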
awk -v RS='[ \t\n]+' '
  !np && /^\(NP$/ { np=paren }
  /^\(/ { ++paren }
  /\)/ { b=$0; c=""; while (sub(/\)$/, "", b)) {paren--; c=c ")"
      if (np && paren == np) {
        d=b; gsub(/\)+$/, "", d); print a " " d c; np=0; a=c="" } } }
  np { a=a (a ? " " : "") $0 }'
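If you do want the nested NPs too, one way to avoid real recursion is to track the bracket depth and remember where each open NP started. The following is only a rough sketch under a few assumptions: the function name `extract_nps` is made up for illustration, only nodes labelled exactly `NP` count (so `NP-TMP` is skipped, as in the script above), and literal parentheses in the text are assumed to be escaped as `-LRB-`/`-RRB-`, so every `(` and `)` is a tree bracket.

```shell
#!/usr/bin/env bash
# extract_nps (hypothetical helper): print every NP subtree, nested ones
# included, one per line. Reads Penn Treebank bracketed text on stdin.
extract_nps() {
  awk '
    { text = text " " $0 }            # join all input lines into one string
    END {
      gsub(/[ \t]+/, " ", text)       # squeeze runs of whitespace
      n = length(text)
      depth = 0
      for (i = 1; i <= n; i++) {
        ch = substr(text, i, 1)
        if (ch == "(") {
          depth++
          # Remember the start offset if this node is labelled exactly NP.
          start[depth] = (substr(text, i, 4) == "(NP ") ? i : 0
        } else if (ch == ")") {
          # Closing an NP: print everything from its opening bracket.
          if (start[depth]) {
            print substr(text, start[depth], i - start[depth] + 1)
          }
          depth--
        }
      }
    }'
}
```

Call it as `extract_nps < parsed.txt`, where `parsed.txt` stands for one of your parser output files. Each NP is printed when its closing bracket is reached, so inner NPs appear before the NPs that contain them.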
Answer 2:
Stanford provides another tool called Tregex, which operates on parse trees and extracts subtrees that match a pattern written in a query language similar to regular expressions.
http://nlp.stanford.edu/software/tregex.shtml
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/trees/tregex/TregexPattern.html
Tregex can be run from the command line.
Source: https://stackoverflow.com/questions/27291367/extract-all-noun-phrases-from-stanford-parser-output-textfile-using-bash