How to extract noun phrases from the parsed text

I have parsed a text with constituency parser copy the result in a text file like below:

(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP we)) (VP (VBD went) (PP (TO to)....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (PRP I)) (VP (VBD was) (NP (NP (EX...
(ROOT (S (NP (NN Yesterday)) (, ,) (NP (PRP I)) (VP (VBD went) (PP (TO to.....
(ROOT (FRAG (SBAR (SBAR (IN While) (S (NP (NNP Jim)) (VP (VBD was) (NP (NP (....
(ROOT (S (S (NP (PRP I)) (VP (VBD started) (S (VP (VBG talking) (PP.....

I need to extract all NounPhrases (NP) from this text file. I wrote the following code that extract only the first NP from each line. However, I need to extract all noun phrases. My Code is:

public class nounPhrase {

    public static int findClosingParen(char[] text, int openPos) {
        int closePos = openPos;
        int counter = 1;
        while (counter > 0) {
            char c = text[++closePos];
            if (c == '(') {

                counter++;
            }
            else if (c == ')') {
                counter--;
            }
        }
        return closePos;
    }

     public static void main(String[] args) throws IOException {

        ArrayList npList = new ArrayList ();
        String line;
        String line1;
        int np;

        String Input = "/local/Input/Temp/Temp.txt";

        String Output = "/local/Output/Temp/Temp-out.txt";  

        FileInputStream  fis = new FileInputStream (Input);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"
        ));
        while ((line = br.readLine())!= null){
        char[] lineArray = line.toCharArray();
        np = findClosingParen (lineArray, line.indexOf("(NP"));
        line1 = line.substring(line.indexOf("(NP"),np+1);
        System.out.print(line1+"\n");
        }
    }
}

The output is:

(NP (NN Yesterday))...I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also
(NP (NNP Jim)).....I need other NPs in this line also
(NP (PRP I)).....I need other NPs in this line also

My code only takes the first NP on each line with its closing parenthesis but I need to extract all NPs from the text.

While writing your own tree parser is a good exercise (!), if you just want results, the easiest way is to use more of the functionality of the Stanford NLP tools, namely Tregex, which is designed for just such things. You can change your final while loop to something like this:

TregexPattern tPattern = TregexPattern.compile("NP");
while ((line = br.readLine()) != null) {
    Tree t = Tree.valueOf(line);
    TregexMatcher tMatcher = tPattern.matcher(t);
    while (tMatcher.find()) {
      System.out.println(tMatcher.getMatch());
    }
}

Here you go. I changed it around a little bit and it got messy but I can clean it up if you really need the code pretty.

import java.io.*;
import java.util.*;

public class nounPhrase {
    public static void main(String[] args)throws IOException{

        ArrayList<String> npList = new ArrayList<String>();
        String line = "";
        String line1 = "";

        String Input = "/local/Input/Temp/Temp.txt";
        String Output = "/local/Output/Temp/Temp-out.txt";

        FileInputStream  fis = new FileInputStream (Input);
        BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"));

        while ((line = br.readLine()) != null){
            char[] lineArray = line.toCharArray();
            int temp;
            for (int i=0; i+2<lineArray.length; i++){
                if(lineArray[i]=='(' && lineArray[i+1]=='N' && lineArray[i+2]=='P'){
                    temp = i;
                    while(lineArray[i] != ')'){
                        i++;
                    }
                    i+=2;
                    line1 = line.substring(temp,i);
                    npList.add(line1);
                }
            }
            npList.add("*");
        }

        for (int i=0; i<npList.size(); i++){
            if(!(npList.get(i).equals("*"))){
                System.out.print(npList.get(i));
                if(i<npList.size()-1 && npList.get(i+1).equals("*")){
                    System.out.println();
                }
            }
        }
    }
}

and fyi the main reason your code only selected the first occurrence of the NP is because you used the indexOf method to find the location. IndexOf ALWAYS and ONLY takes the first occurrence of a String you are searching for.

You have to iterate on parse tree and change index for Noun Phrase after getting first NP Phrase, simple approach can be just substring your line variable and start index of this substring will be np+1. Following are the changes you can make into your code:

while ((line = br.readLine())!= null){
        char[] lineArray = line.toCharArray();
        int indexOfNP = line.indexOf("(NP");
        while(indexOfNP!=-1) {
            np = findClosingParen(lineArray, indexOfNP);
            line1 = line.substring(indexOfNP, np + 1);
            System.out.print(line1 + "\n");
            npList.add(line1);
            line = line.substring(np+1);
            indexOfNP = line.indexOf("(NP");
            lineArray = line.toCharArray();
        }
}

For Recursive Solution:

public static void main(String[] args) throws IOException {

    ArrayList<String> npList = new ArrayList<String>();
    String line;
    String Input = "/local/Input/Temp/Temp.txt";
    String Output = "/local/Output/Temp/Temp-out.txt";

    FileInputStream fis = new FileInputStream (Input);
    BufferedReader br = new BufferedReader(new InputStreamReader(fis,"UTF-8"));
    while ((line = br.readLine())!= null){
        int indexOfNP = line.indexOf("(NP");
        if(indexOfNP>=0)
            extractNPs(npList,line,indexOfNP);
    }

    for(String npString:npList){
        System.out.println(npString);
    }

    br.close();
    fis.close();

}

public static ArrayList<String> extractNPs(ArrayList<String> arr,String  
                                                   parse, int indexOfNP){
    if(indexOfNP==-1){
        return arr;
    }
    else{
        int npIndex = findClosingParen(parse.toCharArray(), indexOfNP);
        String mainNP = new String(parse.substring(indexOfNP, npIndex + 1));
        arr.add(mainNP);
        //Uncomment Lines below if you also want MainNP along with all NPs     
        //within MainNP to be extracted
        /*
        mainNP = new String(mainNP.substring(3));
        if(mainNP.indexOf("(NP")>0){
            return extractNPs(arr,mainNP,mainNP.indexOf("(NP"));
        }
        */
        parse = new String(parse.substring(npIndex+1));
        indexOfNP = parse.indexOf("(NP");
        return extractNPs(arr,parse,indexOfNP);
    }
}

You are building a parser (.. for the code your natural language parser generated), which is a subject with wide academic documentation. The most simple parser you can build is a LL parser. Have a look at this artcle from wikipedia, that has some pretty good examples for you to get inspired: http://en.wikipedia.org/wiki/LL_parser

The wikipedia entry on parsing in general might give you a taste of the field of parsing in general: Wikipedia article : http://en.wikipedia.org/wiki/Parsing

来源：https://stackoverflow.com/questions/29179453/how-to-extract-noun-phrases-from-the-parsed-text

标签

java

stanford-nlp