How to get a parse in a bracketed format (without POS tags)?

Question

I want to parse a sentence into a binary parse of the following form (the format used in the SNLI corpus):

sentence: "A person on a horse jumps over a broken down airplane."

parse: ( ( ( A person ) ( on ( a horse ) ) ) ( ( jumps ( over ( a ( broken ( down airplane ) ) ) ) ) . ) )

I'm unable to find a parser which does this.

Note: this question has been asked before (How to get a binary parse in Python), but the answers there were not helpful, and I could not comment on them because I do not have the required reputation.

Answer 1:

Here is some sample code which will erase the labels for each node in the tree.

package edu.stanford.nlp.examples;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.util.*;

import java.util.*;

public class PrintTreeWithoutLabelsExample {

  public static void main(String[] args) {
    // set up pipeline properties
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,parse");
    // use faster shift reduce parser
    props.setProperty("parse.model", "edu/stanford/nlp/models/srparser/englishSR.ser.gz");
    props.setProperty("parse.maxlen", "100");
    props.setProperty("parse.binaryTrees", "true");
    // set up Stanford CoreNLP pipeline
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // build annotation for text
    Annotation annotation = new Annotation("The red car drove on the highway.");
    // annotate the text
    pipeline.annotate(annotation);
    for (CoreMap sentence : annotation.get(CoreAnnotations.SentencesAnnotation.class)) {
      Tree sentenceConstituencyParse = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
      for (Tree subTree : sentenceConstituencyParse.subTrees()) {
        if (!subTree.isLeaf())
          subTree.setLabel(CoreLabel.wordFromString(""));
      }
      TreePrint treePrint = new TreePrint("oneline");
      treePrint.printTree(sentenceConstituencyParse);
    }
  }
}

Answer 2:

I analyzed the accepted answer and, since I needed something in Python, wrote a simple function that produces the same result. To parse the sentences I adapted the version found at the referenced link.

import re
import string
from stanfordcorenlp import StanfordCoreNLP
from nltk import Tree
from functools import reduce
regex = re.compile('[%s]' % re.escape(string.punctuation))

def parse_sentence(sentence):
    nlp = StanfordCoreNLP(r'./stanford-corenlp-full-2018-02-27')
    sentence = regex.sub('', sentence)

    result = nlp.parse(sentence)
    result = result.replace('\n', '')
    result = re.sub(' +',' ', result)

    nlp.close()  # Do not forget to close! The backend server consumes a lot of memory.
    # Return text (str) rather than bytes so that binarize() can call str.replace on it.
    return result

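def binarize(parsed_sentence):
    # Flatten the parser output onto a single line.
    sentence = parsed_sentence.replace("\n", "")

    # Strip the listed labels. Longer labels come first so that "(S" does not
    # clip the front of "(SBAR" or "(SINV".
    for pattern in ["ROOT", "SINV", "SBAR", "ADJP", "NNS", "VBP", "NP", "PP",
                    "VP", "DT", "JJ", "RB", "S"]:
        sentence = sentence.replace("({}".format(pattern), "(")

    sentence = re.sub(' +',' ', sentence)
    return sentence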
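
The hard-coded label list in binarize only removes the labels it names, so a tag such as NN or VBZ would stay in the output. A minimal sketch, assuming the parser output is a standard Penn Treebank style bracketing, strips whatever label follows each opening parenthesis with one regular expression. The name strip_labels is only illustrative and is not part of any library.

```python
import re

def strip_labels(parsed_sentence):
    """Remove the label that directly follows every opening parenthesis."""
    # Collapse newlines and repeated whitespace into single spaces.
    flat = " ".join(parsed_sentence.split())
    # A label is the run of non-space, non-parenthesis characters after "(".
    return re.sub(r"\([^\s()]+", "(", flat)

print(strip_labels("(ROOT (S (NP (DT The) (NN car)) (VP (VBD drove))))"))
```

Like binarize, this expects labeled parser output; running it on an already stripped string would also delete words that happen to touch an opening parenthesis.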

Neither my version nor the accepted one delivers the same result as the SNLI or MultiNLI corpus, because those corpora merge adjacent single leaves of the tree into one constituent. An example from the MultiNLI corpus shows

"( ( The ( new rights ) ) ( are ( nice enough ) ) )",

whereas both answers here return

'( ( ( ( The) ( new) ( rights)) ( ( are) ( ( nice) ( enough)))))'.

I am not an expert in NLP, so I hope this does not make any difference. At least it does not for my applications.
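
If the difference does matter, the gap can be closed in post-processing. The following is a rough sketch, not the procedure used to build SNLI or MultiNLI: it reads a labeled parse string, drops the labels, collapses every node with a single child (so a word is no longer wrapped in its own brackets), and folds nodes with more than two children from the right. The right-branching fold is an assumption that happens to match the two corpus examples quoted in this thread; other sentences may come out differently from the corpora. All function names are hypothetical.

```python
def parse_labeled(s):
    """Read a labeled bracketed parse into nested lists; labels are dropped."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        pos += 2  # skip "(" and the label (ROOT, NP, DT, ...)
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip ")"
        return children

    return node()

def simplify(tree):
    """Collapse single-child nodes and fold wider nodes from the right."""
    if isinstance(tree, str):
        return tree
    kids = [simplify(k) for k in tree]
    while len(kids) > 2:
        # Assumption: group the two rightmost children first.
        kids[-2:] = [[kids[-2], kids[-1]]]
    return kids[0] if len(kids) == 1 else kids

def to_brackets(tree):
    """Print nested lists in the spaced style used by SNLI."""
    if isinstance(tree, str):
        return tree
    return "( " + " ".join(to_brackets(k) for k in tree) + " )"

def snli_style(parsed_sentence):
    return to_brackets(simplify(parse_labeled(parsed_sentence)))

labeled = ("(ROOT (S (NP (DT The) (JJ new) (NNS rights)) "
           "(VP (VBP are) (ADJP (JJ nice) (RB enough)))))")
print(snli_style(labeled))
```

The input must still carry its labels, for example the raw string returned by the parser before binarize is applied, because the reader skips one token after each opening parenthesis.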

Source: https://stackoverflow.com/questions/49685032/how-to-get-a-parse-in-a-bracketed-format-without-pos-tags

Tags

python

nlp

stanford-nlp