Select Constituents to parse tree representation

Question

Suppose we have the following sentence and its constituent spans:

s = "Our intent is to promote the best alternative he says"
spans = [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]

Suppose I delete the spans (0, 3) and (8, 10).

I then want to bracket the word indices according to the remaining spans, like this:

(((0  1  2)  (3  (4  ((5  6  7)  8))))  9  10)

where 0, 1, ..., 10 are the indices of the individual words of the sentence.

For instance, suppose we remove ONLY "he says" and "Our intent is". Here, the span of "Our intent is" is (0, 3), and the span of "he says" is (8, 10). The final tree in bracketed form should then look like this:

"(ROOT (S (S (S (S (S Our) (S intent)) (S is) (S (S to) (S (S promote) (S (S (S the) (S best)) (S alternative))))) (S he) (S says))))"

As another instance, if we remove ONLY "to promote the best alternative" and "Our intent is to promote the best alternative", the final tree in bracketed form should look like this:

"(ROOT (S (S (S Our) (S intent)) (S is)) (S to) (S (S promote) (S (S (S the) (S best)) (S alternative))) (S (S he) (S says)))"

We can assume that the full sentence "Our intent is to promote the best alternative he says" will NEVER be deleted. The same holds for the single words of the sentence, just to give you some background.
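To make the span convention concrete: judging from the examples above, a span appears to be a half-open (start, end) pair over word indices, so that (0, 3) covers words 0, 1 and 2. A minimal sketch under that assumption (phrase_of is a made-up helper name, used only for illustration):

```python
s = "Our intent is to promote the best alternative he says"
spans = [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]

words = s.split()

def phrase_of(span):
    # Assumes half-open spans: start is included, end is excluded.
    start, end = span
    return " ".join(words[start:end])

# Deleting constituents is just filtering the span list.
deleted = [(0, 3), (8, 10)]
remaining = [span for span in spans if span not in deleted]

for span in deleted:
    print(span, "->", phrase_of(span))
print(remaining)
```

The deleted spans map to "Our intent is" and "he says", and the remaining list is what a tree-building function would receive.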

I am looking for a way to produce either or both of the following:

A bracketed representation over the indices, given the spans.
A bracketed tree string with the start symbol, where "S" denotes a non-terminal node.

Answer 1:

Assuming that the spans are given in post-order (children before their parent, as in the example), I would iterate over them in reverse, which gives a pre-order traversal with children visited from right to left. When the next span does not overlap the node currently being built, it is not a child of that node (it belongs to a sibling or further up the tree); when it does overlap, it is a child. This can be used to steer the recursion in a recursive algorithm. There is also a loop to allow for an arbitrary number of children (the tree is not necessarily binary):

def to_tree(phrase, spans):
    words = phrase.split()
    # Reverse iteration: parents come before children, right siblings first.
    span_iter = reversed(spans)  # named span_iter to avoid shadowing the built-in iter()
    current = None  # most recently consumed span, shared by all recursion levels

    def encode(start, end):
        # Leaf nodes for the single words in [start, end), from right to left.
        return ["(S {})".format(words[k]) for k in range(end - 1, start - 1, -1)]

    def recur(start, end):
        nonlocal current
        nodes = []  # collected from right to left, reversed when joined
        current = next(span_iter, None)
        while current and current[1] > start:  # overlap => it's a child
            child_start, child_end = current
            assert child_start >= start and child_end <= end, "Invalid spans"
            # encode what comes at the right of this child (single words):
            nodes.extend(encode(child_end, end))
            # encode the child itself using recursion
            nodes.append(recur(child_start, child_end))
            end = child_start
        # encode the single words left of the leftmost child:
        nodes.extend(encode(start, end))
        return "(S {})".format(" ".join(reversed(nodes))) if len(nodes) > 1 else nodes[0]

    return "(ROOT {})".format(recur(0, len(words)))

You would call it like so:

phrase = "Our intent is to promote the best alternative he says"
spans = [(0, 2), (0, 3), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8), (8, 10)]
print(to_tree(phrase, spans))

The output is not exactly the same as in the examples you have given. This code will never produce a nesting like (S (S ... )), which would represent a node with exactly one child. In that case this code just generates one level, (S ... ). On the other hand, the root will always consist of a single (S ... ) that wraps all other nodes.
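For the first of the two requested outputs (brackets over indices), and for spans that are not guaranteed to arrive in a particular order, a stack-based variant is one option. The following is only a sketch and not the method described above: the names build_tree and render are made up for illustration, spans are assumed to be half-open (start, end) pairs, and single-word spans and the full-sentence span are treated as implicit.

```python
def build_tree(n_words, spans):
    """Nest half-open (start, end) spans into a tree of dicts (illustrative helper)."""
    root = {"span": (0, n_words), "children": []}
    stack = [root]  # path from the root to the most recently added node
    # Sorting by (start, -end) puts every parent before its children and
    # left siblings before right ones, whatever the input order was.
    for start, end in sorted(set(spans), key=lambda sp: (sp[0], -sp[1])):
        if not 0 <= start < end <= n_words:
            raise ValueError("span out of range: {}".format((start, end)))
        if end - start == 1 or (start, end) == (0, n_words):
            continue  # single words and the full sentence are implicit
        # Leave every node that ends before this span begins.
        while stack[-1]["span"][1] <= start:
            stack.pop()
        parent_start, parent_end = stack[-1]["span"]
        if not (parent_start <= start and end <= parent_end):
            raise ValueError("crossing span: {}".format((start, end)))
        node = {"span": (start, end), "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def render(node, leaf, label=""):
    """Bracket a node; leaf(k) renders word k, label is put after each '('."""
    start, end = node["span"]
    parts, pos = [], start
    for child in node["children"]:
        # single words between the previous child and this one
        parts.extend(leaf(k) for k in range(pos, child["span"][0]))
        parts.append(render(child, leaf, label))
        pos = child["span"][1]
    # single words to the right of the last child
    parts.extend(leaf(k) for k in range(pos, end))
    return "({}{})".format(label, " ".join(parts))


phrase = "Our intent is to promote the best alternative he says"
words = phrase.split()
spans = [(0, 2), (5, 7), (5, 8), (4, 8), (3, 8), (0, 8)]  # (0, 3) and (8, 10) removed

tree = build_tree(len(words), spans)
index_brackets = render(tree, str)
tree_string = "(ROOT {})".format(
    render(tree, lambda k: "(S {})".format(words[k]), "S "))
print(index_brackets)  # (((0 1) 2 (3 (4 ((5 6) 7)))) 8 9)
print(tree_string)
```

Because the spans are sorted first, their input order does not matter, and crossing spans raise a ValueError instead of tripping an assertion. Like the code above, this sketch does not produce a (S (S ... )) nesting for a node with a single child.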

Source: https://stackoverflow.com/questions/65476660/select-constituents-to-parse-tree-representation

Tags

python

string

nltk