As part of a project, I need to represent the output of phrase detection as (a, x, b), where a, x and b are phrases. My code currently produces output like this:

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))

I want to turn this into the representation above, which means removing the tags 'CLAUSE', 'NP', 'VP', 'VBD', 'NNP', etc.

How can I do that?

What I tried

First I wrote this to a text file, tokenized it, and used list.remove('word'), but that did not help at all. To clarify a bit more:

My Input

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP)) (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))

Expected output

[Jack,loved,Peter], [Jack,stayed,in London]

The output follows the bracket structure, with the tags removed.

Since you tagged this nltk, let's use NLTK's tree parser to process your trees. We'll read in each tree, then simply print out the leaves. Done.

>>> import nltk
>>> text = "(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())
['Jack', 'stayed', 'in', 'London']

The lambda form splits each word/tag pair and discards the tag, keeping just the word.
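One caveat with split("/")[0]: a token whose word itself contains a slash (say 1/2/CD) would be cut at the first slash. Here is a minimal sketch of an alternative, assuming the tag is always the part after the last slash. The helper name strip_tag is made up for illustration:

```python
def strip_tag(token):
    """Return the word part of a word/TAG token, splitting at the last slash only."""
    word, sep, _tag = token.rpartition("/")
    # rpartition returns ('', '', token) when there is no slash at all,
    # so fall back to the whole token in that case.
    return word if sep else token


print(strip_tag("London/NNP"))
print(strip_tag("1/2/CD"))
print(strip_tag("untagged"))
```

It can be passed as read_leaf=strip_tag in place of the lambda.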

Multiple trees

I know, you're going to ask me how to process a whole file's worth of such trees, some of which take more than one line. That's the job of NLTK's BracketParseCorpusReader, but it expects terminals to be in the form (POS word) instead of word/POS. I won't bother doing it that way, since it's even easier to trick Tree.fromstring() into reading all your trees as if they were branches of a single tree:

import nltk

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT " + allmytext + " )"  # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:
    print(tree.leaves())

As you see, the only difference is we added "(ROOT " and " )" around the file contents, and used a for-loop to generate the output. The loop gives us the children of the top node, i.e. the actual trees.
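The loop above prints one flat word list per clause. If you want the grouping from the question, with one item per phrase, you can go one level deeper and join the leaves of each NP/VP subtree. This is a sketch, assuming every child of a CLAUSE is itself a subtree (true for the sample data). The variable names are mine:

```python
import nltk

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""

# Wrap everything in a single ROOT node and strip the tags while reading.
root = nltk.Tree.fromstring("(ROOT " + allmytext + ")",
                            read_leaf=lambda tok: tok.split("/")[0])

# Each child of ROOT is a CLAUSE; each child of a CLAUSE is an NP or VP subtree.
triples = [tuple(" ".join(phrase.leaves()) for phrase in clause)
           for clause in root]

for triple in triples:
    print(triple)
```

A clause with a bare word directly under CLAUSE would break the assumption, because a plain string has no leaves() method.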

>>> import re
>>> clause = "(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))"
>>> pattern = r'\w+:?(?=\/)'
>>> re.findall(pattern, clause)
['Jack', 'loved', 'Peter']

EDITED

For multiple clauses:

>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP school/NN) (VP is/VBZ situated/VBN) (NP in/IN London/NNP)) (CLAUSE (NP The/DT color/NN of/IN the/DT sky/NN) (VP is/VBZ) (NP pink/NN))"""
>>> [re.findall(pattern, clause) for clause in clauses.split(' (CLAUSE ')]
[['school', 'is', 'situated', 'in', 'London'], ['The', 'color', 'of', 'the', 'sky', 'is', 'pink']]

If clauses are separated by newline:

>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
... (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
... (CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))"""
>>> [re.findall(pattern, clause) for clause in clauses.split('\n')]
[['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
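
Both splits assume a particular separator (' (CLAUSE ' or a newline). If the clauses may be laid out arbitrarily, a small bracket counter can cut the text into top-level groups first. This is only a sketch and assumes the parentheses are balanced. split_top_level is a hypothetical helper, not part of any library:

```python
import re


def split_top_level(text):
    """Yield each top-level parenthesised group found in text."""
    depth = 0
    start = None
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level group opens here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and start is not None:
                yield text[start:i + 1]
                start = None


pattern = r"\w+(?=/)"  # a run of word characters directly before a slash
clauses = ("(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))\n"
           "   (CLAUSE (NP Tom/NNP) (VP is/VBZ)\n"
           " (NP in/IN Kolkata/NNP))")

words = [re.findall(pattern, clause) for clause in split_top_level(clauses)]
print(words)
```

The second clause spans two lines and is indented, yet it still comes out as a single group.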

To join the output into a string:

>>> " ".join(['Jack', 'loved', 'Peter'])
'Jack loved Peter'

>>> clauses = [['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
>>> [" ".join(cl) for cl in clauses]
['Jack loved Peter', 'Jack stayed in London', 'Tom is in Kolkata']

I tried something like this:

import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'

# Use re.split() to split on parentheses, slashes and spaces.
tmp = re.split(r'[()/ ]', tmp)
# tmp is now:
# ['', 'CLAUSE', '', 'NP', 'Jack', 'NNP', '', '', 'VP', 'loved', 'VBD', '', '', 'NP', 'Peter', 'NNP', '', '']

# The words sit at fixed positions for this particular clause shape.
result = (tmp[4], tmp[9], tmp[14])
print(result)  # ('Jack', 'loved', 'Peter')

Is this what you want?

EDIT:

I should have thought it through :(

import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'

# Drop the parentheses.
tmp = re.sub(r'[()]', '', tmp)
# 'CLAUSE NP Jack/NNP VP loved/VBD NP Peter/NNP'

# Grab every run of letters that is followed by a slash.
result = re.findall(r'[a-zA-Z]+/', tmp)
# ['Jack/', 'loved/', 'Peter/']

# Now create a generator that strips the trailing slash.
gen = (i[:-1] for i in result)
print(tuple(gen))  # ('Jack', 'loved', 'Peter')

When the output looks like this:

(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))

it is quite obvious that these are trees, so just use Tree.leaves(). Here is the full code of that method:

def leaves(self):
    """
    Return the leaves of the tree.

        >>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
        >>> t.leaves()
        ['the', 'dog', 'chased', 'the', 'cat']

    :return: a list containing this tree's leaves.
        The order reflects the order of the
        leaves in the tree's hierarchical structure.
    :rtype: list
    """
    leaves = []
    for child in self:
        if isinstance(child, Tree):
            leaves.extend(child.leaves())
        else:
            leaves.append(child)
    return leaves

You can find it here: http://www.nltk.org/_modules/nltk/tree.html
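
Note that leaves() returns the leaf strings as they were parsed, so with the word/TAG input here each leaf still carries its tag unless it is stripped at read time or afterwards. Here is a sketch of the second option, assuming the tag is whatever follows the last slash:

```python
from nltk import Tree

text = "(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
tree = Tree.fromstring(text)

tagged = tree.leaves()  # leaves are still in word/TAG form
words = [leaf.rsplit("/", 1)[0] for leaf in tagged]  # cut at the last slash

print(tagged)
print(words)
```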

Source: https://stackoverflow.com/questions/33705555/how-can-i-remove-pos-tags-before-slashes-in-nltk
