How can I remove POS tags before slashes in nltk?

一笑奈何 提交于 2019-12-05 15:02:52

Since you tagged this nltk, let's use the NLTK's tree parser to process your trees. We'll read in each tree, then simply print out the leaves. Done.

>>> text ="(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())

['Jack', 'stayed', 'in', 'London']

The lambda form splits each word/tag pair and discards the tag, keeping just the word.

Multiple trees

I know, you're going to ask me how to process a whole file's worth of such trees, and some of them take more than one line. That's the job of the NLTK's BracketParseCorpusReader, but it expects terminals to be in the form (POS word) instead of word/POS. I won't bother doing it that way, since it's even easier to trick Tree.fromstring() into reading all your trees as if they're branches of a single tree:

allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT "+ allmytext + " )"  # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:
    print(tree.leaves())

As you see, the only difference is we added "(ROOT " and " )" around the file contents, and used a for-loop to generate the output. The loop gives us the children of the top node, i.e. the actual trees.

>>> import re
>>> clause = "(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))"
>>> pattern = r'\w+:?(?=\/)'
>>> re.findall(pattern, clause)
['Jack', 'loved', 'Peter']

EDITED

For multiple clauses:

>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP school/NN) (VP is/VBZ situated/VBN) (NP in/IN London/NNP)) (CLAUSE (NP The/DT color/NN of/IN the/DT sky/NN) (VP is/VBZ) (NP pink/NN))"""
>>> [re.findall(pattern, clause) for clause in clauses.split(' (CLAUSE ')]
[['school', 'is', 'situated', 'in', 'London'], ['The', 'color', 'of', 'the', 'sky', 'is', 'pink']]

If clauses are separated by newline:

>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
... (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
... (CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))"""
>>> [re.findall(pattern, clause) for clause in clauses.split('\n')]
[['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]

To join the output into a string:

>>> " ".join(['Jack', 'loved', 'Peter'])
'Jack loved Peter'

>>> clauses = [['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
>>> [" ".join(cl) for cl in clauses]
['Jack loved Peter', 'Jack stayed in London', 'Tom is in Kolkata']

I'm trying something like this:

import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'

tmp = re.split(r'[()/ ]', tmp)
#Use 're.split()' to split by character that was not a letter.
>>> ['', 'CLAUSE', '', 'NP', 'Jack', 'NNP', '', '', 'VP', 'loved', 'VBD', '', '', 'NP', 'Peter', 'NNP', '', '']

result = (tmp[4], tmp[9], tmp[14])
>>> ('Jack', 'loved', 'Peter')

Is this what you want?

EDIT:

I should thought it through:(.

import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'

tmp = re.sub(r'[()]', '', tmp)
>>> 'CLAUSE NP Jack/NNP VP loved/VBD NP Peter/NNP'
result = re.findall(r'[a-zA-Z]*/', tmp)
>>> ['Jack/', 'loved/', 'Peter/']
#Now create a generator.
gen = (i[:-1] for i in result)
tuple(gen)
>>> ('Jack', 'loved', 'Peter')

When the outputs are these: (CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP)) (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP)) (CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP)) It is quite obvious that they are tree. So just use Tree.leaves(). Here is the full code:

def leaves(self):
    """
    Return the leaves of the tree.

        >>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
        >>> t.leaves()
        ['the', 'dog', 'chased', 'the', 'cat']

    :return: a list containing this tree's leaves.
        The order reflects the order of the
        leaves in the tree's hierarchical structure.
    :rtype: list
    """
    leaves = []
    for child in self:
        if isinstance(child, Tree):
            leaves.extend(child.leaves())
        else:
            leaves.append(child)
    return leaves

You can find it from here: http://www.nltk.org/_modules/nltk/tree.html

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!