问题
This is part of my project where I need to represent the output after phrase detection like this - (a,x,b) where a, x, b are phrases. I constructed the code and got the output like this:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
I want to make it just like the previous representation which means I have to remove 'CLAUSE', 'NP', 'VP', 'VBD', 'NNP' etc tags.
How to do that?
What I tried
First wrote this in a text file, tokenize and used list.remove('word')
. But that is not at all helpful.
I am clarifying a bit more.
My Input
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
Output will be
[Jack,loved,Peter], [Jack,stayed,in London] The output is just according to the braces and without the tags.
回答1:
Since you tagged this nltk
, let's use the NLTK's tree parser to process your trees. We'll read in each tree, then simply print out the leaves. Done.
>>> text ="(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())
['Jack', 'stayed', 'in', 'London']
The lambda form splits each word/tag
pair and discards the tag, keeping just the word.
Multiple trees
I know, you're going to ask me how to process a whole file's worth of such trees, and some of them take more than one line. That's the job of the NLTK's BracketParseCorpusReader
, but it expects terminals to be in the form (POS word)
instead of word/POS
. I won't bother doing it that way, since it's even easier to trick Tree.fromstring()
into reading all your trees as if they're branches of a single tree:
allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT "+ allmytext + " )" # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:
print(tree.leaves())
As you see, the only difference is we added "(ROOT "
and " )"
around the file contents, and used a for-loop to generate the output. The loop gives us the children of the top node, i.e. the actual trees.
回答2:
>>> import re
>>> clause = "(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))"
>>> pattern = r'\w+:?(?=\/)'
>>> re.findall(pattern, clause)
['Jack', 'loved', 'Peter']
EDITED
For multiple clauses:
>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP school/NN) (VP is/VBZ situated/VBN) (NP in/IN London/NNP)) (CLAUSE (NP The/DT color/NN of/IN the/DT sky/NN) (VP is/VBZ) (NP pink/NN))"""
>>> [re.findall(pattern, clause) for clause in clauses.split(' (CLAUSE ')]
[['school', 'is', 'situated', 'in', 'London'], ['The', 'color', 'of', 'the', 'sky', 'is', 'pink']]
If clauses are separated by newline:
>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
... (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
... (CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))"""
>>> [re.findall(pattern, clause) for clause in clauses.split('\n')]
[['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
To join the output into a string:
>>> " ".join(['Jack', 'loved', 'Peter'])
'Jack loved Peter'
>>> clauses = [['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
>>> [" ".join(cl) for cl in clauses]
['Jack loved Peter', 'Jack stayed in London', 'Tom is in Kolkata']
回答3:
I'm trying something like this:
import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'
tmp = re.split(r'[()/ ]', tmp)
#Use 're.split()' to split by character that was not a letter.
>>> ['', 'CLAUSE', '', 'NP', 'Jack', 'NNP', '', '', 'VP', 'loved', 'VBD', '', '', 'NP', 'Peter', 'NNP', '', '']
result = (tmp[4], tmp[9], tmp[14])
>>> ('Jack', 'loved', 'Peter')
Is this what you want?
EDIT:
I should thought it through:(.
import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'
tmp = re.sub(r'[()]', '', tmp)
>>> 'CLAUSE NP Jack/NNP VP loved/VBD NP Peter/NNP'
result = re.findall(r'[a-zA-Z]*/', tmp)
>>> ['Jack/', 'loved/', 'Peter/']
#Now create a generator.
gen = (i[:-1] for i in result)
tuple(gen)
>>> ('Jack', 'loved', 'Peter')
回答4:
When the outputs are these:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
It is quite obvious that they are tree. So just use Tree.leaves().
Here is the full code:
def leaves(self):
"""
Return the leaves of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.leaves()
['the', 'dog', 'chased', 'the', 'cat']
:return: a list containing this tree's leaves.
The order reflects the order of the
leaves in the tree's hierarchical structure.
:rtype: list
"""
leaves = []
for child in self:
if isinstance(child, Tree):
leaves.extend(child.leaves())
else:
leaves.append(child)
return leaves
You can find it from here: http://www.nltk.org/_modules/nltk/tree.html
来源:https://stackoverflow.com/questions/33705555/how-can-i-remove-pos-tags-before-slashes-in-nltk