Extracting specific leaf value from nltk tree structure with Python

Anonymous (unverified), submitted on 2019-12-03 02:49:01

Question:

I have some questions about NLTK's tree functions. I am trying to extract a certain word from a tree structure like the one given below.

test = Tree.parse('(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))(PP (IN on)(NP (DT a)(NN date)))))))')

print "Input tree: ", test
print test.leaves()

Input tree:  (ROOT
  (SBARQ
    (WHADVP (WRB How))
    (SQ
      (VBP do)
      (NP (PRP you))
      (VP
        (VB ask)
        (NP (DT a) (JJ total) (NN stranger))
        (PRT (RP out))
        (PP (IN on) (NP (DT a) (NN date)))))))

['How', 'do', 'you', 'ask', 'a', 'total', 'stranger', 'out', 'on', 'a', 'date']

I can get a list of all the words using the leaves() function. Is there a way to get a specific leaf only? For example, I would like to get the first and last noun from the NP phrases only. The answer would be 'stranger' for the first noun and 'date' for the last noun.

Answer 1:

Although noun phrases can be nested inside other types of phrases, I believe most grammars always have nouns in noun phrases. So your question can probably be rephrased as: How do you find the first and last nouns?

You can simply get all tuples of words and POS tags and filter them like this:

>>> [word for word,pos in test.pos() if pos=='NN']
['stranger', 'date']

In this case there are only two nouns, so you're done. If you had more nouns, you would just index the list at [0] for the first and [-1] for the last.
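The indexing step can be wrapped in a small helper. This is only a sketch, assuming NLTK 3 on Python 3, where Tree.fromstring is the name for what the question calls Tree.parse; the function name first_and_last is hypothetical and not part of NLTK.

```python
from nltk.tree import Tree

# Same sentence as in the question. Assumption: NLTK 3, where
# Tree.fromstring replaces the older Tree.parse.
test = Tree.fromstring(
    "(ROOT (SBARQ (WHADVP (WRB How)) (SQ (VBP do) (NP (PRP you)) "
    "(VP (VB ask) (NP (DT a) (JJ total) (NN stranger)) (PRT (RP out)) "
    "(PP (IN on) (NP (DT a) (NN date)))))))"
)


def first_and_last(tree, tag="NN"):
    """Hypothetical helper: return (first, last) word tagged `tag`.

    Returns (None, None) when no word in the tree carries that tag.
    """
    words = [word for word, pos in tree.pos() if pos == tag]
    if not words:
        return None, None
    return words[0], words[-1]


first_noun, last_noun = first_and_last(test)
print(first_noun, last_noun)
```

Returning (None, None) for a missing tag avoids the IndexError that [0] would raise on an empty list.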


If you were looking for another POS tag that can appear in different kinds of phrases but you only wanted its occurrences inside a particular one, or if you had a strange grammar that allowed nouns outside of NPs, you can do the following.

You can find the 'NP' subtrees like this:

>>> NPs = list(test.subtrees(filter=lambda x: x.node=='NP'))
>>> NPs
[Tree('NP', [Tree('PRP', ['you'])]), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['total']), Tree('NN', ['stranger'])]), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['date'])])]
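The snippet above reads the label through the .node attribute, which is the older NLTK 2 spelling. In NLTK 3 the label is read with the label() method instead. A minimal sketch of the same NP lookup under that assumption (NLTK 3, Python 3):

```python
from nltk.tree import Tree

# Same sentence as in the question, built with the NLTK 3 constructor.
test = Tree.fromstring(
    "(ROOT (SBARQ (WHADVP (WRB How)) (SQ (VBP do) (NP (PRP you)) "
    "(VP (VB ask) (NP (DT a) (JJ total) (NN stranger)) (PRT (RP out)) "
    "(PP (IN on) (NP (DT a) (NN date)))))))"
)

# subtrees() walks the tree top-down; the filter keeps only NP nodes.
NPs = list(test.subtrees(filter=lambda t: t.label() == "NP"))

# Show the words covered by each noun phrase.
for np_tree in NPs:
    print(np_tree.label(), np_tree.leaves())
```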

Continuing to narrow down the subtrees, we can use this result to look for 'NN' words:

>>> NNs_inside_NPs = map(lambda x: list(x.subtrees(filter=lambda x: x.node=='NN')), NPs)
>>> NNs_inside_NPs
[[], [Tree('NN', ['stranger'])], [Tree('NN', ['date'])]]

This is a list of lists: one inner list per 'NP' phrase, holding all the 'NN' subtrees inside that phrase. In this case each phrase happens to contain zero or one noun.

Now we just need to go through the 'NP's and get all the leaves of the individual nouns (which really means we just want to access the 'stranger' part of Tree('NN', ['stranger'])).

>>> [noun.leaves()[0] for nouns in NNs_inside_NPs for noun in nouns]
['stranger', 'date']
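The three steps above (find the NPs, find the NNs inside them, take their leaves) can be folded into one function. This is a sketch under the assumption of NLTK 3 on Python 3, so it uses Tree.fromstring and label() in place of Tree.parse and .node; the name tagged_words_in is hypothetical.

```python
from nltk.tree import Tree

# Same sentence as in the question, built with the NLTK 3 constructor.
test = Tree.fromstring(
    "(ROOT (SBARQ (WHADVP (WRB How)) (SQ (VBP do) (NP (PRP you)) "
    "(VP (VB ask) (NP (DT a) (JJ total) (NN stranger)) (PRT (RP out)) "
    "(PP (IN on) (NP (DT a) (NN date)))))))"
)


def tagged_words_in(tree, phrase="NP", tag="NN"):
    """Hypothetical helper: words tagged `tag` inside subtrees labelled `phrase`."""
    words = []
    for sub in tree.subtrees(filter=lambda t: t.label() == phrase):
        for pre in sub.subtrees(filter=lambda t: t.label() == tag):
            # A POS-tag node has exactly one child: the word itself.
            words.append(pre[0])
    return words


nouns = tagged_words_in(test)
print(nouns)                 # ['stranger', 'date']
print(nouns[0], nouns[-1])   # first and last noun
```

One caveat: subtrees() also yields phrases nested inside other phrases, so with an NP inside another NP a noun would be collected once per enclosing NP. The example sentence has no nested NPs, so each noun appears once.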

