Generating PCFG from Universal tagset [duplicate]

江枫思渺然 submitted on 2019-12-24 10:49:31

Question


I am trying to build a PCFG using the POS tags obtained from the code below:

from nltk.corpus import treebank

corpus = treebank.tagged_sents(tagset='universal')
tags = set()

for sent in corpus:
    for (word, tag) in sent: 
        tags.add(tag)

tags = list(tags)
print(tags)

This gives:

['ADV', 'NOUN', 'ADP', 'PRON', 'DET', '.', 'PRT', 'NUM', 'X', 'CONJ', 'ADJ', 'VERB']

I need to generate a PCFG using the POS tags above. However, when I try to construct a grammar from the rule

nltk.grammar.PCFG.fromstring("""T5 -> . NT6 [0.136235]""")

it raises

ValueError: Unable to parse line 1: T5 -> . NT6 [0.136235]
Expected a nonterminal, found: . NT6 [0.136235]

I assume the exception means that "." is not accepted as a non-terminal by nltk.grammar.PCFG.fromstring. Is there a neat way to fix this?

Related

The question "nltk cant interpret grammar category PRP$ output by stanford parser" gives a nice fix that adds '$' from the Treebank tagset to the grammar. But the Treebank POS tagset also contains single quotes ('') as a POS tag, which is not a valid symbol either.

Is there a neat workaround for this problem that does not require adding each special character to the grammar?


Answer 1:


I found the answer to this question. Instead of using the fromstring method, build the PCFG object directly by passing a start symbol (an nltk.Nonterminal object) and a list of nltk.ProbabilisticProduction objects, as below:

from nltk import Nonterminal as NT
from nltk import ProbabilisticProduction
from nltk.grammar import PCFG

# A production whose right-hand side uses the "." tag as a nonterminal
p1 = ProbabilisticProduction(NT('T5'), [NT('.'), NT('NT6')], prob=1.0)

# Adding a terminal production (kept in its own variable so p1 is not overwritten)
p2 = ProbabilisticProduction(NT('NT6'), ['terminal'], prob=1.0)

start = NT('Q0')  # Q0 is the start symbol for my grammar (its own rules are not shown here)
grammar = PCFG(start, [p1, p2])  # Takes a list of ProbabilisticProductions
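
If you would rather keep writing rules as text, another option is to rename the awkward tags before building the rule strings. The following is only a sketch of that idea, not part of the original answer: the names READABLE_NAMES, safe_symbol and format_rule are made up for illustration, and it assumes that symbols consisting only of letters, digits and underscores are accepted as nonterminals by fromstring.

```python
import re

# Hypothetical readable names for a few punctuation tags; extend as needed.
READABLE_NAMES = {".": "PUNCT", "$": "DOLLAR", "''": "CLOSEQUOTE", "``": "OPENQUOTE"}


def safe_symbol(tag):
    """Return a name made only of letters, digits and underscores for a POS tag."""
    if tag in READABLE_NAMES:
        return READABLE_NAMES[tag]
    # Replace any remaining non-word character with its code point,
    # e.g. "PRP$" becomes "PRP_36_".
    return re.sub(r"\W", lambda m: "_%d_" % ord(m.group()), tag)


def format_rule(lhs, rhs, prob):
    """Render one rule in the text format used in the question."""
    rhs_text = " ".join(safe_symbol(symbol) for symbol in rhs)
    return "%s -> %s [%s]" % (safe_symbol(lhs), rhs_text, prob)


# The rule from the question, with "." renamed.
rule = format_rule("T5", [".", "NT6"], 0.136235)
print(rule)  # T5 -> PUNCT NT6 [0.136235]
```

The renamed symbols have to be used consistently everywhere in the grammar, and you need to make sure a generated name does not collide with a tag that already exists in your tagset.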


Source: https://stackoverflow.com/questions/43037697/generating-pcfg-from-universal-tagset
