Converting POS tags from TextBlob into Wordnet compatible inputs

守給你的承諾、 提交于 2019-12-23 06:07:13

问题


I'm using Python and nltk + Textblob for some text analysis. It's interesting that you can add a POS for wordnet to make your search for synonyms more specific, but unfortunately the tagging in both nltk and Textblob aren't "compatible" with the kind of input that wordnet expects for it's synset class.

Example Wordnet.synsets() requires that the POS you give it is one of n,v,a,r, like so

wn.synsets("dog", POS="n,v,a,r")

But a standard POS tagging from upenn_treebank looks like

JJ, VBD, VBZ, etc.

So I'm looking for a good way to convert between the two.

Does anyone know of a good way to make this conversion happen, besides brute force?


回答1:


If textblob is using the PennTreeBank (ptb) tagset, then just use the first character in the POS tag to map to the WN pos tag.

WN POS tagset includes 'a' = adjective/adverbs, 's'=satelite adjective, 'n' = nouns and 'v' = verbs.

try:

>>> from nltk import word_tokenize, pos_tag
>>> from nltk.corpus import wordnet as wn
>>> text = 'this is a pos tagset in some foo bar paradigm'
>>> pos_tag(word_tokenize(text))
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('pos', 'NN'), ('tagset', 'NN'), ('in', 'IN'), ('some', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('paradigm', 'NN')]
>>> for tok, pos in pos_tag(word_tokenize(text)):
...     pos = pos[0].lower()
...     if pos in ['a', 'n', 'v']:
...             wn.synsets(tok, pos)
... 
[Synset('be.v.01'), Synset('be.v.02'), Synset('be.v.03'), Synset('exist.v.01'), Synset('be.v.05'), Synset('equal.v.01'), Synset('constitute.v.01'), Synset('be.v.08'), Synset('embody.v.02'), Synset('be.v.10'), Synset('be.v.11'), Synset('be.v.12'), Synset('cost.v.01')]
[Synset('polonium.n.01'), Synset('petty_officer.n.01'), Synset('po.n.03'), Synset('united_states_post_office.n.01')]
[]
[]
[Synset('barroom.n.01'), Synset('bar.n.02'), Synset('bar.n.03'), Synset('measure.n.07'), Synset('bar.n.05'), Synset('prevention.n.01'), Synset('bar.n.07'), Synset('bar.n.08'), Synset('legal_profession.n.01'), Synset('stripe.n.05'), Synset('cake.n.01'), Synset('browning_automatic_rifle.n.01'), Synset('bar.n.13'), Synset('bar.n.14'), Synset('bar.n.15')]
[Synset('paradigm.n.01'), Synset('prototype.n.01'), Synset('substitution_class.n.01'), Synset('paradigm.n.04')]


来源:https://stackoverflow.com/questions/24975499/converting-pos-tags-from-textblob-into-wordnet-compatible-inputs

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!