nltk StanfordNERTagger : How to get proper nouns without capitalization

为君一笑 submitted on 2019-11-30 10:04:23
alvas

First, see your other question on how to set up Stanford CoreNLP so it can be called from the command line or from Python: nltk : How to prevent stemming of proper nouns.

For the properly cased sentence, we see that the NER works as expected:

>>> from corenlp import StanfordCoreNLP
>>> nlp = StanfordCoreNLP('http://localhost:9000')
>>> text = ('John Donk works POI Jones wants meet Xyz Corp measuring POI short term performance metrics. '
... 'john donk works poi jones wants meet xyz corp measuring poi short term performance metrics')
>>> output = nlp.annotate(text, properties={'annotators': 'tokenize,ssplit,pos,ner',  'outputFormat': 'json'})
>>> annotated_sent0 = output['sentences'][0]
>>> annotated_sent1 = output['sentences'][1]
>>> for token in annotated_sent0['tokens']:
...     print(token['word'], token['lemma'], token['pos'], token['ner'])
... 
John John NNP PERSON
Donk Donk NNP PERSON
works work VBZ O
POI POI NNP ORGANIZATION
Jones Jones NNP ORGANIZATION
wants want VBZ O
meet meet VB O
Xyz Xyz NNP ORGANIZATION
Corp Corp NNP ORGANIZATION
measuring measure VBG O
POI poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
. . . O

And for the lower-cased sentence, you get neither an NNP POS tag nor any NER tag:

>>> for token in annotated_sent1['tokens']:
...     print(token['word'], token['lemma'], token['pos'], token['ner'])
... 
john john NN O
donk donk JJ O
works work NNS O
poi poi VBP O
jones jone NNS O
wants want VBZ O
meet meet VB O
xyz xyz NN O
corp corp NN O
measuring measure VBG O
poi poi NN O
short short JJ O
term term NN O
performance performance NN O
metrics metric NNS O
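
Once the NER tags are present, the proper-noun phrases can be collected by grouping consecutive tokens that share a non-O tag. The following is only a minimal sketch: `extract_entities` is a hypothetical helper, not part of CoreNLP or NLTK, and the token list is copied by hand from the cased output above, keeping only the 'word' and 'ner' keys.

```python
from itertools import groupby

# Tokens shaped like the CoreNLP JSON output shown above.
# Only the two keys this sketch needs are kept.
tokens = [
    {'word': 'John', 'ner': 'PERSON'},
    {'word': 'Donk', 'ner': 'PERSON'},
    {'word': 'works', 'ner': 'O'},
    {'word': 'POI', 'ner': 'ORGANIZATION'},
    {'word': 'Jones', 'ner': 'ORGANIZATION'},
    {'word': 'wants', 'ner': 'O'},
    {'word': 'meet', 'ner': 'O'},
    {'word': 'Xyz', 'ner': 'ORGANIZATION'},
    {'word': 'Corp', 'ner': 'ORGANIZATION'},
]


def extract_entities(tokens):
    """Group consecutive tokens that share a non-'O' NER tag (hypothetical helper)."""
    entities = []
    for tag, group in groupby(tokens, key=lambda t: t['ner']):
        if tag != 'O':
            entities.append((' '.join(t['word'] for t in group), tag))
    return entities


entities = extract_entities(tokens)
print(entities)
```

Note that this simple grouping merges two distinct entities if they are adjacent and carry the same tag. On the lower-cased sentence it returns an empty list, because every token is tagged O.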

So the questions to ask in response to your question should be:

  • What is the ultimate aim of your NLP application?
  • Why is your input lower-cased? Was it your doing, or is that how the data was provided?

After answering those questions, you can decide what you really want to do with the NER tags, i.e.:

  • If the input is lower-cased because of how you structured your NLP tool chain, then:

    • DO NOT do that! Run the NER on the original text, without the distortions you've introduced. The NER model was trained on normally cased text, so it won't work well on anything else.
    • Also try not to mix NLP tools from different suites; they usually don't play nicely together, especially at the end of your NLP tool chain.
  • If the input is lower-cased because that's how the original data was, then:

  • If the input has erroneous casing, e.g. `Some big Some Small but not all are Proper Noun`, then:

    • Try the truecasing solution too.
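
The truecasing solution referred to above is not shown here. As a rough illustration of the idea only, the sketch below restores casing from a frequency table learned on properly cased text. Everything in it is an assumption made for the example: the helper names `build_case_model` and `truecase` are hypothetical, the two training sentences are toy data, and a real truecaser would use a far larger corpus and context-aware statistics.

```python
from collections import Counter, defaultdict


def build_case_model(cased_sentences):
    """Count the cased surface forms seen for each lower-cased word."""
    model = defaultdict(Counter)
    for sentence in cased_sentences:
        words = sentence.split()
        # Skip the sentence-initial word: its capital letter is uninformative.
        for word in words[1:]:
            model[word.lower()][word] += 1
    return model


def truecase(text, model):
    """Replace each word with its most frequent cased form, if one is known."""
    restored = []
    for word in text.split():
        forms = model.get(word.lower())
        restored.append(forms.most_common(1)[0][0] if forms else word)
    return ' '.join(restored)


# Toy training data, invented for this example.
training = [
    'Yesterday John Donk met Jones at Xyz Corp',
    'The report says John works at Xyz Corp',
]
model = build_case_model(training)
restored_text = truecase('john donk works at xyz corp', model)
print(restored_text)  # John Donk works at Xyz Corp
```

The restored text can then be passed to the NER step in place of the lower-cased input. Words never seen in the training text are left unchanged.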

First, you should not use built-in names as variable names in your program. Avoid using str as a variable name; use newstring or anything else instead.
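
To see why reusing str as a variable name causes trouble, here is a small sketch. The function name `shadowing_demo` is made up for the illustration, and the shadowing is kept inside the function so the built-in stays intact elsewhere.

```python
def shadowing_demo():
    str = 'john donk'  # the local name now hides the built-in str type
    try:
        return str(42)  # tries to call the string, not the built-in
    except TypeError as err:
        return 'TypeError: {}'.format(err)


message = shadowing_demo()
print(message)

# Outside the function the built-in is untouched.
print(str(42))
```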

In your update, you are passing each lower-cased word to the POS tagger as a string. The tag() method expects a list of tokens, so it iterates over the string and returns a POS tag for each character.

So I suggest passing a list rather than a bare word to the tag() method. The list will contain only one word at a time.

You can try it like this: w = stp.tag([wl]). w will then be a list holding one (word, POS) pair.

In this way you can tag a single word.

But in this case, the POS tag returned for john is still NN.
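
The difference between passing a string and passing a one-item list can be reproduced without a tagger installed. In this sketch, `toy_tag` is a hypothetical stand-in that labels whatever items it iterates over; it is not the NLTK tagger and assigns a dummy label.

```python
def toy_tag(tokens):
    """Hypothetical stand-in for a tagger: labels every item it iterates over."""
    return [(token, 'TAG') for token in tokens]


word = 'john'

# A bare string is iterated character by character.
per_character = toy_tag(word)
print(per_character)

# A one-item list is treated as a single token.
per_word = toy_tag([word])
print(per_word)
```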
