Question
I am trying to use the NLTK toolkit to extract place, date and time from text messages. I just installed the toolkit on my machine and wrote this quick snippet to test it out:
import nltk
sentence = "Let's meet tomorrow at 9 pm"
tokens = nltk.word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)
print(nltk.ne_chunk(pos_tags, binary=True))
I assumed it would identify the date (tomorrow) and the time (9 pm), but surprisingly it recognized neither. Running the code above gives the following result:
(S (GPE Let/NNP) 's/POS meet/NN tomorrow/NN at/IN 9/CD pm/NN)
Can someone help me understand whether I am missing something, or whether NLTK is just not mature enough to tag times and dates properly? Thanks!
Answer 1:
The default NE chunker in nltk is a maximum entropy chunker trained on the ACE corpus (http://catalog.ldc.upenn.edu/LDC2005T09). It has not been trained to recognise dates and times, so you need to train your own classifier if you want to do that.
Have a look at http://mattshomepage.com/articles/2016/May/23/nltk_nec/ where the whole process is explained very well.
Also, there is a module called timex in nltk_contrib that might help with your needs: https://github.com/nltk/nltk_contrib/blob/master/nltk_contrib/timex.py
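To give a feel for the rule-based approach that a module like timex takes, here is a minimal sketch using plain regular expressions. It is not the nltk_contrib implementation: the patterns, the DATE/TIME labels and the `tag_temporal` helper are all made up for illustration and cover only a handful of expressions.

```python
import re

# Hypothetical, deliberately small patterns; a real tagger covers far more cases.
DATE_RE = re.compile(
    r"\b(?:today|tomorrow|yesterday|"
    r"(?:mon|tues|wednes|thurs|fri|satur|sun)day)\b",
    re.IGNORECASE,
)
TIME_RE = re.compile(
    r"\b\d{1,2}(?::\d{2})?\s*(?:am|pm)\b",
    re.IGNORECASE,
)


def tag_temporal(text):
    """Return (label, matched_text) pairs in order of appearance."""
    found = []
    for label, pattern in (("DATE", DATE_RE), ("TIME", TIME_RE)):
        for match in pattern.finditer(text):
            # Keep the start offset so results can be sorted by position.
            found.append((match.start(), label, match.group()))
    found.sort()
    return [(label, value) for _, label, value in found]


print(tag_temporal("Let's meet tomorrow at 9 pm"))
# [('DATE', 'tomorrow'), ('TIME', '9 pm')]
```

Patterns like these are brittle (they will miss "half past nine" or "next week"), which is why a trained or dedicated temporal tagger is usually the better choice for real messages.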
Answer 2:
Named entity recognition is not an easy problem, so do not expect any library to be 100% accurate. You shouldn't draw conclusions about NLTK's performance from a single sentence. Here's another example:
sentence = "I went to New York to meet John Smith"
I get
(S
I/PRP
went/VBD
to/TO
(NE New/NNP York/NNP)
to/TO
meet/VB
(NE John/NNP Smith/NNP))
As you can see, NLTK does very well here. However, I couldn't get NLTK to recognise today or tomorrow as temporal expressions. You can try Stanford SUTime, which is part of Stanford CoreNLP. I have used it before and it works quite well (it is in Java, though).
Answer 3:
If you wish to correctly identify the date or time in text messages, you can use Stanford's NER.
It uses a CRF (Conditional Random Fields) classifier. A CRF is a sequential classifier, so it takes the sequence of words into consideration.
The classification you get therefore depends on how the sentence is phrased.
If your input sentence had been "Let's meet on wednesday at 9am.", Stanford NER would have correctly identified wednesday as a date and 9am as a time.
NLTK supports Stanford NER. Try using it.
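A rough sketch of what that could look like, assuming an NLTK version that still ships `nltk.tag.StanfordNERTagger`, a working Java install, and a separately downloaded Stanford NER jar plus a model with DATE and TIME labels (for example the 7-class MUC model). The `model_path` and `jar_path` arguments and both helper functions are hypothetical, and the sample tagger output below is hand-written to mirror the claim above rather than captured from a real run.

```python
from itertools import groupby


def group_entities(tagged_tokens, outside_label="O"):
    """Merge consecutive tokens sharing a label into (label, phrase) pairs."""
    entities = []
    for label, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if label != outside_label:
            entities.append((label, " ".join(token for token, _ in group)))
    return entities


def tag_with_stanford(sentence, model_path, jar_path):
    """Tag a sentence with Stanford NER through NLTK (needs Java and the Stanford files)."""
    # Imported here so that group_entities can be used without NLTK installed.
    from nltk.tag import StanfordNERTagger

    tagger = StanfordNERTagger(model_path, jar_path)
    # Whitespace splitting keeps the sketch free of extra NLTK data downloads.
    return group_entities(tagger.tag(sentence.split()))


# Illustrative (token, label) pairs, written by hand for this sketch.
sample = [("Let", "O"), ("'s", "O"), ("meet", "O"), ("on", "O"),
          ("wednesday", "DATE"), ("at", "O"), ("9am", "TIME"), (".", "O")]
print(group_entities(sample))
# [('DATE', 'wednesday'), ('TIME', '9am')]
```

The tagger returns one label per token, so grouping adjacent tokens with the same label is what turns the raw output into usable entities such as multi-word dates.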
Source: https://stackoverflow.com/questions/19312573/nltk-for-named-entity-recognition