n-grams with Naive Bayes classifier

試著忘記壹切 提交于 2019-11-28 21:43:56
user823743

A bigram feature vector follows the exact same principals as a unigram feature vector. So, just like the tutorial you mentioned you will have to check if a bigram feature is present in any of the documents you will use.

As for the bigram features and how to extract them, I have written the code bellow for it. You can simply adopt them to change the variable "tweets" in the tutorial.

import nltk
text = "Hi, I want to get the bigram list of this string"
for item in nltk.bigrams (text.split()): print ' '.join(item)

Instead of printing them you can simply append them to the "tweets" list and you are good to go! I hope this would be helpful enough. Otherwise, let me know if you still have problems.

Please note that in applications like sentiment analysis some researchers tend to tokenize the words and remove the punctuation and some others don't. From experince I know that if you don't remove punctuations, Naive bayes works almost the same, however an SVM would have a decreased accuracy rate. You might need to play around with this stuff and decide what works better on your dataset.

Edit 1:

There is a book named "Natural language processing with Python" which I can recommend it to you. It contains examples of bigrams as well as some exercises. However, I think you can even solve this case without it. The idea behind selecting bigrams a features is that we want to know the probabilty that word A would appear in our corpus followed by the word B. So, for example in the sentence

"I drive a truck"

the word unigram features would be each of those 4 words while the word bigram features would be:

["I drive", "drive a", "a truck"]

Now you want to use those 3 as your features. So the code function bellow puts all bigrams of a string in a list named bigramFeatureVector.

def bigramReturner (tweetString):
  tweetString = tweetString.lower()
  tweetString = removePunctuation (tweetString)
  bigramFeatureVector = []
  for item in nltk.bigrams(tweetString.split()):
      bigramFeatureVector.append(' '.join(item))
  return bigramFeatureVector

Note that you have to write your own removePunctuation function. What you get as output of the above function is the bigram feature vector. You will treat it exactly the same way the unigram feature vectors are treated in the tutorial you mentioned.

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!