text classifier with bag of words and additional sentiment feature in sklearn

Submitted by 。_饼干妹妹 on 2019-12-06 02:16:57

One option would be to just add these two new features to your CountVectorizer matrix as columns.

As you are not performing any tf-idf, your count matrix is going to be filled with integers so you could encode your new columns as int values.

You might have to try several encodings but you can start with something like:

  • sentiment [-5,...,5] transformed to [0,...,10]
  • topic of the sentence (a string): assign an integer to each distinct topic ({'unicorns':0, 'batman':1, ...}). Keep the mapping in a dictionary so that a topic you have already seen always gets the same integer.
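The two encodings above can be sketched in a few lines of plain Python. The sample values and the names `sentiments`, `topics` and `topic_to_id` are made up for illustration and are not from the question.

```python
# Hypothetical raw feature values, one per sentence.
sentiments = [-5, 0, 3, 5]
topics = ["unicorns", "batman", "unicorns", "superman"]

# Shift sentiment from [-5, 5] to [0, 10] so every value is a non-negative int.
sentiment_codes = [s + 5 for s in sentiments]

# Give each topic the next free integer the first time it is seen;
# a topic that is already in the dictionary keeps its existing integer.
topic_to_id = {}
topic_codes = [topic_to_id.setdefault(t, len(topic_to_id)) for t in topics]

print(sentiment_codes)
print(topic_codes)
print(topic_to_id)
```

At prediction time you would reuse the same `topic_to_id` dictionary, so that training and test samples share one mapping.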

And just in case you don't know how to add columns to your train_matrix:

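import numpy as np
dense_matrix = train_matrix.toarray()  # CountVectorizer returns a sparse matrix; toarray() gives a plain ndarray
# new_column holds one value per sample: [val1, ..., valN]
dense_matrix = np.insert(dense_matrix, dense_matrix.shape[1], new_column, axis=1)  # np.insert returns a new array, so keep the result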

Note that the column [val1,...,valN] needs to have the same length as the number of samples you are using.
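If the vocabulary is large, converting to a dense matrix may use a lot of memory. A possible alternative (not part of the original answer) is to keep everything sparse and append the column with scipy.sparse.hstack. The documents and sentiment values below are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training sentences and their already-shifted sentiment values.
docs = ["batman fights crime", "unicorns are magical", "batman likes unicorns"]
sentiment_codes = [7, 9, 6]  # one value per sample, same order as docs

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(docs)  # sparse count matrix

# Turn the new feature into an (n_samples, 1) sparse column and append it.
extra_column = csr_matrix(np.array(sentiment_codes).reshape(-1, 1))
train_matrix_ext = hstack([train_matrix, extra_column]).tocsr()

print(train_matrix.shape, train_matrix_ext.shape)
```

The extended matrix has one more column than the original and the same number of rows.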

Even though it won't strictly be a Bag of Words anymore (because not all columns represent word frequencies), adding these two columns supplies the extra information you want to include. A naive Bayes classifier treats each feature as contributing independently to the probability, so the extra columns fit the model's assumptions.
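As a rough illustration, here is a sketch that fits a naive Bayes model on a matrix mixing word counts with a sentiment column. It assumes MultinomialNB is the classifier in use, which the answer does not state, and all the numbers are invented. It also shows one practical reason for shifting sentiment to [0, 10]: MultinomialNB refuses negative feature values.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical data: three word-count columns plus one sentiment column in [0, 10].
X = np.array([
    [2, 0, 1, 9],
    [0, 3, 0, 1],
    [1, 0, 2, 8],
    [0, 2, 1, 0],
])
y = np.array([1, 0, 1, 0])  # made-up class labels

clf = MultinomialNB()
clf.fit(X, y)
predictions = clf.predict(X)
print(predictions)

# Putting an unshifted sentiment value (e.g. -4) into the matrix makes fit() fail.
X_negative = X.copy()
X_negative[0, 3] = -4
try:
    MultinomialNB().fit(X_negative, y)
    negative_rejected = False
except ValueError:
    negative_rejected = True
print(negative_rejected)
```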

Update: it is better to use a one-hot encoder for categorical features (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). This avoids the odd behavior that comes from assigning integer values to categories, since integers imply an order and a distance between topics that do not exist. Integer encoding can still be reasonable for sentiment, because on a scale from 0 to 10 you do assume that a sample with sentiment 9 is closer to one with sentiment 10 than to one with sentiment 0. For the topic, use the same column-adding technique, but add one column per topic instead of a single column. With 3 topics you add 3 columns [topic1, topic2, topic3]: a sample that belongs to topic1 is encoded as [1, 0, 0] and a sample that belongs to topic3 as [0, 0, 1] (the column that corresponds to the topic is marked with 1).
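A minimal sketch of the one-hot step with sklearn's OneHotEncoder, using made-up topic labels. The encoder returns a sparse matrix by default, which is converted to a dense array here only so it can be printed.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical topic labels, one per sample, shaped as a single column.
topics = np.array(["topic1", "topic3", "topic2", "topic1"]).reshape(-1, 1)

# handle_unknown="ignore" encodes topics unseen during fit as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
topic_columns = encoder.fit_transform(topics).toarray()

print(encoder.categories_[0])  # column order chosen by the encoder
print(topic_columns)           # one row per sample, one column per topic
```

The resulting columns can then be appended to the count matrix in the same way as the sentiment column.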
