Question
I am trying to build a classifier that, in addition to bag of words, uses features such as sentiment or a topic (an LDA result). I have a pandas DataFrame with the text and the label, and I would like to add a sentiment value (a number between -5 and 5) and the result of an LDA analysis (a string naming the topic of the sentence).
I have a working bag-of-words classifier that uses CountVectorizer from sklearn and performs the classification with MultinomialNB.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

df = pd.DataFrame.from_records(data=data, columns=names)
train, test = train_test_split(
    df,
    train_size=train_ratio,
    random_state=1337
)
train_df = pd.DataFrame(train, columns=names)
test_df = pd.DataFrame(test, columns=names)

# Learn the vocabulary on the training set only, then reuse it for the test set
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_df['text'])
test_matrix = vectorizer.transform(test_df['text'])

# Binary target: is the sentence labeled as a 'decision'?
positive_cases_train = (train_df['label'] == 'decision')
positive_cases_test = (test_df['label'] == 'decision')

classifier = MultinomialNB()
classifier.fit(train_matrix, positive_cases_train)
The question now is: how can I introduce the other features to my classifier in addition to the bag-of-words ones?
Thanks in advance, and if you need more information I am glad to provide it.
Edit: After adding the columns as suggested by @Guiem, a new question arose regarding the weight of the new feature. This edit relates to that new question:
The shape of my train matrix is (2554, 5286). The weird thing, though, is that the shape is the same with and without the sentiment column added (maybe the column is not added properly?).
If I print the matrix, I get the following output:
(0, 322) 0.0917594575712
(0, 544) 0.196910480455
(0, 556) 0.235630958238
(0, 706) 0.137241420774
(0, 1080) 0.211125349374
(0, 1404) 0.216326271935
(0, 1412) 0.191757369869
(0, 2175) 0.128800602511
(0, 2176) 0.271268708356
(0, 2371) 0.123979845513
(0, 2523) 0.406583720526
(0, 3328) 0.278476810585
(0, 3752) 0.203741786877
(0, 3847) 0.301505063552
(0, 4098) 0.213653538407
(0, 4664) 0.0753937554096
(0, 4676) 0.164498844366
(0, 4738) 0.0844966331512
(0, 4814) 0.251572721805
(0, 5013) 0.201686066537
(0, 5128) 0.21174469759
(0, 5135) 0.187485844479
(1, 291) 0.227264696182
(1, 322) 0.0718526940442
(1, 398) 0.118905396285
: :
(2553, 3165) 0.0985290985889
(2553, 3172) 0.134514497354
(2553, 3217) 0.0716087169489
(2553, 3241) 0.172404983302
(2553, 3342) 0.145912701013
(2553, 3498) 0.149172538211
(2553, 3772) 0.140598133976
(2553, 4308) 0.0704700896603
(2553, 4323) 0.0800039075449
(2553, 4505) 0.163830579067
(2553, 4663) 0.0513678549359
(2553, 4664) 0.0681930862174
(2553, 4738) 0.114639856277
(2553, 4855) 0.140598133976
(2553, 4942) 0.138370066422
(2553, 4967) 0.143088901589
(2553, 5001) 0.185244190321
(2553, 5008) 0.0876615764151
(2553, 5010) 0.108531807984
(2553, 5053) 0.136354534152
(2553, 5104) 0.0928665728295
(2553, 5148) 0.171292088292
(2553, 5152) 0.172404983302
(2553, 5191) 0.104762377866
(2553, 5265) 0.123712025565
I hope that helps a little; let me know if you want any other information.
Answer 1:
One option would be to add these two new features as columns to your CountVectorizer matrix.
Since you are not performing any tf-idf, your count matrix will be filled with integers, so you could encode your new columns as int values as well.
You might have to try several encodings, but you can start with something like the following (a sketch appears after this list):
- sentiment: shift [-5, ..., 5] to [0, ..., 10]
- topic string: just assign integers to the different topics ({'unicorns': 0, 'batman': 1, ...}); you can keep a dictionary structure to assign the integers and avoid repeating topics
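For illustration, here is a minimal sketch of both encodings; 'sentiment' and 'topic' are assumed column names that do not appear in the original question:

# Shift sentiment from [-5, 5] to [0, 10]; MultinomialNB expects
# non-negative feature values, so the shift keeps everything valid
train_df['sentiment_enc'] = train_df['sentiment'] + 5

# Map each topic string to a stable integer, reusing ids for repeated topics
topic_ids = {}
def topic_to_int(topic):
    # setdefault hands out the next unused integer the first time a topic appears
    return topic_ids.setdefault(topic, len(topic_ids))

train_df['topic_enc'] = train_df['topic'].apply(topic_to_int)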
And just in case you don't know how to add columns to your train_matrix:

import numpy as np

dense_matrix = train_matrix.todense()  # CountVectorizer returns a sparse matrix
# np.insert returns a new array rather than modifying in place, so keep the result
dense_matrix = np.insert(dense_matrix, dense_matrix.shape[1], [val1, ..., valN], axis=1)

Note that the new column [val1, ..., valN] needs to have the same length as the number of samples you are using.
Even though it won't strictly be a bag of words anymore (because not all columns represent word frequencies), just adding these two columns carries the extra information you want to include. And the naive Bayes classifier assumes that each feature contributes independently to the probability, so we are fine here.
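If densifying a large matrix is a concern, here is a minimal sketch of an alternative that is not part of the original answer: scipy.sparse.hstack appends the new column while keeping everything sparse (sentiment_enc is the hypothetical encoded column from the sketch above). Note that both np.insert and hstack return a new matrix; if the result is never reassigned, the shape stays the same, which could explain the unchanged (2554, 5286) shape reported in the edit above.

from scipy.sparse import hstack, csr_matrix

# Reshape the encoded sentiment into an (n_samples, 1) sparse column and append it
sentiment_col = csr_matrix(train_df['sentiment_enc'].values.reshape(-1, 1))
train_matrix = hstack([train_matrix, sentiment_col]).tocsr()
print(train_matrix.shape)  # the second dimension should now grow by one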
Update: better to use a 'one hot' encoder for the categorical feature (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). This prevents the odd behavior that comes from assigning arbitrary integer values to your new features (you can arguably still do that with sentiment, because on a sentiment scale from 0 to 10 you assume that a sample with sentiment 9 is closer to one with sentiment 10 than to one with sentiment 0). But with categorical features you had better one-hot encode. So let's say you have three topics: you can use the same column-adding technique, only now you add three columns instead of one, [topic1, topic2, topic3]. If a sample belongs to topic1, you encode that as [1, 0, 0]; if it is topic3, the representation is [0, 0, 1] (you mark with 1 the column that corresponds to the topic).
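For illustration, a minimal sketch of the one-hot step using pandas get_dummies rather than sklearn's OneHotEncoder, purely for brevity ('topic' is an assumed column name):

from scipy.sparse import hstack, csr_matrix
import pandas as pd

# One 0/1 column per distinct topic string
topic_dummies = pd.get_dummies(train_df['topic'], prefix='topic')
train_matrix = hstack([train_matrix, csr_matrix(topic_dummies.values.astype(int))]).tocsr()

# The test set must get the same columns in the same order as the training set
test_dummies = pd.get_dummies(test_df['topic'], prefix='topic')
test_dummies = test_dummies.reindex(columns=topic_dummies.columns, fill_value=0)
test_matrix = hstack([test_matrix, csr_matrix(test_dummies.values.astype(int))]).tocsr()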
Source: https://stackoverflow.com/questions/35254526/text-classifier-with-bag-of-words-and-additional-sentiment-feature-in-sklearn