How do I properly combine numerical features with text (bag of words) in scikit-learn?

让人想犯罪 __ submitted on 2019-12-03 06:52:58

You can weight the raw word counts with tf-idf:

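import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Widen the print width so each matrix row fits on one line
np.set_printoptions(linewidth=200)

corpus = [
  'This is the first document.',
  'This is the second second document.',
  'And the third one',
  'Is this the first document?',
]

# Build the bag-of-words count matrix (sparse, one row per document)
vectorizer = CountVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out().tolist()
print(words)
words_counts = X.toarray()
print(words_counts)

# Reweight the raw counts with tf-idf (rows are L2-normalized by default)
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(words_counts)
print(tfidf.toarray())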

The output is this:

# words
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

# words_counts
[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]

# tfidf transformation
[[ 0.          0.43877674  0.54197657  0.43877674  0.          0.          0.35872874  0.          0.43877674]
 [ 0.          0.27230147  0.          0.27230147  0.          0.85322574  0.22262429  0.          0.27230147]
 [ 0.55280532  0.          0.          0.          0.55280532  0.          0.28847675  0.55280532  0.        ]
 [ 0.          0.43877674  0.54197657  0.43877674  0.          0.          0.35872874  0.          0.43877674]]
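
As a side note, scikit-learn also provides TfidfVectorizer, which is documented as equivalent to CountVectorizer followed by TfidfTransformer. The sketch below, assuming default parameters for both routes, checks that the one-step and two-step versions give the same matrix for this corpus:

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]

# Two-step route: raw counts first, then tf-idf reweighting
counts = CountVectorizer().fit_transform(corpus)
two_step = TfidfTransformer().fit_transform(counts)

# One-step route: tokenize, count, and reweight in a single estimator
one_step = TfidfVectorizer().fit_transform(corpus)

# Both routes should agree up to floating-point error
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)
```

Either route leaves you with a sparse matrix that has one row per document, so the choice is mostly a matter of whether you need the raw counts as well.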

With this representation, you should be able to append your other binary features as extra columns and train an SVC on the combined matrix.
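
A minimal sketch of that merge step, assuming the extra features arrive as a dense array with one row per document. The `extra` values and `labels` below are made up purely for illustration, and scaling the numeric column is a design choice rather than a requirement:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one',
    'Is this the first document?',
]

# Hypothetical extra features: column 0 is a binary flag, column 1 is numeric
extra = np.array([[1, 0.5], [0, 3.0], [1, 1.2], [0, 0.1]])
# Hypothetical class labels, one per document
labels = np.array([0, 1, 1, 0])

# Text features: counts reweighted with tf-idf (sparse, shape 4 x 9)
counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)

# Put the extra columns on a scale comparable to the L2-normalized tf-idf rows
extra_scaled = StandardScaler().fit_transform(extra)

# Stack the text and extra columns side by side, keeping the result sparse
combined = hstack([tfidf, csr_matrix(extra_scaled)]).tocsr()
print(combined.shape)

# SVC accepts sparse input directly
clf = SVC(kernel='linear')
clf.fit(combined, labels)
predictions = clf.predict(combined)
```

Keeping the result sparse matters once the vocabulary grows, since a dense matrix with tens of thousands of word columns can quickly exhaust memory.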
