From features to words in Python ("reverse" bag of words)

Submitted by 给你一囗甜甜゛ on 2019-12-14 03:50:06

Question


Using sklearn, I've created a BOW with 200 features in Python, and extracting the features is easy. But how can I reverse it? That is, how do I go from a vector of 200 0's and 1's to the corresponding words? Since the vocabulary is a dictionary, and thus not ordered, I am not sure which word each element in the feature vector corresponds to. Also, if the first element in my 200-dimensional vector corresponds to the first word in the dictionary, how do I then extract a word from the dictionary by index?

The BOW is created this way:

vec = CountVectorizer(stop_words = sw, strip_accents="unicode", analyzer = "word", max_features = 200)
features = vec.fit_transform(data.loc[:,"description"]).todense()

Thus "features" is an (n, 200) matrix (n being the number of sentences).


Answer 1:


I'm not totally sure what you're going for, but it seems like you're just trying to figure out which column represents which word. For this, the vectorizer has the handy get_feature_names method (renamed get_feature_names_out in scikit-learn 1.0; the old name was removed in 1.2).

Let's take a look with the example corpus provided in the docs:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# Put into a dataframe
data = pd.DataFrame(corpus,columns=['description'])
# Take a look:
>>> data
                             description
0            This is the first document.
1  This document is the second document.
2             And this is the third one.
3            Is this the first document?

# Initialize CountVectorizer (you can put in your arguments, but for the sake of example, I'm keeping it simple):
vec = CountVectorizer()

# Fit it as you had before:
features = vec.fit_transform(data.loc[:,"description"]).todense()

>>> features
matrix([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 2, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

To see which column represents which word, use get_feature_names:

>>> vec.get_feature_names()
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
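On the worry in the question that the vocabulary is an unordered dictionary: the fitted vectorizer's vocabulary_ attribute maps each word to its column index, so the order of the dictionary itself doesn't matter. A minimal sketch of looking a word up by index, assuming the same example corpus (the name index_to_word is just mine for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
vec.fit(corpus)

# vocabulary_ maps word -> column index; invert it to get index -> word
index_to_word = {idx: word for word, idx in vec.vocabulary_.items()}

first_word = index_to_word[0]   # word behind the first column
last_word = index_to_word[8]    # word behind the ninth column
print(first_word, last_word)
```

This gives the same column order as the list above, just reached through the dictionary instead.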

So your first column is and, second is document, and so on. For readability, you can stick this in a dataframe:

>>> pd.DataFrame(features, columns = vec.get_feature_names())
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
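If what you actually want is the "reverse" step itself, i.e. going from a vector back to the words it marks as present, CountVectorizer also has an inverse_transform method that returns the terms with nonzero entries for each row. Here is a rough sketch on the same corpus, along with a manual lookup that I'm assuming is equivalent for a single row (the variable names are mine). Note that word order and repeat counts are not recovered, only which words occur.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

vec = CountVectorizer()
features = vec.fit_transform(corpus)  # kept sparse here

# One array of words per row, containing the terms with nonzero counts
words_per_doc = vec.inverse_transform(features)

# Manual version for a single dense vector (third document)
row = features[2].toarray().ravel()
# Words ordered by their column index
names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
words_row2 = [names[i] for i in np.flatnonzero(row)]
print(words_row2)
```

The manual version is handy when you only have a plain 0/1 vector lying around rather than the original matrix.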


Source: https://stackoverflow.com/questions/52748426/from-featurers-to-words-python-reverse-bag-of-words
