Problem with CountVectorizer from scikit-learn package

Submitted by 人走茶凉 on 2019-12-12 14:11:41

Question


I have a dataset of movie reviews. It has two columns: 'class' and 'reviews'. I have done most of the routine preprocessing, such as lowercasing, removing stop words, and removing punctuation. At the end of preprocessing, each review is a single string of words separated by spaces.
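Roughly, my preprocessing was along these lines (a simplified sketch, not my exact script; it assumes the NLTK English stopword list and that data already holds the raw reviews):

import string
from nltk.corpus import stopwords  # assumes the NLTK stopword corpus is downloaded

stop_words = set(stopwords.words('english'))

def preprocess(text):
    # lowercase, strip punctuation, drop stop words
    text = text.lower().translate(str.maketrans('', '', string.punctuation))
    return ' '.join(word for word in text.split() if word not in stop_words)

data['reviews'] = data['reviews'].apply(preprocess)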

I want to use CountVectorizer and then TF-IDF to create features from my dataset so I can do classification/text recognition with Random Forest. I looked at a few websites and tried to do what they did. This is my code:

data = pd.read_csv('updated-data ready.csv')
X = data.drop('class', axis = 1)
y = data['class']
vectorizer = CountVectorizer()
new_X = vectorizer.fit_transform(X)
tfidfconverter = TfidfTransformer()  
X1 = tfidfconverter.fit_transform(new_X)
print(X1)

But I get this output...

(0, 0)  1.0

which doesn't make sense at all. I played with some parameters and commented out the TF-IDF part. Here's my code:

data = pd.read_csv('updated-data ready.csv')
X = data.drop('class', axis = 1)
y = data['class']
vectorizer = CountVectorizer(analyzer = 'char_wb',  \
                         tokenizer = None, \
                         preprocessor = None, \
                         stop_words = None, \
                         max_features = 5000)

new_X = vectorizer.fit_transform(X)
print(new_X)

and this is my output:

(0, 4)  1
(0, 6)  1
(0, 2)  1
(0, 5)  1
(0, 1)  2
(0, 3)  1
(0, 0)  2

Am I missing something, or am I too much of a newbie to understand? What I understood, and what I want, is that after the transform I should get a new dataset with many features (the words and their frequencies) plus the label column. But what I am getting is far from that.

To repeat: all I want is to turn my dataset of reviews into a new dataset of numbers, with words as features, so that Random Forest or another classification algorithm can work with it.

Thanks.

Btw, these are the first five rows of my dataset:

   class                                            reviews
0      1                         da vinci code book awesome
1      1  first clive cussler ever read even books like ...
2      1                            liked da vinci code lot
3      1                            liked da vinci code lot
4      1            liked da vinci code ultimatly seem hold

Answer 1:


Suppose you happen to have a dataframe:

data
    class   reviews
0   1   da vinci code book aw...
1   1   first clive cussler ever read even books lik...
2   1   liked da vinci cod...
3   1   liked da vinci cod...
4   1   liked da vinci code ultimatly seem...

Separate into features and outcomes:

y = data['class']
X = data.drop('class', axis = 1)

Then, following your pipeline, you can prepare your data for any ML algorithm like this. Note that the vectorizer is fitted on X.reviews (the column of strings), not on the whole DataFrame X: iterating over a DataFrame yields its column names, which is why your original code built a vocabulary of a single token and printed only (0, 0) 1.0.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
vectorizer = CountVectorizer()
new_X = vectorizer.fit_transform(X.reviews)
new_X
<5x18 sparse matrix of type '<class 'numpy.int64'>'
    with 30 stored elements in Compressed Sparse Row format>

This new_X can be used in the rest of your pipeline as is, or converted to a dense matrix:

new_X.todense()
matrix([[1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1],
        [0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1]],
       dtype=int64)

Rows of this matrix correspond to rows of the original reviews column, and the columns hold word counts. If you want to know which column refers to which word, you can do:

vectorizer.vocabulary_
{'da': 6,
 'vinci': 17,
 'code': 4,
 'book': 1,
 'awesome': 0,
 'first': 9,
 'clive': 3,
 'cussler': 5,
....

where the key is a word and the value is its column index in the matrix above (you can see that the column indices follow the ordered vocabulary, with 'awesome' occupying column 0 and so on).
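If you need the reverse mapping (column index to word), for example to label features later, you can simply invert that dictionary (index_to_word is just a name I picked for this sketch):

# Invert the vocabulary: column index -> word
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}
print(index_to_word[0])   # 'awesome'
print(index_to_word[6])   # 'da'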

You may further proceed with your pipeline like this:

tfidfconverter = TfidfTransformer()  
X1 = tfidfconverter.fit_transform(new_X)
X1
<5x18 sparse matrix of type '<class 'numpy.float64'>'
    with 30 stored elements in Compressed Sparse Row format>
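As a side note (not part of your original pipeline), scikit-learn also provides TfidfVectorizer, which combines CountVectorizer and TfidfTransformer into a single step, so the two stages above could be collapsed roughly like this:

from sklearn.feature_extraction.text import TfidfVectorizer

# Count and TF-IDF-weight in one pass; max_features is optional
tfidf = TfidfVectorizer(max_features=5000)
X1 = tfidf.fit_transform(X.reviews)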

Finally, you can feed your preprocessed data into RandomForest:

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X1, y)

This code runs without error on my notebook. Please, let us know if this solves your problem!
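If you also want a quick sanity check of the classifier, here is a minimal sketch; the test_size and random_state values are arbitrary choices, not something required by your data:

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hold out part of the data so the score is measured on unseen reviews
X_train, X_test, y_train, y_test = train_test_split(
    X1, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))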



Source: https://stackoverflow.com/questions/54176657/problem-with-countvectorizer-from-scikit-learn-package
