Problem with CountVectorizer from the scikit-learn package
Question: I have a dataset of movie reviews. It has two columns: 'class' and 'reviews'. I have done most of the routine preprocessing: lowercasing the text, removing stop words, and removing punctuation marks. After preprocessing, each review is a string of words separated by spaces. I want to use CountVectorizer and then TF-IDF to create features from my dataset so I can do classification/text recognition with Random Forest. I looked into websites and i
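For reference, here is a minimal sketch of the setup described above: CountVectorizer, then TF-IDF weighting, then a Random Forest, chained in a scikit-learn Pipeline. The toy reviews, the labels, the step names, and the hyperparameters are all made up for illustration; they only stand in for the 'reviews' and 'class' columns and are not taken from the actual dataset.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the 'reviews' and 'class' columns.
# Each review is already preprocessed: lowercase words separated by spaces.
reviews = [
    "great movie loved acting",
    "terrible plot boring acting",
    "loved story great cast",
    "boring movie terrible cast",
]
classes = ["pos", "neg", "pos", "neg"]

# Step names and hyperparameters are illustrative assumptions.
pipeline = Pipeline([
    ("counts", CountVectorizer()),    # raw strings -> sparse term-count matrix
    ("tfidf", TfidfTransformer()),    # term counts -> TF-IDF weights
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# CountVectorizer expects an iterable of strings, one string per review.
pipeline.fit(reviews, classes)

# Inspect the learned vocabulary and the TF-IDF feature matrix.
vocabulary = sorted(pipeline.named_steps["counts"].vocabulary_)
features = pipeline[:-1].transform(reviews)
print(vocabulary)
print(features.shape)

# Predict the class of a new, already preprocessed review.
predictions = pipeline.predict(["great cast loved movie"])
print(predictions)
```

With a pandas DataFrame, the same call would presumably be `pipeline.fit(df['reviews'], df['class'])`, passing the column of strings itself rather than the whole DataFrame. `TfidfVectorizer` combines the first two steps into one, so it could replace them if the intermediate counts are not needed.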