dimension mismatch error in CountVectorizer MultinomialNB

Posted by 有些话、适合烂在心里 on 2019-12-02 04:17:23

Question


Before posting this question, I should say that I have read more than 15 similar threads on this board, each with somewhat different recommendations, but none of them solved my problem.

I split my 'spam email' text data (originally in CSV format) into training and test sets, then used CountVectorizer and its fit_transform method to learn the vocabulary of the corpus and extract word-count features from the text. I then applied MultinomialNB() to learn from the training set and predict on the test set. Here is my code (simplified):

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# load data
# data contains two columns ('text', 'target')

spam_data = pd.read_csv('spam.csv')
spam_data['target'] = np.where(spam_data['target'] == 'spam', 1, 0)

# split data
X_train, X_test, y_train, y_test = train_test_split(spam_data['text'], spam_data['target'], random_state=0)

# fit vocabulary and extract word count features
cv = CountVectorizer()
X_traincv = cv.fit_transform(X_train)
X_testcv = cv.fit_transform(X_test)

# learn and predict using MultinomialNB
clfNB = MultinomialNB(alpha=0.1)
clfNB.fit(X_traincv, y_train)

# so far so good, but when I predict on X_testcv
y_pred = clfNB.predict(X_testcv)

# Python throws me an error: dimension mismatch

The suggestions I gleaned from previous threads are to (1) use only .transform() on X_test, (2) check that each row in the original spam data is a string (yes, they are), or (3) do nothing to X_test. None of them worked, and Python kept giving me the 'dimension mismatch' error. After struggling for 4 hours, I had to turn to Stack Overflow. I would truly appreciate it if anyone could tell me what is wrong with my code and how to get the dimensions right.

Thank you.

Btw, the original data entries look like this

                                         text   target
0 Go until jurong point, crazy.. Available only    0
1 Ok lar... Joking wif u oni...                    0
2 Free entry in 2 a wkly comp to win FA Cup fina   1
3 U dun say so early hor... U c already then say   0
4 Nah I don't think he goes to usf, he lives aro   0
5 FreeMsg Hey there darling it's been 3 week's n   1
6 WINNER!! As a valued network customer you have   1

Answer 1:


Your CountVectorizer has already been fitted on the training data, so for the test data you should call transform(), not fit_transform().

Calling fit_transform() again on the test data refits the vectorizer and produces columns based on the unique vocabulary of the test data, so the number of features no longer matches what the classifier was trained on. Fit once, on the training data only:

X_testcv = cv.transform(X_test)
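
To make the difference visible, here is a minimal sketch that compares the matrix shapes from both approaches. The toy corpus, labels, and variable names below are hypothetical stand-ins for the spam.csv data in the question, not the asker's actual data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the spam data (1 = spam, 0 = ham).
train_texts = [
    "free entry win cash prize now",
    "winner claim your free prize",
    "are we still meeting for lunch",
    "see you at home tonight",
]
train_labels = [1, 1, 0, 0]
test_texts = ["claim your free cash", "lunch at home tomorrow"]

# Fit the vocabulary once, on the training texts only.
cv = CountVectorizer()
X_train_counts = cv.fit_transform(train_texts)

# Mistake from the question: fitting again on the test texts builds a
# separate vocabulary, so the column count differs from training.
X_test_refit = CountVectorizer().fit_transform(test_texts)

# Fix from the answer: reuse the fitted vectorizer with transform().
# Words unseen during training (here "tomorrow") are simply ignored.
X_test_counts = cv.transform(test_texts)

print("train:", X_train_counts.shape)
print("test, refit:", X_test_refit.shape)
print("test, transform only:", X_test_counts.shape)

clf = MultinomialNB(alpha=0.1)
clf.fit(X_train_counts, train_labels)
predictions = clf.predict(X_test_counts)  # column counts match, so this works
print(predictions)
```

If the second number in the "train" and "test, transform only" shapes is the same, the classifier can predict on the test matrix. The "test, refit" shape will generally have a different column count, which is what triggers the dimension mismatch.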


Source: https://stackoverflow.com/questions/45804133/dimension-mismatch-error-in-countvectorizer-multinomialnb