Question
I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier
What I have tried now is the following:
def test_tfidf(data, ngrams = 1):
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])
    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    X = tfidf_vectorizer.transform(list_corpus)
    return X, list_labels
(I would suggest referring to the link I mentioned above for all the code.) When I try to apply the latter two functions to my dataset:
train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, ngrams = 1)
full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y), ignore_index = True)
I get this error:
---> 12 full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, ), ignore_index = True)
---> 14 y_pred = clf.predict(X_test_naive)
ValueError: dimension mismatch
The function mentioned in the error is:
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    clf = MultinomialNB()
    clf.fit(X_train_naive, y_train_naive)
    y_pred = clf.predict(X_test_naive)
    return
Any help in understanding what is wrong in my new definition and/or in applying the tf-idf to my dataset (please refer here for the relevant parts: Encoding text in ML classifier) would be appreciated.
Update: I think this question/answer might also be useful for figuring out the issue: scikit-learn ValueError: dimension mismatch
If I replace test_x, test_y = test_tfidf(testing_set, ngrams = 1) with test_x, test_y = test_tfidf(undersample_train, ngrams = 1), it does not return any error. However, I do not think this is right, as I am getting very high values (99% on all statistics).
Answer 1:
When using transformers (TfidfVectorizer in this case), you must use the same object to transform both the train and the test data. The transformer is typically fitted on the training data only, and then re-used to transform the test data.
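To see why fitting a second vectorizer on the test set breaks, here is a minimal, self-contained sketch (the two toy corpora are made up purely for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["the cat sat", "the dog barked"]
test_docs = ["every bird sang loudly"]

# Fitted on the training corpus: the vocabulary contains the 5 training words
train_vec = TfidfVectorizer().fit(train_docs)
print(train_vec.transform(train_docs).shape)  # (2, 5)

# A second vectorizer fitted on the test corpus learns a different vocabulary,
# so its matrix has a different number of columns; a classifier trained on
# 5 features cannot predict on it -> ValueError: dimension mismatch
test_vec = TfidfVectorizer().fit(test_docs)
print(test_vec.transform(test_docs).shape)    # (1, 4)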
The correct way to do this in your case:
def tfidf(data, ngrams = 1):
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])
    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    X = tfidf_vectorizer.transform(list_corpus)
    return X, list_labels, tfidf_vectorizer
def test_tfidf(data, vectorizer, ngrams = 1):
    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    # No need to create a new TfidfVectorizer here!
    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()
    X = vectorizer.transform(list_corpus)
    return X, list_labels
# this method is copied from the other SO question
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    clf = MultinomialNB()  # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)
    res = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
    y_pred = clf.predict(X_test_naive)
    f1 = f1_score(y_pred, y_test_naive, average = 'weighted')
    pres = precision_score(y_pred, y_test_naive, average = 'weighted')
    rec = recall_score(y_pred, y_test_naive, average = 'weighted')
    acc = accuracy_score(y_pred, y_test_naive)
    res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres,
                      'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index = True)
    return res
train_x, train_y, count_vectorizer = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, count_vectorizer, ngrams = 1)
full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, count_vectorizer), ignore_index = True)
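As a side note (not part of the original answer), a scikit-learn Pipeline can keep the vectorizer and the classifier tied together, so the vocabulary fitted on the training texts is reused automatically at prediction time. A minimal sketch, assuming the same undersample_train / testing_set DataFrames and 'Text' / 'Label' columns as above, and skipping basic_preprocessing for brevity:
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Fitting the pipeline fits the vectorizer and the classifier in one call;
# predict() then transforms the test texts with the already-fitted vocabulary
pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), MultinomialNB())
pipe.fit(undersample_train['Text'], undersample_train['Label'])
y_pred = pipe.predict(testing_set['Text'])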
Source: https://stackoverflow.com/questions/65270921/dimension-mismatch-when-i-try-to-apply-tf-idf-to-test-set