Dimension mismatch when I try to apply tf-idf to test set

Submitted by 巧了我就是萌 on 2020-12-15 04:24:48

Question


I am trying to apply a new pre-processing algorithm to my dataset, following this answer: Encoding text in ML classifier

What I have tried now is the following:

def test_tfidf(data, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = tfidf_vectorizer.transform(list_corpus)
    
    return X, list_labels

(Please refer to the link above for the full code.) When I try to apply these two functions to my dataset:

train_x, train_y, count_vectorizer  = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, ngrams = 1)

full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y), ignore_index = True)

I get this error:

---> 12 full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, ), ignore_index = True) 
---> 14     y_pred = clf.predict(X_test_naive)

ValueError: dimension mismatch

The function mentioned in the error is:

def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    
    clf = MultinomialNB() 
    clf.fit(X_train_naive, y_train_naive)
    y_pred = clf.predict(X_test_naive)
        
    return 

Any help in understanding what is wrong in my new definition and/or in applying tf-idf to my dataset (please refer to the relevant parts here: Encoding text in ML classifier) would be appreciated.

Update: I think this question/answer might also be useful in helping me figure out the issue: scikit-learn ValueError: dimension mismatch

If I replace test_x, test_y = test_tfidf(testing_set, ngrams = 1) with test_x, test_y = test_tfidf(undersample_train, ngrams = 1), no error is raised. However, I do not think this is right, as I am getting very high values (99% on all metrics).


Answer 1:


When using transformers (TfidfVectorizer in this case), you must use the same fitted object to transform both the train and test data. The transformer is typically fitted on the training data only, and then reused to transform the test data.
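A minimal, self-contained sketch of why two independently fitted vectorizers produce incompatible matrices (the toy documents below are illustrative, not the question's data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny illustrative corpora (hypothetical data, not from the question).
train_docs = ["the cat sat on the mat", "the dog barked"]
test_docs = ["the bird sang"]

# Wrong: fitting a second vectorizer on the test set learns a different
# vocabulary, so the resulting matrices have different widths.
train_vec = TfidfVectorizer().fit(train_docs)
test_vec = TfidfVectorizer().fit(test_docs)
X_train = train_vec.transform(train_docs)
X_test_bad = test_vec.transform(test_docs)
print(X_train.shape[1], X_test_bad.shape[1])  # different column counts

# Right: reuse the training-set vectorizer; test-set words that were
# never seen during fitting are simply ignored.
X_test = train_vec.transform(test_docs)
print(X_train.shape[1] == X_test.shape[1])  # True
```

A classifier fitted on `X_train` expects exactly `X_train.shape[1]` columns at prediction time, which is why the mismatched matrix triggers `ValueError: dimension mismatch`.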

The correct way to do this in your case:

def tfidf(data, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)
    
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, ngrams))
    tfidf_vectorizer.fit(df_temp['Text'])

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = tfidf_vectorizer.transform(list_corpus)
    
    return X, list_labels, tfidf_vectorizer


def test_tfidf(data, vectorizer, ngrams = 1):

    df_temp = data.copy(deep = True)
    df_temp = basic_preprocessing(df_temp)

    # No need to create a new TfidfVectorizer here!

    list_corpus = df_temp["Text"].tolist()
    list_labels = df_temp["Label"].tolist()

    X = vectorizer.transform(list_corpus)
    
    return X, list_labels

# this method is copied from the other SO question
def training_naive(X_train_naive, X_test_naive, y_train_naive, y_test_naive, preproc):
    
    clf = MultinomialNB() # Multinomial Naive Bayes
    clf.fit(X_train_naive, y_train_naive)

    res = pd.DataFrame(columns = ['Preprocessing', 'Model', 'Precision', 'Recall', 'F1-score', 'Accuracy'])
    
    y_pred = clf.predict(X_test_naive)
    
    # scikit-learn metrics expect y_true first, then y_pred
    f1 = f1_score(y_test_naive, y_pred, average = 'weighted')
    pres = precision_score(y_test_naive, y_pred, average = 'weighted')
    rec = recall_score(y_test_naive, y_pred, average = 'weighted')
    acc = accuracy_score(y_test_naive, y_pred)
    
    res = res.append({'Preprocessing': preproc, 'Model': 'Naive Bayes', 'Precision': pres, 
                     'Recall': rec, 'F1-score': f1, 'Accuracy': acc}, ignore_index = True)

    return res 

train_x, train_y, count_vectorizer  = tfidf(undersample_train, ngrams = 1)
testing_set = pd.concat([X_test, y_test], axis=1)
test_x, test_y = test_tfidf(testing_set, count_vectorizer, ngrams = 1)

full_result = full_result.append(training_naive(train_x, test_x, train_y, test_y, count_vectorizer), ignore_index = True)
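As a usage note, this fit-on-train / transform-on-test pattern is exactly what scikit-learn's Pipeline automates, which removes the risk of fitting a second vectorizer by accident. A minimal sketch with toy data (the texts and labels below are illustrative placeholders, not the question's dataset):

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins for the question's train/test split.
train_texts = ["good movie", "bad movie", "great film", "awful film"]
train_labels = [1, 0, 1, 0]
test_texts = ["good film"]

# fit() vectorizes the training texts and trains the classifier;
# predict() reuses the same fitted vectorizer on the new text, so
# the feature dimensions always match.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
pred = model.predict(test_texts)
```

With a pipeline there is no separate `test_tfidf` function to keep in sync: the fitted vectorizer travels with the model.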



Source: https://stackoverflow.com/questions/65270921/dimension-mismatch-when-i-try-to-apply-tf-idf-to-test-set
