Check skills of a classifier in scikit learn

Posted by 跟風遠走 on 2020-02-06 10:08:04

Question


After training a classifier, I tried passing a few sentences to it to check whether it classifies them correctly.

During that test the results do not look right.

I suppose some of my variables are not correct.

Explanation

I have a dataframe called df that looks like this:

                                              news        type
0   From: mathew <mathew@mantis.co.uk>\n Subject: ...   alt.atheism
1   From: mathew <mathew@mantis.co.uk>\n Subject: ...   alt.space
2   From: I3150101@dbstu1.rz.tu-bs.de (Benedikt Ro...   alt.tech
                                                            ...
#each row in the news column is a document
#each row in the type column is the category of that document

Preprocessing:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
from sklearn import metrics

vectorizer = TfidfVectorizer(stop_words='english')

vectors = vectorizer.fit_transform(df.news)
clf = SVC(C=10, gamma=1, kernel='rbf')

clf.fit(vectors, df.type)
vectors_test = vectorizer.transform(df_test.news)
pred = clf.predict(vectors_test)

Attempt to check how some sentences are classified

texts = ["The space shuttle is made in 2018", 
         "stars are shining",
         "galaxy"]
text_features = vectorizer.transform(texts)
predictions = clf.predict(text_features)
for text, predicted in zip(texts, predictions):
   print('"{}"'.format(text))
   print("  - Predicted as: '{}'".format(df.type[pred]))

   print("")

The problem is that it returns this:

"The space shuttle is made in 2018"
  - Predicted as: 'alt.atheism    NaN
alt.atheism    NaN
alt.atheism    NaN
alt.atheism    NaN
alt.atheism    NaN

What do you think?

EDIT

Example

This is roughly how it should look:

>>> docs_new = ['God is love', 'OpenGL on the GPU is fast']
>>> X_new_counts = count_vect.transform(docs_new)
>>> X_new_tfidf = tfidf_transformer.transform(X_new_counts)

>>> predicted = clf.predict(X_new_tfidf)

>>> for doc, category in zip(docs_new, predicted):
...     print('%r => %s' % (doc, twenty_train.target_names[category]))
...
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics

Answer 1:


As you mentioned in the comments, you have around 700 samples. To test how well your classifier works, you should always split your data into training and test sets, for example 500 samples for training and 200 for testing. You should then use only the training samples for training and only the test samples for testing. Test data crafted by hand, as you did, is not necessarily meaningful. sklearn comes with a handy function to split data into training and test sets:

#separate training and test data, 20% of your data is selected as test data
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, test_size=0.2)

vectors = vectorizer.fit_transform(df_train.news)
clf = SVC(C=10, gamma=1, kernel='rbf')
#train classifier
clf.fit(vectors, df_train.type)

#test classifier on the test set
vectors_test = vectorizer.transform(df_test.news)
pred = clf.predict(vectors_test)
#print precision, recall and f1-score per class
from sklearn.metrics import classification_report
print(classification_report(df_test.type, pred))

This will give you a hint of how good your classifier actually is. If you think it is not good enough, you should try another classifier, for example logistic regression. Or you could convert your data to all lower-case letters and see if this helps to improve your accuracy.
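A minimal sketch of swapping in logistic regression on the same TF-IDF pipeline. Since the question's df is not reproduced here, a tiny toy corpus stands in for it (same column names, news and type):

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# toy stand-in for the question's df (news text + type label)
df = pd.DataFrame({
    'news': ['the shuttle launched into orbit', 'god and faith',
             'rocket engines and distant stars', 'prayer and church',
             'the galaxy is vast', 'religion and belief'] * 5,
    'type': ['sci.space', 'alt.atheism'] * 15,
})
df_train, df_test = train_test_split(df, test_size=0.2, random_state=0)

vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(df_train.news)
X_test = vectorizer.transform(df_test.news)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, df_train.type)
print(accuracy_score(df_test.type, clf.predict(X_test)))
```

Note that TfidfVectorizer already lower-cases its input by default (lowercase=True), so lower-casing manually only matters if that option was turned off.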

Edit: You can also write your predictions back to your test dataframe:

df_test['Predicted'] = pred
df_test.head()

This will help you spot a pattern. Is everything actually predicted as alt.atheism, as your example suggests?
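One quick way to spot such a collapse is to tabulate the predicted labels against the true ones. A sketch, using toy stand-ins for df_test.type and pred from the snippet above:

```python
import pandas as pd

# toy stand-ins for df_test.type and pred
true_labels = pd.Series(['alt.atheism', 'sci.space', 'sci.space', 'alt.atheism'])
pred = ['alt.atheism', 'alt.atheism', 'alt.atheism', 'alt.atheism']

# how often each label is predicted; a single dominant row means collapse
print(pd.Series(pred).value_counts())

# confusion-style table: rows are true labels, columns are predictions
print(pd.crosstab(true_labels, pd.Series(pred),
                  rownames=['true'], colnames=['predicted']))
```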




Answer 2:


The data you train your classifier on is significantly different from the phrases you test it on. As you mentioned in your comment on my first answer, you get an accuracy of more than 90%, which is pretty good. But you taught your classifier to classify mailing-list items, which are long documents with e-mail addresses in them. Your phrases, such as "The space shuttle is made in 2018", are pretty short and do not contain e-mail addresses. It is possible that your classifier uses those e-mail addresses to classify the documents, which would explain the good results. You can test whether that is really the case by removing the e-mail addresses from the data before training.
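A sketch of that check, blanking out anything that looks like an address before vectorizing. The regex here is a rough heuristic (any token containing '@'), not a full e-mail grammar:

```python
import re

def strip_emails(text):
    # blank out any whitespace-delimited token containing '@'
    return re.sub(r'\S+@\S+', ' ', text)

doc = "From: mathew <mathew@mantis.co.uk>\nSubject: shuttle launch"
cleaned = strip_emails(doc)
print(cleaned)

# apply to the whole column before fitting the vectorizer:
# df['news'] = df['news'].apply(strip_emails)
```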



Source: https://stackoverflow.com/questions/58375486/check-skills-of-a-classifier-in-scikit-learn
