Python-Scikit. Training and testing data using SVM

Question

I am training and testing an SVM with scikit-learn. I train the SVM and save it as a pickle, then use that pickle to test my system. First, I read the training and testing data into the variables train_data and test_data, respectively.

After that, the code I am using for training is:

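import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Learn the vocabulary and IDF weights from the training data only
vectorizer = TfidfVectorizer(max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

# svm.SVC() uses the RBF kernel by default
classifier_rbf = svm.SVC()
classifier_rbf.fit(train_vectors, train_labels)
# Persist the fitted classifier (the third argument is the compression level)
joblib.dump(classifier_rbf, 'pickl/train_rbf_SVM.pkl', 1)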

When testing, I again read the training and testing data into the variables train_data and test_data, respectively. The code I am using for testing is:

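import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
from sklearn.feature_extraction.text import TfidfVectorizer

# Refit the vectorizer on the training data to rebuild the same vocabulary
vectorizer = TfidfVectorizer(max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)
# Load the classifier saved during training and predict
classifier_rbf = joblib.load('pickl/train_rbf_SVM.pkl')
prediction_rbf = classifier_rbf.predict(test_vectors)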

This code works fine and gives me the correct output. My question is: is it compulsory to read the training data whenever I want to do testing?

Thank you.

Answer 1:

In your case, yes, because you are not saving (pickling) the TfidfVectorizer. The test data must be transformed in exactly the same way as the training data for the predictions to be meaningful. So, if you do not want to read the training data again and again, pickle the TfidfVectorizer along with the estimator and unpickle it during testing.
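To see why the fitted vectorizer matters, here is a minimal sketch, assuming a few made-up toy sentences (train_docs, other_docs, and test_doc are hypothetical and not from the question). A TfidfVectorizer fitted on different text learns a different vocabulary, so the same test sentence ends up with a different number of columns, and a classifier trained on one layout cannot sensibly use the other.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, made up for illustration only
train_docs = ["good movie", "bad movie", "great plot"]
other_docs = ["terrible acting", "good acting"]
test_doc = ["good plot"]

# Each vectorizer assigns columns based on the words it saw during fit
fitted_on_train = TfidfVectorizer().fit(train_docs)
fitted_on_other = TfidfVectorizer().fit(other_docs)

vocab_train = sorted(fitted_on_train.vocabulary_)
vocab_other = sorted(fitted_on_other.vocabulary_)
print(vocab_train)
print(vocab_other)

# The same test sentence is encoded with a different number of features
shape_train = fitted_on_train.transform(test_doc).shape
shape_other = fitted_on_other.transform(test_doc).shape
print(shape_train, shape_other)
```

Only the vectorizer that was fitted alongside the classifier produces features in the layout the classifier expects, which is why it has to be either refitted on the same training data or saved and reloaded.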

You may also want to look at the Pipeline provided in scikit-learn, which combines data preprocessing and estimation into one object that you can pickle and unpickle easily, without having to worry about pickling and loading the various parts of the training separately.
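The Pipeline approach could look roughly like the sketch below. The toy data, the step names "tfidf" and "svm", and the temporary file path are assumptions made for illustration; in practice you would use your own train_data, train_labels, and a path such as the pickl/ directory from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, made up for illustration only
train_data = ["good movie", "great plot", "bad movie", "terrible plot"]
train_labels = ["pos", "pos", "neg", "neg"]
test_data = ["great movie", "terrible movie"]

# One object that holds both the vectorizer and the classifier
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(max_df=0.8, sublinear_tf=True, use_idf=True)),
    ("svm", SVC()),
])
text_clf.fit(train_data, train_labels)

# Save the whole fitted pipeline to a single file (hypothetical temp path)
model_path = os.path.join(tempfile.mkdtemp(), "text_clf.pkl")
joblib.dump(text_clf, model_path, compress=1)

# Later, at test time: load it and predict on raw text directly
loaded_clf = joblib.load(model_path)
predictions = loaded_clf.predict(test_data)
print(predictions)
```

Because the pipeline applies the fitted vectorizer internally, the test script only needs the raw test text and the single saved file.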

Edit - Added code

When training for the first time, add this line at the end of your code:

joblib.dump(vectorizer, 'pickl/train_vectorizer.pkl', 1)

Now, when testing, there is no need to load the training data. Just load the already fitted vectorizer:

classifier_rbf = joblib.load('pickl/train_rbf_SVM.pkl')
vectorizer = joblib.load('pickl/train_vectorizer.pkl')

test_vectors = vectorizer.transform(test_data)
prediction_rbf = classifier_rbf.predict(test_vectors)

Source: https://stackoverflow.com/questions/42029956/python-scikit-training-and-testing-data-using-svm

Tags

python

scikit-learn

svm