Python-Scikit. Training and testing data using SVM

Submitted by 依然范特西╮ on 2019-12-25 06:55:35

Question


I am training and testing an SVM with scikit-learn. I train the SVM and save it as a pickle, then load that pickle to test my system. First, I read the training data and testing data into the variables train_data and test_data respectively.

The code I use for training is:

# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df = 0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

classifier_rbf = svm.SVC()
classifier_rbf.fit(train_vectors, train_labels)
joblib.dump(classifier_rbf, 'pickl/train_rbf_SVM.pkl',1)

When testing, I again read the training data and testing data into train_data and test_data respectively. The code I use for testing is:

# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df = 0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)
classifier_rbf = joblib.load('pickl/train_rbf_SVM.pkl')
prediction_rbf = classifier_rbf.predict(test_vectors)

This code works fine and gives me the correct output. My question is: is it necessary to read the training data every time I want to run testing?

Thank you.


Answer 1:


In your case, yes, because you are not saving (pickling) the TfidfVectorizer. The test data must be transformed in exactly the same way as the training data for the predictions to be meaningful. So, if you do not want to read the training data every time, pickle the fitted TfidfVectorizer along with the estimator and unpickle it during testing.
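To illustrate why the fitted vectorizer has to be reused, here is a minimal sketch on made-up toy documents (the document lists and variable names are assumptions for illustration, not part of the question). It compares the feature space produced by the already fitted vectorizer with the one produced by a fresh vectorizer fitted on the test documents alone:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented purely for illustration.
train_docs = ["the cat sat on the mat", "dogs chase cats", "birds fly south"]
test_docs = ["the dog sat", "cats fly"]

# Fit once on the training documents.
fitted = TfidfVectorizer()
train_vectors = fitted.fit_transform(train_docs)

# Reusing the fitted vectorizer keeps the column layout of the training matrix.
same_space = fitted.transform(test_docs)

# A fresh vectorizer fitted on the test documents learns its own vocabulary,
# so its columns no longer line up with what the classifier was trained on.
refit = TfidfVectorizer()
other_space = refit.fit_transform(test_docs)

print(train_vectors.shape[1], same_space.shape[1], other_space.shape[1])
```

The first two numbers match because both matrices come from the same fitted vocabulary. The third differs, which is why a classifier trained on train_vectors cannot be applied to vectors from a separately fitted vectorizer.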

You may also want to look at the Pipeline class in scikit-learn, which combines data preprocessing and estimation into one object that you can pickle and unpickle easily, without having to pickle and load the various parts of the training setup separately.
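A rough sketch of the Pipeline approach follows. The toy data, the step names "tfidf" and "svm", and the temporary file name are assumptions made for illustration; in the real setup you would use your own train_data, train_labels, and a path such as the pickl/ directory from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, invented purely for illustration.
train_data = ["good movie", "great film", "bad movie", "terrible film"]
train_labels = ["pos", "pos", "neg", "neg"]
test_data = ["great movie", "terrible movie"]

# The vectorizer and the classifier live in one object, so they are
# fitted, saved, and loaded together.
model = Pipeline([
    ("tfidf", TfidfVectorizer(max_df=0.8, sublinear_tf=True, use_idf=True)),
    ("svm", SVC()),
])
model.fit(train_data, train_labels)  # raw text in; vectorizing happens inside

# Hypothetical file location; a temporary directory keeps the sketch self-contained.
path = os.path.join(tempfile.mkdtemp(), "tfidf_svm_pipeline.pkl")
joblib.dump(model, path, compress=1)

# At test time only the pickle is needed, not the training data.
restored = joblib.load(path)
predictions = restored.predict(test_data)
print(predictions)
```

With this layout there is a single file to keep track of, and predict() accepts the raw test strings directly because the pipeline applies the fitted TF-IDF transform before calling the SVM.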

Edit - Added code

When training for the first time, add this line at the end of your code:

joblib.dump(vectorizer, 'pickl/train_vectorizer.pkl',1)

Now, when testing, there is no need to load the training data. Just load the already fitted vectorizer:

import joblib

classifier_rbf = joblib.load('pickl/train_rbf_SVM.pkl')
vectorizer = joblib.load('pickl/train_vectorizer.pkl')

test_vectors = vectorizer.transform(test_data)
prediction_rbf = classifier_rbf.predict(test_vectors)


Source: https://stackoverflow.com/questions/42029956/python-scikit-training-and-testing-data-using-svm
