Question
I'm trying to train an SVM classifier on a large number of items and classes, and it is getting really, really slow.
First of all, I extracted a feature set from my data (512 features per item) and put it into a NumPy array. There are 13k items in this array. It looks like this:
>> print(type(X_train))
<class 'numpy.ndarray'>
>> print(X_train)
[[ 0.01988654 -0.02607637  0.04691431 ...  0.11521499  0.03433102
   0.01791015]
 [-0.00058317  0.05720023  0.03854145 ...  0.07057668  0.09192026
   0.01479562]
 [ 0.01506544  0.05616265  0.01514515 ...  0.04981219  0.05810429
   0.00232013]
 ...
Also, there are ~4k different classes:
>> print(type(labels))
<class 'list'>
>> print(labels)
[0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, ... ]
And here is the classifier:
import pickle
from thundersvmScikit import SVC

FILENAME = 'dataset.pickle'

with open(FILENAME, 'rb') as infile:
    (X_train, labels) = pickle.load(infile)

clf = SVC(kernel='linear', probability=True)
clf.fit(X_train, labels)
After ~90 hours have passed (and I'm using a GPU implementation of the scikit-learn API in the form of thundersvm), the fit operation is still running. Taking into account that this is a pretty small dataset in my case, I definitely need something more efficient, but I don't seem to be having any success with that. For example, I've tried this kind of Keras model:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(input_dim=512, units=100, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(units=n_classes, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
model.fit(X_train, labels, epochs=500, batch_size=64, validation_split=0.1, shuffle=True)
I end up with pretty good accuracy during the training stage:
Epoch 500/500
11988/11988 [==============================] - 1s 111us/step - loss: 2.1398 - acc: 0.8972 - val_loss: 9.5077 - val_acc: 0.0000e+00
However, during actual testing, even on data that was present in the training dataset, I get extremely low accuracy; it basically predicts random classes:
Predictions (best probabilities):
0 class710015: 0.008
1 class715573: 0.007
2 class726619: 0.006
3 class726619: 0.010
4 class720439: 0.007
Accuracy: 0.000
Could you please point me in the right direction here? Should I adjust the SVM approach somehow, or should I switch to a custom Keras model for this type of problem? If so, what might be the problem with my model?
Thanks a lot.
Answer 1:
You should NOT use that SVC implementation if it relies on the scikit-learn implementation of multiclass SVC. The documentation states: "The multiclass support is handled according to a one-vs-one scheme." That means one classifier is trained for every pair of classes, i.e. K(K-1)/2 ≈ 8 million classifiers for your K ≈ 4k classes. You could use anything listed under "Inherently multiclass" in the scikit-learn multiclass documentation (https://scikit-learn.org/stable/modules/multiclass.html).
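For instance (a minimal sketch on my part, not code from the original answer; it reuses the pickled X_train and labels from the question), multinomial logistic regression is one of the "inherently multiclass" options and fits a single model over all classes:

import pickle
from sklearn.linear_model import LogisticRegression

# Load the same pickled features and labels as in the question.
with open('dataset.pickle', 'rb') as infile:
    (X_train, labels) = pickle.load(infile)

# Multinomial logistic regression handles all ~4k classes with a
# single model instead of one classifier per pair of classes.
clf = LogisticRegression(multi_class='multinomial', solver='saga', max_iter=1000)
clf.fit(X_train, labels)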
Your Keras implementation probably also needs another layer. I'm assuming the output layer has one neuron per class, in which case you'd want categorical crossentropy and a softmax activation, together with one-hot encoded labels.
I'm assuming right now that all of your examples only have one class label.
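Under that single-label assumption, here is a minimal sketch of the fix (my addition, not from the original answer; it reuses model, X_train, labels and n_classes from the question): one-hot encode the integer labels so they match the softmax/categorical-crossentropy setup.

from keras.utils import to_categorical

# One-hot encode the integer labels (e.g. [0, 0, 1, 2, ...]) so each
# target row matches the n_classes-wide softmax output.
y_onehot = to_categorical(labels, num_classes=n_classes)

# Alternatively, keep the integer labels and use
# loss='sparse_categorical_crossentropy' instead.
model.fit(X_train, y_onehot, epochs=500, batch_size=64,
          validation_split=0.1, shuffle=True)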
Answer 2:
SVM is most natural for binary classification. For multiclass problems, scikit-learn uses one-vs-one, combining O(K^2) binary classifiers (https://scikit-learn.org/stable/modules/svm.html), with K the number of classes. The exact count is K(K-1)/2, which for your K ≈ 4000 is roughly 8 million classifiers. This is why it is so slow.
You should either reduce the number of classes, or switch to other models such as neural networks or decision trees.
P.S.: scikit-learn also has a one-vs-rest approach for SVM (https://scikit-learn.org/stable/modules/multiclass.html), which trains only O(K) classifiers. You could also try that.
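A minimal sketch of that route (my assumption, not code from the original answer; it reuses X_train and labels from the question): LinearSVC is one-vs-rest by default, i.e. K classifiers instead of K(K-1)/2, and CalibratedClassifierCV can add the class probabilities that the question's probability=True provided.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import LinearSVC

# One-vs-rest by default: K binary classifiers for K classes
# instead of K*(K-1)/2 pairwise ones.
base = LinearSVC()

# LinearSVC has no predict_proba; wrapping it yields calibrated
# probabilities, similar to probability=True on SVC.
clf = CalibratedClassifierCV(base)
clf.fit(X_train, labels)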
Source: https://stackoverflow.com/questions/54522208/svm-is-very-slow-when-training-classifier-on-big-number-of-classes