Question
I have built a small program that creates a classifier for a given dataset with scikit-learn. Now I want to try an example to see the classifier at work. For example, the clf has to detect "cats".
This is how I proceed:
I have 50 pictures of cats and 50 pictures of "non-cats".
- get descriptors for the data_set with the SIFT feature detector
- split the data into training set and test set (25 cat pictures + 25 non-cat pictures = training_set; the test_set likewise)
- get cluster centers with k-means from the training_set
- create histogram data of the training_set and test_set by using the cluster centers

Then I try this code from scikit-learn:
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_estimator_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() / 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print y_true
    print y_pred
    print(classification_report(y_true, y_pred))
    print()

    print clf.score(X_train, y_train)
    print "score"
    print clf.best_params_
    print "best_params"
    pred = clf.predict(X_test)
    print accuracy_score(y_test, pred)
    print "accuracy_score"
and I get this result:
# Tuning hyper-parameters for recall
()
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 0.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 0.].
average=average)
/usr/local/lib/python2.7/dist-packages/sklearn/metrics/metrics.py:1760: UserWarning: The sum of true positives and false positives are equal to zero for some labels. Precision is ill defined for those labels [ 1.]. The precision and recall are equal to zero for some labels. fbeta_score is ill defined for those labels [ 1.].
average=average)
Best parameters set found on development set:
()
SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=0.001, kernel=rbf, max_iter=-1, probability=False,
random_state=None, shrinking=True, tol=0.001, verbose=False)
()
Grid scores on development set:
()
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.01, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 0.10000000000000001, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 10.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 100.0, 'gamma': 0.0001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.001}
0.800 (+/-0.200) for {'kernel': 'rbf', 'C': 1000.0, 'gamma': 0.0001}
()
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.]
             precision    recall  f1-score   support

        0.0       1.00      0.04      0.08        25
        1.0       0.51      1.00      0.68        25

avg / total       0.76      0.52      0.38        50
()
0.52
score
{'kernel': 'rbf', 'C': 0.001, 'gamma': 0.001}
best_params
0.52
accuracy_score
It seems the clf says everything is a cat... but why?
Is the data_set too small to get a good result?
Edit: I'm using VLFeat to detect SIFT descriptors.
Functions:
def create_descriptor_data(data, ID):
    descriptor_list = []
    datas = numpy.genfromtxt(data, dtype='str')
    for p in datas:
        # create descriptors and save them to a file
        locs, desc = vlfeat_module.vlf_create_descriptors(p, str(ID) + '.key', ID)
        if len(desc) > 500:
            desc = desc[::len(desc) // 400]  # take between 400 - 800 descriptors
        descriptor_list.append(desc)
        ID += 1  # ID for filename
    return descriptor_list
# create k-means centers from all *.txt files in directory (data)
def create_center_data(data):
    #data = numpy.vstack(data)
    n_clusters = len(numpy.unique(data))
    kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=1)
    kmeans.fit(data)
    return kmeans, n_clusters
def create_histogram_data(kmeans, descs, n_clusters):
    histogram_list = []
    # load data from each file
    for desc in descs:
        length = len(desc)
        # create histogram from descriptors
        histogram = kmeans.predict(desc)
        histogram = numpy.bincount(histogram, minlength=n_clusters)  # minlength = k in k-means
        histogram = numpy.divide(histogram, length, dtype='float')
        histogram_list.append(histogram)
    histogram = numpy.vstack(histogram_list)
    return histogram
and the call:
X_desc_pos = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_pos.txt",0) # create desc from dataset_pos, 25 pics
X_desc_neg = lib.dataset_module.create_descriptor_data("./static/picture_set/dataset_neg.txt",51) # create desc from dataset_neg, 25 pics
X_train_pos, X_test_pos = train_test_split(X_desc_pos, test_size=0.5)
X_train_neg, X_test_neg = train_test_split(X_desc_neg, test_size=0.5)
x1 = numpy.vstack(X_train_pos)
x2 = numpy.vstack(X_train_neg)
kmeans, n_clusters = lib.dataset_module.create_center_data(numpy.vstack((x1,x2)))
X_train_pos = lib.dataset_module.create_histogram_data(kmeans, X_train_pos, n_clusters)
X_train_neg = lib.dataset_module.create_histogram_data(kmeans, X_train_neg, n_clusters)
X_train = numpy.vstack([X_train_pos, X_train_neg])
y_train = numpy.hstack([numpy.ones(len(X_train_pos)), numpy.zeros(len(X_train_neg))])
X_test_pos = lib.dataset_module.create_histogram_data(kmeans, X_test_pos, n_clusters)
X_test_neg = lib.dataset_module.create_histogram_data(kmeans, X_test_neg, n_clusters)
X_test = numpy.vstack([X_test_pos, X_test_neg])
y_test = numpy.hstack([numpy.ones(len(X_test_pos)), numpy.zeros(len(X_test_neg))])
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring=score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_estimator_)
    print()
    print("Grid scores on development set:")
    print()
    for params, mean_score, scores in clf.grid_scores_:
        print("%0.3f (+/-%0.03f) for %r"
              % (mean_score, scores.std() / 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print y_true
    print y_pred
    print(classification_report(y_true, y_pred))
    print()

    print clf.score(X_train, y_train)
    print "score"
    print clf.best_params_
    print "best_params"
    pred = clf.predict(X_test)
    print accuracy_score(y_test, pred)
    print "accuracy_score"
EDIT: Some changes after updating the parameter ranges and scoring again with "accuracy":
# Tuning hyper-parameters for accuracy
()
Best parameters set found on development set:
()
SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0, degree=3,
gamma=1.0, kernel=rbf, max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
()
Grid scores on development set:
()
...
()
Detailed classification report:
()
The model is trained on the full development set.
The scores are computed on the full evaluation set.
()
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[ 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 0. 1. 1. 1.
1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.
0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
             precision    recall  f1-score   support

        0.0       0.88      0.92      0.90        25
        1.0       0.92      0.88      0.90        25

avg / total       0.90      0.90      0.90        50
()
1.0
score
{'kernel': 'rbf', 'C': 1000.0, 'gamma': 1.0}
best_params
0.9
accuracy_score
but when testing it on a picture with
rslt = clf.predict(test_histogram)
it still says to a sofa: "you're a cat" :D
Answer 1:
There are many possible reasons for such behaviour:

- There is an error in the creation of the training/testing data [implementation error]
- A training set of 20 elements (25 vectors with 5-fold cross-validation leaves 20 for training) can be too small for good generalization [underfitting]
- The range of checked C and gamma parameters can be too narrow - these variables are highly data dependent, and your representation's values may require completely different C's and gamma's than those currently used [under-/overfitting]

My personal guess (as it is hard to reproduce the issue without the data) is the third option - bad C and gamma parameters for finding a good model.
EDIT
You should try much bigger ranges of values, e.g.

- C between 10^-5 and 10^15
- gamma between 10^-14 and 10^2
C = []
gamma = []
for i in range(21):
    C.append(10.0 ** (i - 5))
for i in range(17):
    gamma.append(10 ** (i - 14))
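For what it's worth, the same logarithmic grids can be built in one line each with numpy.logspace (a minimal sketch; it assumes numpy is imported as in the question's code):

C_range = numpy.logspace(-5, 15, num=21)      # 10^-5, 10^-4, ..., 10^15
gamma_range = numpy.logspace(-14, 2, num=17)  # 10^-14, 10^-13, ..., 10^2
tuned_parameters = [{'kernel': ['rbf'], 'gamma': gamma_range, 'C': C_range},
                    {'kernel': ['linear'], 'C': C_range}]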
EDIT2
Once the parameters' ranges are corrected, you should perform the actual "case study". Gather more images, analyze your data representation (is a histogram really enough for this task?), process your data (is it already normalized? maybe try some decorrelation?), and consider using simpler kernels - RBF can be very deceptive: on the one hand it can get great scores during training, but on the other it can fail completely during testing. This is a result of its capacity to overfit (for any consistent dataset an RBF-SVM can achieve a 100% score during training), so finding a balance between the model's power and its generalization ability is a hard problem. This is when the actual "machine learning journey" begins - have fun! A sketch of the normalization plus simpler-kernel idea follows.
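Here is a minimal sketch of that advice (it reuses X_train, y_train, X_test, y_test from the question; StandardScaler plus a linear kernel is just one reasonable combination to try, not the only option):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in newer versions

# scale every histogram dimension to zero mean / unit variance,
# then grid-search C for a linear SVM over a wide logarithmic range
pipe = Pipeline([('scale', StandardScaler()),
                 ('svm', SVC(kernel='linear'))])
param_grid = {'svm__C': [10.0 ** i for i in range(-5, 16)]}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))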
Answer 2:
It seems the clf says everything is a cat... but why?

It's a bit hard to tell from your pasted output, but it seems this is the second iteration of the loop over scores = ['precision', 'recall'], so you're optimizing for recall. That concurs with the classification report, which states that recall is 1.00 (perfect) for the positive class.
When is recall perfect? Well, when there are no false negatives, no cats staying undetected. The easy way to obtain perfect recall is therefore to predict "cat" for every input picture, regardless of whether it's a cat, and GridSearchCV
found a classifier that does exactly that.
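You can see this degenerate solution directly with scikit-learn's metrics (a toy illustration, not code from the question):

from sklearn.metrics import precision_score, recall_score
y_true = [1, 1, 1, 0, 0, 0]  # three cats, three non-cats
y_pred = [1, 1, 1, 1, 1, 1]  # always answer "cat"
print(recall_score(y_true, y_pred))     # 1.0 - no cat is ever missed
print(precision_score(y_true, y_pred))  # 0.5 - half of the "cats" are wrong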
A similar thing can happen when you optimize for precision: perfect precision can be achieved by never predicting "cat" since you'll have no false positives.
To avoid this situation, optimize for accuracy rather than precision or recall, or for Fᵦ if you have a situation with unbalanced classes.
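In code, that is a one-word change to the grid search from the question (reusing its tuned_parameters, X_train and y_train; passing scoring='f1' instead would optimize the F1 score):

# optimize for accuracy instead of precision or recall
clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5, scoring='accuracy')
clf.fit(X_train, y_train)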
Source: https://stackoverflow.com/questions/18210799/scikit-learn-sample-try-out-with-my-classifier-and-data