How to fix the false positives rate of a linear SVM?

前端未结

I am an SVM newbie and this is my use case: I have a lot of unbalanced data to classify into two classes using a linear SVM. I need to fix the false positive rate at certain values.

2 Answers

醉梦人生

2021-02-20 18:01

In the sklearn version current at the time of writing, the predict method for LinearSVC looks like this:

```
def predict(self, X):
    """Predict class labels for samples in X.

    Parameters
    ----------
    X : {array-like, sparse matrix}, shape = [n_samples, n_features]
        Samples.

    Returns
    -------
    C : array, shape = [n_samples]
        Predicted class label per sample.
    """
    scores = self.decision_function(X)
    if len(scores.shape) == 1:
        indices = (scores > 0).astype(np.int)
    else:
        indices = scores.argmax(axis=1)
    return self.classes_[indices]
```

So, in addition to what mbatchkarov suggested, you can change the decisions made by the classifier (any classifier, really) by moving the threshold on the decision function at which a sample is assigned to one class or the other.

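```
from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

data = load_iris()

# remove a feature to make the problem harder
# remove the third class for simplicity
X = data.data[:100, 0:1]
y = data.target[:100]
# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]

# fit on the first half (assumed; mirrors the split used in the other answer)
clf = LinearSVC().fit(X[:50, :], y[:50])

# a boundary of 0 reproduces what clf.predict does
decision_boundary = 0
print(Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8)))
# author's run (counts vary with the shuffle): Counter({1: 27, 0: 23})

# a higher boundary assigns fewer samples to class 1
decision_boundary = 0.5
print(Counter((clf.decision_function(X[50:]) > decision_boundary).astype(np.int8)))
# author's run (counts vary with the shuffle): Counter({0: 39, 1: 11})
```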

You can tune the decision boundary to whatever value meets your needs, for example the one that gives the false positive rate you are after.
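To make that concrete for a fixed false positive rate, here is a minimal sketch. It assumes class 1 is the "positive" class and that you have held-out samples to tune on. The helper `threshold_for_fpr` and the even/odd train/held-out split are my own illustrative choices, not part of sklearn or of the code above. The idea is to take the decision scores of the held-out negatives and pick the threshold that lets at most the target fraction of them through.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC


def threshold_for_fpr(neg_scores, target_fpr):
    """Hypothetical helper: smallest observed score t such that the fraction
    of negative scores strictly greater than t is at most target_fpr."""
    s = np.sort(np.asarray(neg_scores, dtype=float))
    n = s.size
    # at most k negatives are allowed to score above the threshold
    k = int(np.floor(target_fpr * n))
    if k >= n:
        return -np.inf  # every negative may be a false positive
    # (k+1)-th largest score: at most k scores are strictly above it
    return s[n - k - 1]


data = load_iris()
X = data.data[:100, 0:1]   # one feature, two classes, as above
y = data.target[:100]

# deterministic split (an assumption, instead of shuffling):
# even rows for training, odd rows held out for tuning the threshold
X_train, y_train = X[::2], y[::2]
X_held, y_held = X[1::2], y[1::2]

clf = LinearSVC(random_state=0).fit(X_train, y_train)
scores = clf.decision_function(X_held)

target_fpr = 0.1
threshold = threshold_for_fpr(scores[y_held == 0], target_fpr)

pred = (scores > threshold).astype(int)
fpr = np.mean(pred[y_held == 0] == 1)
print("threshold:", threshold, "false positive rate on held-out data:", fpr)
```

The rate is only guaranteed on the samples used to choose the threshold; on new data it will fluctuate around the target, so it is worth checking on a separate test set.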

迷失自我

2021-02-20 18:11
The class_weight parameter allows you to push the false positive rate up or down. Let me use an everyday example to illustrate how this works. Suppose you own a night club, and you operate under two constraints:
1. You want as many people as possible to enter the club (paying customers).
2. You do not want any underage people in, as this will get you in trouble with the state.
On an average day, (say) only 5% of the people attempting to enter the club will be underage. You are faced with a choice: being lenient or being strict. The former will boost your profits by as much as 5%, but you run the risk of an expensive lawsuit. The latter will inevitably mean that some people who are just above the legal age will be denied entry, which will cost you money too. You want to adjust the relative cost of leniency vs strictness. Note: you cannot directly control how many underage people enter the club, but you can control how strict your bouncers are.

Here is a bit of Python that shows what happens as you change the relative importance.
```
from collections import Counter
import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC

data = load_iris()

# remove a feature to make the problem harder
# remove the third class for simplicity
X = data.data[:100, 0:1] 
y = data.target[:100] 
# shuffle data
indices = np.arange(y.shape[0])
np.random.shuffle(indices)
X = X[indices, :]
y = y[indices]

for i in range(1, 20):
    clf = LinearSVC(class_weight={0: 1, 1: i})
    clf = clf.fit(X[:50, :], y[:50])
    print(i, Counter(clf.predict(X[50:])))
    # print(clf.decision_function(X[50:]))
```
This outputs:
```
1 Counter({1: 22, 0: 28})
2 Counter({1: 31, 0: 19})
3 Counter({1: 39, 0: 11})
4 Counter({1: 43, 0: 7})
5 Counter({1: 43, 0: 7})
6 Counter({1: 44, 0: 6})
7 Counter({1: 44, 0: 6})
8 Counter({1: 44, 0: 6})
9 Counter({1: 47, 0: 3})
10 Counter({1: 47, 0: 3})
11 Counter({1: 47, 0: 3})
12 Counter({1: 47, 0: 3})
13 Counter({1: 47, 0: 3})
14 Counter({1: 47, 0: 3})
15 Counter({1: 47, 0: 3})
16 Counter({1: 47, 0: 3})
17 Counter({1: 48, 0: 2})
18 Counter({1: 48, 0: 2})
19 Counter({1: 48, 0: 2})
```
Note how the number of data points classified as 0 decreases as the relative weight of class 1 increases. Assuming you have the computational resources and time to train and evaluate 10 classifiers, you can plot the precision and recall of each one and get a figure like the one below (shamelessly stolen off the internet). You can then use that to decide what the right value of class_weight is for your use case.
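As a rough sketch of how those numbers could be collected, the loop below records the false positive rate, precision and recall for each weight. It treats class 1 as the positive class and uses a fixed even/odd split instead of the shuffle above; both are assumptions made here so the run is repeatable. The `results` list is just an illustrative container that you could feed to a plotting library.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.svm import LinearSVC

data = load_iris()
X = data.data[:100, 0:1]   # one feature, two classes, as above
y = data.target[:100]

# deterministic split (an assumption): even rows train, odd rows test
X_train, y_train = X[::2], y[::2]
X_test, y_test = X[1::2], y[1::2]

results = []
for w in range(1, 20):
    clf = LinearSVC(class_weight={0: 1, 1: w}, random_state=0)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)

    # rows are true classes, columns are predicted classes
    tn, fp, fn, tp = confusion_matrix(y_test, pred, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn)
    precision = precision_score(y_test, pred, zero_division=0)
    recall = recall_score(y_test, pred, zero_division=0)

    results.append((w, fpr, precision, recall))
    print(w, fpr, precision, recall)
```

Picking the weight whose false positive rate is closest to your target is then a matter of scanning `results`; because the rate changes in steps, you may not be able to hit the target exactly with weights alone.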