Question
I am trying to detect outliers in my dataset, and I found sklearn's Isolation Forest, but I can't understand how to work with it. I fit my training data to it and it gives me back a vector of -1 and 1 values.
Can anyone explain how it works and provide an example?
How can I know whether the outliers are 'real' outliers?
How do I tune the parameters?
Tuning Parameters?
Here is my code:

from sklearn.ensemble import IsolationForest

clf = IsolationForest(max_samples=10000, random_state=10)
clf.fit(x_train)
y_pred_train = clf.predict(x_train)
y_pred_test = clf.predict(x_test)

The output looks like:

[ 1  1  1 ... -1  1  1]
Answer 1:
It seems you have several questions; let me try to answer them one by one to the best of my knowledge.

How does it work? It exploits the fact that outliers in any data set are 'few and different', which makes it quite different from typical clustering-based or distance-based algorithms. At the top level, the logic is that an outlier takes fewer steps to 'isolate' than a 'normal' point does. Suppose you have a training data set X with n data points, each having m features. During training, IF builds isolation trees (binary search trees) over randomly selected features. Training has three tunable parameters: the number of isolation trees ('n_estimators' in sklearn's IsolationForest), the number of samples ('max_samples'), and the number of features to draw from X to train each base estimator ('max_features'). 'max_samples' is the number of random samples picked from the original data set to build each isolation tree.
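The training setup above can be sketched as follows; the synthetic data and the particular parameter values are assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# 200 "normal" points around the origin plus 10 scattered outliers, 2 features each
x_train = np.vstack([rng.normal(0, 1, (200, 2)),
                     rng.uniform(-6, 6, (10, 2))])

clf = IsolationForest(
    n_estimators=100,  # number of isolation trees
    max_samples=64,    # random samples drawn to build each tree
    max_features=2,    # features drawn from X for each base estimator
    random_state=10,
)
clf.fit(x_train)

y_pred_train = clf.predict(x_train)  # vector of 1 (inlier) / -1 (outlier)
```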
During the test phase, it computes the path length of the data point under test in each trained isolation tree and takes the average. The higher the average path length, the more normal the point, and vice versa. From the average path length it calculates the anomaly score; decision_function of sklearn's IsolationForest can be used to get this. For sklearn's IF, the lower the score, the more anomalous the sample. Based on the anomaly score, you can decide whether a given sample is anomalous by setting a proper value of 'contamination' in the IsolationForest object, i.e. the proportion of outliers expected in the data set. The default value of contamination is 0.1, which you can tune to adjust the threshold.
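The scoring step can be seen directly with decision_function; the data below is a made-up illustration (a point near the training cloud versus one far away from it):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
x_train = rng.normal(0, 1, (300, 2))  # dense "normal" cloud around the origin

clf = IsolationForest(contamination=0.1, random_state=0)
clf.fit(x_train)

x_test = np.array([[0.0, 0.0],   # typical point, deep inside the cloud
                   [8.0, 8.0]])  # far from the training data
scores = clf.decision_function(x_test)  # lower score => more anomalous
labels = clf.predict(x_test)            # 1 = inlier, -1 = outlier
```

The far-away point gets a lower score and is labelled -1, while the central point is labelled 1.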
Tuning parameters:
Training -> 1. n_estimators, 2. max_samples, 3. max_features
Testing -> 1. contamination
Answer 2:
-1 represents the outliers (according to the fitted model). See the IsolationForest example for a nice depiction of the process. If you have some prior knowledge, you can provide more parameters to get a more accurate fit. For example, if you know the contamination (the proportion of outliers in the data set), you can provide it as an input. By default it is assumed to be 0.1. See the description of the parameters here.
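A minimal sketch of passing a known contamination fraction, assuming synthetic data and an assumed value of 0.05; sklearn then sets the decision threshold so that roughly that fraction of the training points are labelled -1:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
x_train = rng.normal(size=(100, 2))

# tell the model we expect ~5% of the data to be outliers
clf = IsolationForest(contamination=0.05, random_state=1)
clf.fit(x_train)

# fraction of training points flagged as outliers (approximately 0.05)
frac = (clf.predict(x_train) == -1).mean()
```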
Answer 3:
Let me add something that I got stuck on when I read this question.
Most of the time you are using it for binary classification (I would assume), where you have a majority class 0 and an outlier class 1. For example, if you want to detect fraud, then your majority class is non-fraud (0) and fraud is (1).
Now if you have a train and test split:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

and you run:

clf = IsolationForest(max_samples=10000, random_state=10)
clf.fit(X_train)
y_pred_test = clf.predict(X_test)
The output for "normal" classifier scoring can be quite confusing. As already mentioned, y_pred_test
will consist of [-1, 1], where 1 is your majority class 0 and -1 is your minority class 1. So I recommend converting it (using numpy):

import numpy as np
y_pred_test = np.where(y_pred_test == 1, 0, 1)

Then you can use your normal scoring functions, etc.
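Putting the whole answer together as a runnable sketch; the fraud-style data set, class sizes, and parameter values below are all assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
# majority class 0 (non-fraud): 500 tight points; minority class 1 (fraud): 25 scattered points
X = np.vstack([rng.normal(0, 1, (500, 3)),
               rng.uniform(-8, 8, (25, 3))])
y = np.array([0] * 500 + [1] * 25)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

clf = IsolationForest(random_state=10)
clf.fit(X_train)
y_pred_test = clf.predict(X_test)  # values in {-1, 1}

# map 1 (inlier) -> 0 and -1 (outlier) -> 1 to match the class labels
y_pred_test = np.where(y_pred_test == 1, 0, 1)
```

After the conversion, y_pred_test is directly comparable to y_test, so the usual sklearn.metrics functions (accuracy_score, classification_report, ...) work as expected.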
Source: https://stackoverflow.com/questions/43063031/how-to-use-isolation-forest