Why does StandardScaler have different effects with different numbers of features?

Posted by 冷暖自知 on 2021-02-16 15:16:38

Question


I experimented with breast cancer data from scikit-learn.

  1. Use all features, without StandardScaler:

    # imports shared by all four snippets below
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import Perceptron
    from sklearn.metrics import accuracy_score
    from sklearn.preprocessing import StandardScaler

    cancer = datasets.load_breast_cancer()
    x = cancer.data
    y = cancer.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    
    pla = Perceptron().fit(x_train, y_train)
    y_pred = pla.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    
    • result 1 : 0.9473684210526315
  2. Use all features, with StandardScaler:

    cancer = datasets.load_breast_cancer()
    x = cancer.data
    y = cancer.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    
    sc = StandardScaler()
    sc.fit(x_train)
    x_train = sc.transform(x_train)
    x_test = sc.transform(x_test)
    
    pla = Perceptron().fit(x_train, y_train)
    y_pred = pla.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    
    • result 2 : 0.9736842105263158
  3. Use only two features, without StandardScaler:

    cancer = datasets.load_breast_cancer()
    x = cancer.data[:,[27,22]]
    y = cancer.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    
    pla = Perceptron().fit(x_train, y_train)
    y_pred = pla.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    
    • result 3 : 0.37719298245614036
  4. Use only two features, with StandardScaler:

    cancer = datasets.load_breast_cancer()
    x = cancer.data[:,[27,22]]
    y = cancer.target
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    
    sc = StandardScaler()
    sc.fit(x_train)
    x_train = sc.transform(x_train)
    x_test = sc.transform(x_test)
    
    pla = Perceptron().fit(x_train, y_train)
    y_pred = pla.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    
    • result 4 : 0.9824561403508771

As results 1 through 4 show, StandardScaler improves accuracy far more when training with fewer features.

So I am wondering: why does StandardScaler have different effects with different numbers of features?

PS. Here are the two features I chose:
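(The plot of the two features did not survive, but the scale gap between them is easy to see numerically. Since Perceptron applies a fixed learning rate to raw feature values, a feature that is ~1000x larger dominates the weight updates. A quick sketch to inspect the two chosen columns:)

```python
from sklearn import datasets
import numpy as np

cancer = datasets.load_breast_cancer()
x = cancer.data[:, [27, 22]]  # "worst concave points" and "worst perimeter"

# The two columns differ in scale by orders of magnitude, so without
# scaling the second feature dominates the perceptron's dot product.
print(cancer.feature_names[[27, 22]])
print("means:", x.mean(axis=0))
print("stds: ", x.std(axis=0))
```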


Answer 1:


TL;DR

Don't do feature selection unless you fully understand why you're doing it and how it may help your algorithm learn and generalize better. For starters, read http://www.feat.engineering/selection.html by Max Kuhn.

Full answer.

I suspect you tried to select the best feature subset and encountered a situation where an [arbitrary] subset performed better than the whole dataset. StandardScaler is not in question here, because it is considered a standard preprocessing step for your algorithm. So your real question should be: "Why does a subset of features perform better than the full dataset?"

Why is your selection arbitrary? Two reasons.

First, nobody has proven that the most linearly correlated features will improve your algorithm [or any other, for that matter]. Second, the best feature subset is generally not the same as the set of features most correlated with the target.

Let's see this with code.

A feature subset giving the best accuracy

Let's do a brute-force search.

from itertools import combinations
from tqdm import tqdm

acc_bench = 0.9736842105263158  # benchmark: accuracy on all features
res = {}
f = x_train.shape[1]
pcpt = Perceptron(n_jobs=-1)
# brute force: try every subset of 2 to 9 features, keep the best accuracy
for i in tqdm(range(2, 10)):
    features_list = combinations(range(f), i)
    for features in features_list:
        pcpt.fit(x_train[:, features], y_train)
        preds = pcpt.predict(x_test[:, features])
        acc = accuracy_score(y_test, preds)
        if acc > acc_bench:
            acc_bench = acc
            res["accuracy"] = acc_bench
            res["features"] = features
print(res)

{'accuracy': 1.0, 'features': (0, 15, 22)}

So you see that the features [0, 15, 22] give perfect accuracy on the validation set.

Do the best features have anything to do with correlation to the target?

Let's build the list of features ordered by degree of linear correlation with the target.

import numpy as np
import pandas as pd

features = pd.DataFrame(cancer.data, columns=cancer.feature_names)
target = pd.DataFrame(cancer.target, columns=['target'])
cancer_data = pd.concat([features, target], axis=1)
# feature indices sorted by |correlation with target|, most correlated first
features_list = np.argsort(np.abs(cancer_data.corr()['target'])[:-1].values)[::-1]
features_list

array([27, 22,  7, 20,  2, 23,  0,  3,  6, 26,  5, 25, 10, 12, 13, 21, 24,
       28,  1, 17,  4,  8, 29, 15, 16, 19, 14,  9, 11, 18])

You can see that the best feature subset found by brute force has nothing to do with correlation to the target.
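To make the mismatch concrete, here is a sketch that recomputes the correlation ranking from scratch and looks up where each brute-force winner (0, 15, 22) sits in it:

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# Pearson correlation of each feature with the target, ranked by magnitude
corr = np.array([np.corrcoef(cancer.data[:, j], cancer.target)[0, 1]
                 for j in range(cancer.data.shape[1])])
order = list(np.argsort(np.abs(corr))[::-1])

# rank 0 = most correlated with the target; feature 15 ranks near the bottom
for f in (0, 15, 22):
    print(cancer.feature_names[f], "-> correlation rank", order.index(f))
```

Feature 22 is the second most correlated, but feature 15 sits near the bottom of the ranking, so a correlation-based selector would never pick the brute-force-best subset.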

Can linear correlation explain the accuracy of the Perceptron?

Let's plot the number of features taken from the list above (starting with the 2 most correlated) against the resulting accuracy.

import matplotlib.pyplot as plt

res = dict()
for i in tqdm(range(2, 10)):
    features = features_list[:i]  # the i most target-correlated features
    pcpt.fit(x_train[:, features], y_train)
    preds = pcpt.predict(x_test[:, features])
    acc = accuracy_score(y_test, preds)
    res[i] = [acc]
pd.DataFrame(res).T.plot.bar()
plt.ylim([.9, 1])

Once again, the most linearly correlated features do not explain the Perceptron's accuracy.

Conclusion.

Don't select features before running any algorithm unless you're perfectly sure what you're doing and what the effects will be. Don't mix up different selection and learning algorithms, because different algorithms have different opinions of what is important and what is not. A feature unimportant for one algorithm may become important for another. This is especially true for linear vs. nonlinear algorithms.

If you want to improve the accuracy of your algorithm, do data cleaning or feature engineering instead.
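As a practical aside, scaling is easiest to get right inside a scikit-learn Pipeline, which guarantees the scaler is fit on the training split only and the same transform is reused at predict time. A minimal sketch reproducing experiment 2 above this way:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

cancer = datasets.load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42)

# StandardScaler is fit on x_train only, then applied automatically
# inside predict() -- no risk of leaking test-set statistics
model = make_pipeline(StandardScaler(), Perceptron()).fit(x_train, y_train)
acc = accuracy_score(y_test, model.predict(x_test))
print(acc)  # should reproduce result 2 above
```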



Source: https://stackoverflow.com/questions/64449113/why-does-the-standardscaler-have-different-effects-under-different-number-of-fea
