Scikit-learn - feature reduction using RFECV and GridSearch. Where are the coefficients stored?

天命终不由人 2020-12-14 03:03

I am using Scikit-learn RFECV to select the most significant features for a logistic regression, using cross-validation. Assume X is an [n, x] DataFrame of features and y represents the binary response variable.

2 Answers
  • 2020-12-14 03:06

    Essentially, you need to do a train-validation-test split of your sample data, where the train set is used to fit the model's normal parameters, the validation set is used to tune hyperparameters in the grid search, and the test set is used for performance evaluation. Here is one way to do this.

    from sklearn.datasets import make_classification
    from sklearn.pipeline import make_pipeline
    # note: in scikit-learn >= 0.18 the next two imports live in sklearn.model_selection
    from sklearn.grid_search import GridSearchCV
    from sklearn.cross_validation import StratifiedKFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    import numpy as np  # needed for np.arange and np.unique below
    import pandas as pd
    
    
    # simulate some artificial data so that I can show you the result of each intermediate step
    # 1000 obs, X is 1000-by-100, 2 class labels with unbalanced weights
    X, y = make_classification(n_samples=1000, n_features=100, n_informative=5, n_classes=2, weights=[0.1, 0.9])
    
    X.shape
    
    Out[78]: (1000, 100)
    
    y.shape
    
    Out[79]: (1000,)
    
    # Nested Cross-Validation: this returns a train/test index iterator
    split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
    # to take a look at the split, you will see it has 5 tuples
    list(split)
    # the 1st fold
    train_index = list(split)[0][0]
    
    Out[80]: array([  0,   1,   2, ..., 997, 998, 999])
    
    test_index = list(split)[0][1]
    
    Out[81]: array([  5,  12,  17, ..., 979, 982, 984])
    
    # let's play with just one iteration for now
    # your pipe
    pipe = make_pipeline(StandardScaler(), LogisticRegression())
    
    # set up params
    params_space = dict(logisticregression__C=10.0**np.arange(-5,1),
                        logisticregression__penalty=['l1', 'l2'],
                        logisticregression__class_weight=[None, 'auto'])
    
    # apply your grid search only on the train data, but with a further CV step,
    # so the original train set becomes [gscv_train, gscv_validation], where the latter is used to tune hyperparameters
    # all performance is still evaluated on a separate held-out 'test' set
    grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
    # fit the data on train set
    grid.fit(X[train_index], y[train_index])
    
    # to get the params of your estimator, call your gscv
    grid.best_estimator_
    Out[82]: 
    Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.10000000000000001, class_weight=None, dual=False,
              fit_intercept=True, intercept_scaling=1, max_iter=100,
              multi_class='ovr', penalty='l1', random_state=None,
              solver='liblinear', tol=0.0001, verbose=0))])
    
    
    # the performance in validation set
    grid.grid_scores_
    Out[83]: 
    [mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.87975, std: 0.01753, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87985, std: 0.01746, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.88033, std: 0.01707, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87975, std: 0.01732, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.88245, std: 0.01732, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87955, std: 0.01686, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.88746, std: 0.02318, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87990, std: 0.01634, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.94002, std: 0.02959, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.87419, std: 0.02174, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.93508, std: 0.03101, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.87091, std: 0.01860, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
     mean: 0.88013, std: 0.03246, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
     mean: 0.85247, std: 0.02712, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
     mean: 0.88904, std: 0.02906, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
     mean: 0.85197, std: 0.02097, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'}]
    
    
    # or the best score among them
    grid.best_score_
    Out[84]: 0.94002188482393367
    
    # now, after training the estimator, we predict on the test set
    y_pred = grid.predict(X[test_index])
    # since LogisticRegression is a probability-based model, we also get the probability for each obs
    y_pred_probs = grid.predict_proba(X[test_index])
    
    Out[87]: 
    array([[ 0.0632,  0.9368],
           [ 0.0236,  0.9764],
           [ 0.0227,  0.9773],
           ..., 
           [ 0.0108,  0.9892],
           [ 0.2903,  0.7097],
           [ 0.0113,  0.9887]])
    
    # to get the evaluation result
    print(classification_report(y[test_index], y_pred))
    
                 precision    recall  f1-score   support
    
              0       0.93      0.59      0.72        22
              1       0.95      0.99      0.97       179
    
    avg / total       0.95      0.95      0.95       201
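
    # bonus: since the grid was tuned on ROC AUC, you may also want the
    # held-out AUC; a minimal sketch using the probabilities above
    # (roc_auc_score takes the positive-class column)
    from sklearn.metrics import roc_auc_score
    roc_auc_score(y[test_index], y_pred_probs[:, 1])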
    
    
    
    # to put it all together with the nested cross-validation
    # generate a pandas DataFrame to store the prediction probabilities
    kfold_df = pd.DataFrame(0.0, index=np.arange(len(y)), columns=np.unique(y))
    report = []  # to store the classification reports
    
    split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
    
    for train_index, test_index in split:
    
        grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
    
        grid.fit(X[train_index], y[train_index])
    
        y_pred_probs = grid.predict_proba(X[test_index])
        kfold_df.iloc[test_index, :] = y_pred_probs
    
        y_pred = grid.predict(X[test_index])
        report.append(classification_report(y[test_index], y_pred))
    
    # your result
    print(kfold_df)
    
    Out[88]: 
              0       1
    0    0.1710  0.8290
    1    0.0083  0.9917
    2    0.2049  0.7951
    3    0.0038  0.9962
    4    0.0536  0.9464
    5    0.0632  0.9368
    6    0.1243  0.8757
    7    0.1150  0.8850
    8    0.0796  0.9204
    9    0.4096  0.5904
    ..      ...     ...
    990  0.0505  0.9495
    991  0.2128  0.7872
    992  0.0270  0.9730
    993  0.0434  0.9566
    994  0.8078  0.1922
    995  0.1452  0.8548
    996  0.1372  0.8628
    997  0.0127  0.9873
    998  0.0935  0.9065
    999  0.0065  0.9935
    
    [1000 rows x 2 columns]
    
    
    for r in report:
        print(r)
                 precision    recall  f1-score   support
    
              0       0.93      0.59      0.72        22
              1       0.95      0.99      0.97       179
    
    avg / total       0.95      0.95      0.95       201
    
                 precision    recall  f1-score   support
    
              0       0.86      0.55      0.67        22
              1       0.95      0.99      0.97       179
    
    avg / total       0.94      0.94      0.93       201
    
                 precision    recall  f1-score   support
    
              0       0.89      0.38      0.53        21
              1       0.93      0.99      0.96       179
    
    avg / total       0.93      0.93      0.92       200
    
                 precision    recall  f1-score   support
    
              0       0.88      0.33      0.48        21
              1       0.93      0.99      0.96       178
    
    avg / total       0.92      0.92      0.91       199
    
                 precision    recall  f1-score   support
    
              0       0.88      0.33      0.48        21
              1       0.93      0.99      0.96       178
    
    avg / total       0.92      0.92      0.91       199
    
  • 2020-12-14 03:09

    Is this the correct process for feature selection? This is one of many ways of doing feature selection. Recursive feature elimination is an automated approach to it; others are listed in the scikit-learn documentation. They have different pros and cons, and feature selection is usually best achieved by also involving common sense and trying models with different sets of features. RFE is a quick way of selecting a good set of features, but it does not necessarily give you the ultimately best one.

    By the way, you don't need to build your StratifiedKFold separately. If you just set the cv parameter to cv=3, both RFECV and GridSearchCV will automatically use StratifiedKFold if the y values are binary or multiclass, which I'm assuming is most likely the case since you are using LogisticRegression. You can also combine

    # Fit the features to the response variable
    rfecv.fit(X, y)
    
    # Put the best features into new df X_new
    X_new = rfecv.transform(X)
    

    into

    X_new = rfecv.fit_transform(X, y)
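
    For reference, a quick sketch of how to inspect what the fitted RFECV kept (these are standard attributes of a fitted RFECV; X is assumed to be the DataFrame from your question):

    rfecv.n_features_          # number of features RFECV decided to keep
    rfecv.support_             # boolean mask over the original columns
    X.columns[rfecv.support_]  # names of the selected features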
    

    Is this the correct process for hyper-parameter selection? GridSearchCV is basically an automated way of systematically trying a whole set of hyperparameter combinations and picking the best one according to some performance metric. Yes, it's a good way of finding well-suited parameters, and you can read the winner off the fitted object, as sketched below.
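
    For instance, after grid.fit(X_new, y) (a short sketch; these attributes exist on any fitted GridSearchCV):

    grid.best_params_  # the winning parameter combination
    grid.best_score_   # its mean cross-validated score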

    Is this the correct process for fitting? Yes, this is a valid way of fitting the model. When you call grid.fit(X_new, y), it builds a grid of LogisticRegression estimators (one per combination of parameters to try) and fits each of them. It keeps the one with the best performance under grid.best_estimator_, that estimator's parameters under grid.best_params_, and its performance score under grid.best_score_. Note that grid.fit returns the GridSearchCV object itself, not the best estimator. Remember that any new X values you later predict on must first be transformed with the fitted RFECV model, so you can actually add that step to the pipeline as well, as sketched below.
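
    A minimal sketch of that combined pipeline (the step names 'rfecv' and 'logistic' are illustrative, and the imports assume the sklearn.model_selection layout of scikit-learn >= 0.18):

    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import RFECV
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # RFECV acts as a transformer inside the pipeline, so any X passed to
    # predict() is reduced to the selected features automatically
    pipe = Pipeline([
        ('rfecv', RFECV(estimator=LogisticRegression(), cv=3)),
        ('logistic', LogisticRegression()),
    ])
    grid = GridSearchCV(pipe, {'logistic__C': [0.01, 0.1, 1.0]},
                        cv=3, scoring='roc_auc')
    grid.fit(X, y)
    y_new_pred = grid.predict(X_unseen)  # X_unseen: hypothetical new data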

    Where can I find the fitted coefficients for the selected features? The grid.best_estimator_ attribute is a LogisticRegression object with all this information: grid.best_estimator_.coef_ holds the coefficients and grid.best_estimator_.intercept_ the intercept, as shown below. Note that grid.best_estimator_ is only available when the refit parameter of GridSearchCV is set to True, but that is the default anyway.
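
    For example, continuing your setup (where grid wraps a plain LogisticRegression fit on X_new):

    best = grid.best_estimator_
    best.coef_       # shape (1, n_selected_features)
    best.intercept_
    # pair each coefficient with the column name RFECV selected
    dict(zip(X.columns[rfecv.support_], best.coef_[0]))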
