Fine Tuning hyperparameters doesn't improve score of classifiers


Question


I am running into a problem where fine-tuning the hyperparameters with GridSearchCV doesn't really improve my classifiers. I figured the improvement should be bigger than that: the biggest improvement I've gotten for a classifier with my current code is around ±0.03. I have a dataset with eight columns and an unbalanced binary outcome. For scoring I use f1, and I use KFold with 10 splits. Could someone spot something that is off and that I should look into? Thank you!

I use the following code:

import numpy as np
from sklearn.model_selection import train_test_split, KFold, GridSearchCV, cross_val_score
from sklearn.metrics import make_scorer, f1_score

model_parameters = {
    "GaussianNB": {     
    },
    "DecisionTreeClassifier": {
        'min_samples_leaf': range(5, 9),
        'max_depth': [None, 0, 1, 2, 3, 4]
    },
    "KNeighborsClassifier": {
        'n_neighbors': range(1, 10),
        'weights': ["distance", "uniform"]
    },
    "SVM": {
        'kernel': ["poly"],
        'C': np.linspace(0, 15, 30)
    },
    "LogisticRegression": {
        'C': np.linspace(0, 15, 30),
        'penalty': ["l1", "l2", "elasticnet", "none"]
    }
}

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
n_splits = 10
scoring_method = make_scorer(lambda true_target, prediction: f1_score(true_target, prediction, average="micro"))
cv = KFold(n_splits=n_splits, random_state=random_state, shuffle=True)

for model_name, parameters in model_parameters.items():

    # Models is a dict with 5 classifiers
    model = models[model_name]
    grid_search = GridSearchCV(model, parameters, cv=cv, n_jobs=-1, scoring=scoring_method, verbose=False).fit(X_train, y_train)
    
    cvScore = cross_val_score(grid_search.best_estimator_, X_test, y_test, cv=cv, scoring='f1').mean()
    classDict[model_name] = cvScore

Answer 1:


If your classes are unbalanced, then when you do K-fold cross-validation you should preserve the proportion between the two classes in every fold.

Unbalanced folds can lead to very poor results.

Check the Stratified K-Folds cross-validator (StratifiedKFold):

Provides train/test indices to split data in train/test sets.

This cross-validation object is a variation of KFold that returns stratified folds. The folds are made by preserving the percentage of samples for each class.
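A minimal sketch of swapping StratifiedKFold in for the KFold used in the question (variable names such as n_splits and random_state are taken from that snippet):

from sklearn.model_selection import StratifiedKFold

# StratifiedKFold preserves the class ratio of y in every fold,
# which matters for an unbalanced binary target.
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)

# Pass this cv to GridSearchCV / cross_val_score exactly as before.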

There are also a lot of techniques for handling an unbalanced dataset. Depending on the context:

  • up-sample the minority class (for example with resample from sklearn; see the sketch after this list)
  • under-sample the majority class (a dedicated library such as imbalanced-learn also has useful tools for both under- and over-sampling)
  • handle the imbalance within your specific ML model
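Here is a minimal up-sampling sketch with sklearn.utils.resample, assuming X_train and y_train are pandas objects (y_train a named Series), that class 1 is the minority, and reusing random_state from the question; only ever resample the training data, never the test set:

import pandas as pd
from sklearn.utils import resample

# Recombine features and target so rows stay aligned while resampling.
train = pd.concat([X_train, y_train], axis=1)
majority = train[train[y_train.name] == 0]
minority = train[train[y_train.name] == 1]

# Draw minority rows with replacement until both classes have the same size.
minority_upsampled = resample(minority,
                              replace=True,
                              n_samples=len(majority),
                              random_state=random_state)

train_balanced = pd.concat([majority, minority_upsampled])
X_train_bal = train_balanced.drop(columns=y_train.name)
y_train_bal = train_balanced[y_train.name]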

For example, in SVC there is an argument you can set when you create the model, class_weight='balanced':

clf_3 = SVC(kernel='linear', 
            class_weight='balanced', # penalize
            probability=True)

which penalizes errors on the minority class more heavily.

You can change your config accordingly:

"SVM": {
        'kernel': ["poly"],
        'C': np.linspace(0, 15, 30),
        'class_weight': 'balanced'

    }

For LogisticRegression you can instead set the weights explicitly, reflecting the proportion of your classes:

LogisticRegression(class_weight={0: 1, 1: 10})  # for a binary problem

and changing the grid-search dict like this:

"LogisticRegression": {
        'C': np.linspace(0, 15, 30),
        'penalty': ["l1", "l2", "elasticnet", "none"],
        'class_weight':{0:1, 1:10}
    }
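If you don't want to hard-code the weights, a small sketch (reusing y_train from the question) that derives them from the class proportions with sklearn's compute_class_weight:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 'balanced' assigns each class a weight inversely proportional to its frequency.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
weight_dict = dict(zip(classes, weights))

# Use it as the single candidate value in the grid:
# 'class_weight': [weight_dict]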

In any case, the best approach depends on the model you use. For a neural network, for example, you can change the loss function to penalize the minority class through a weighted calculation (the same idea as for the logistic regression).
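As an illustration only (the question doesn't use a neural network), in PyTorch a per-class weight can be passed straight to the loss; the weights play the same role as class_weight in LogisticRegression:

import torch
import torch.nn as nn

# Weight errors on class 1 (the minority) ten times more heavily,
# analogous to class_weight={0: 1, 1: 10}.
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 10.0]))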



Source: https://stackoverflow.com/questions/64542349/fine-tuning-hyperparameters-doesnt-improve-score-of-classifiers
