Explicitly specifying test/train sets in GridSearchCV

核能气质少年 提交于 2019-12-03 16:24:56
Vivek Kumar

As @MaxU said, its better to let the GridSearchCV handle the splits, but if you want to enforce the splitting as you have set in the question, then you can use the PredefinedSplit which does this very thing.

So you need to make the following changes to your code.

# Here X_test, y_test is the untouched data
# Validation data (X_val, y_val) is currently inside X_train, which will be split using PredefinedSplit inside GridSearchCV
X_train, X_test = np.array_split(X, [50])
y_train, y_test = np.array_split(y, [50])


# The indices which have the value -1 will be kept in train.
train_indices = np.full((35,), -1, dtype=int)

# The indices which have zero or positive values, will be kept in test
test_indices = np.full((15,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)

print(test_fold)
# OUTPUT: 
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

from sklearn.model_selection import PredefinedSplit
ps = PredefinedSplit(test_fold)

# Check how many splits will be done, based on test_fold
ps.get_n_splits()
# OUTPUT: 1

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)

# OUTPUT: 
('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
   34]), 
 'TEST:', array([35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]))


# And now, send this `ps` to cv param in GridSearchCV
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)

# Here, send the X_train and y_train
grid_search.fit(X_train, y_train)

The X_train, y_train sent to fit() will be split into train and test (val in your case) using the split we defined and hence, the Ridge will be trained on original data from indices [0:35] and tested on [35:50].

Hope this clears the working.

Have you tried TimeSeriesSplit?

It was made explicitly for splitting time series data.

tscv = TimeSeriesSplit(n_splits=3)
grid_search = GridSearchCV(clf, param_grid, cv=tscv.split(X))

In time series data, Kfold is not a right approach as kfold cv will shuffle you data and you will lose pattern within series. Here is an approach

import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import numpy as np
X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])
tscv = TimeSeriesSplit(n_splits=2)

model = xgb.XGBRegressor()
param_search = {'max_depth' : [3, 5]}

my_cv = TimeSeriesSplit(n_splits=2).split(X)
gsearch = GridSearchCV(estimator=model, cv=my_cv,
                        param_grid=param_search)
gsearch.fit(X, y)

reference - How do I use a TimeSeriesSplit with a GridSearchCV object to tune a model in scikit-learn?

I think the easiest way is to use integer number for cv - parameter:

grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=5)

beside this when using GridSearchCV you don't want to split your data into training and test data sets - GridSearchCV will do it for you automatically:

grid_search.fit(X_all_data, y_all_data)

this will use (Stratified)KFold method in order to split your data into training and test sets 5 times and will chose optimal parameters specified in the param_grid dictionary.

You may also want to use GroupShuffleSplit and specify groups in order to respect groups by splitting:

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!