scikit-learn cross validation custom splits for time series data

Submitted by 瘦欲@ on 2019-12-02 22:26:30

You just have to pass an iterable of (train, test) index pairs to GridSearchCV through its cv parameter. The splits should have the following format:

[
 (split1_train_idxs, split1_test_idxs),
 (split2_train_idxs, split2_test_idxs),
 (split3_train_idxs, split3_test_idxs),
 ...
]

To get the idxs you can do something like this:

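# .groups maps each year to index labels; these equal row positions
# only when df has a default RangeIndex
groups = df.groupby(df.date.dt.year).groups
# {2012: [0, 1], 2013: [2], 2014: [3], 2015: [4, 5]}
# Convert each pandas Index to a plain list so that "+" below concatenates
sorted_groups = [list(value) for (key, value) in sorted(groups.items())]
# [[0, 1], [2], [3], [4, 5]]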

cv = [(sorted_groups[i] + sorted_groups[i+1], sorted_groups[i+2])
      for i in range(len(sorted_groups)-2)]

The resulting cv looks like this:

[([0, 1, 2], [3]),  # idxs of first split as (train, test) tuple
 ([2, 3], [4, 5])]  # idxs of second split as (train, test) tuple

Then you can do:

GridSearchCV(estimator, param_grid, cv=cv, ...)
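As an end-to-end illustration of the steps above, here is a minimal sketch. The toy DataFrame, the Ridge estimator, the alpha grid and the scoring choice are all assumptions made for the example and are not part of the original answer. It uses the groupby `.indices` attribute, which returns row positions, so the result does not depend on the DataFrame's index labels.

```python
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data: six samples spread over four years.
df = pd.DataFrame({
    "date": pd.to_datetime(["2012-03-01", "2012-09-01", "2013-05-01",
                            "2014-05-01", "2015-02-01", "2015-08-01"]),
    "x": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [0.1, 1.2, 1.9, 3.1, 4.2, 4.8],
})

# Row positions per year, converted to plain lists so "+" concatenates.
positions = df.groupby(df.date.dt.year).indices
sorted_groups = [idx.tolist() for _, idx in sorted(positions.items())]

# Train on two consecutive years, test on the year that follows.
cv = [(sorted_groups[i] + sorted_groups[i + 1], sorted_groups[i + 2])
      for i in range(len(sorted_groups) - 2)]

# MSE scoring is used because R^2 is undefined on a one-sample test fold.
search = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=cv,
                      scoring="neg_mean_squared_error")
search.fit(df[["x"]], df["y"])

print(cv)
print(search.best_params_)
```

Because cv is a plain list, it can be reused across several searches, unlike a generator, which is consumed after one pass.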

There is also the TimeSeriesSplit class in sklearn, which splits time-series data (samples observed at fixed time intervals) into train/test sets. Note that unlike standard cross-validation methods, successive training sets are supersets of those that come before them. In each split, test indices must be higher than before, so shuffling in the cross-validator is inappropriate.
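To make the behaviour of TimeSeriesSplit concrete, the following sketch prints the folds for six samples that are assumed to be already sorted by time. The array X and the choice n_splits=3 are illustrative assumptions only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical data: six samples, assumed to be in chronological order.
X = np.arange(12).reshape(6, 2)

tscv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]

for train, test in splits:
    # Every training index precedes every test index in the same fold.
    print("train:", train, "test:", test)
```

The tscv object can be passed directly as cv=tscv to GridSearchCV. Note that it splits by row position, not by calendar year, so a fold boundary can fall in the middle of a year.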

There's a standard sklearn approach to that, using GroupShuffleSplit. From the docs:

Provides randomized train/test indices to split data according to a third-party provided group. This group information can be used to encode arbitrary domain specific stratifications of the samples as integers.

For instance the groups could be the year of collection of the samples and thus allow for cross-validation against time-based splits.

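This is very convenient for your use case, with one caveat: the groups are assigned to train and test at random, so a test year may precede some of the training years. Here is how it looks: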

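from sklearn.model_selection import GroupShuffleSplit
cv = GroupShuffleSplit().split(X, y, groups)  # groups: one label per sample, e.g. df.date.dt.year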

And passing that to GridSearchCV like before:

GridSearchCV(estimator, param_grid, cv=cv, ...)
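A small sketch of what GroupShuffleSplit produces when the groups are years. The arrays, n_splits, test_size and random_state below are assumptions chosen for illustration. It prints which years end up on each side of every split, showing that a year is never divided between train and test.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical data: six samples with one year label per sample.
X = np.arange(12).reshape(6, 2)
y = np.arange(6)
years = np.array([2012, 2012, 2013, 2014, 2015, 2015])

# Hold out 25% of the groups (one year out of four) in each split.
gss = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
splits = list(gss.split(X, y, groups=years))

for train_idx, test_idx in splits:
    train_years = sorted(set(years[train_idx].tolist()))
    test_years = sorted(set(years[test_idx].tolist()))
    print("train years:", train_years, "test years:", test_years)
```

Materializing the splits with list(...) keeps them reusable. The generator returned by .split() can only be iterated once.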