Time Series Cross-Validation: Using sklearn, and Implementing a prophet-Style Cross-Validation Function

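Contents

1. Cross-validation methods specific to time series
2. TimeSeriesSplit, sklearn's time series cross-validation splitter
- Expanding-window cross-validation Example
- Fixed-window cross-validation Example
- Limitations of sklearn's TimeSeriesSplit:
Other packages & rolling my own
- Writing a prophet-style validation method myself
Ref: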

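I started on this for two reasons: first, I wanted to cross-validate time series forecasts; second, I wanted to compare the results against prophet's built-in cross-validation.
However, the time series splitter that ships with sklearn is not very convenient or flexible, so I decided to write my own.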

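1. Cross-validation methods specific to time series

Ordinary K-fold validation cannot be applied to time series, because its folds let the model train on observations that come after the ones it is tested on. For the full reasoning, see this article translated by 机器之心 (Synced):

一文简述如何使用嵌套交叉验证方法处理时序数据 (A brief guide to handling time series data with nested cross-validation)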
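To make the leakage concrete, here is a minimal sketch on a toy array of 12 time-ordered rows. The helper `leaks_future` is a hypothetical name introduced only for this illustration; it flags a split whose training set contains any row later than the earliest test row.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations


def leaks_future(train_idx, test_idx):
    """Hypothetical helper: True if training data comes after the first test row."""
    return bool(train_idx.max() > test_idx.min())


kfold_leaks = [leaks_future(tr, te) for tr, te in KFold(n_splits=4).split(X)]
tscv_leaks = [leaks_future(tr, te) for tr, te in TimeSeriesSplit(n_splits=4).split(X)]

print("KFold leaks future data:          ", kfold_leaks)
print("TimeSeriesSplit leaks future data:", tscv_leaks)
```

Even without shuffling, most K-fold splits train on rows that lie after the test fold, while every TimeSeriesSplit fold trains strictly on the past.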

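Time series are usually validated with expanding-window cross-validation, as shown in the figure below: each split trains on all data up to a cutoff and tests on the block right after it.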

2. TimeSeriesSplit, sklearn's time series cross-validation splitter

sklearn.model_selection.TimeSeriesSplit uses expanding-window cross-validation by default.

The official doc and sample are as follows:

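class sklearn.model_selection.TimeSeriesSplit(n_splits=5, *, max_train_size=None)

Parameters

n_splits : int, default=5
    Number of splits. Must be at least 2. Changed in version 0.22: n_splits default value changed from 3 to 5.
    (My note: how many train/test splits to produce; the series is cut into n_splits + 1 blocks.)

max_train_size : int, default=None
    Maximum size for a single training set.
    (My note: caps the size of each training set. If unset, every training set starts from the beginning of the series, which is the expanding window described above; if set, the training set becomes a fixed-size moving window. See the fixed-window example below.)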

Expanding-window cross-validation Example

>>> import numpy as np
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit()
>>> print(tscv)
TimeSeriesSplit(max_train_size=None, n_splits=5)
>>> for train_index, test_index in tscv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [0] TEST: [1]
TRAIN: [0 1] TEST: [2]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [0 1 2 3] TEST: [4]
TRAIN: [0 1 2 3 4] TEST: [5]
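In practice the splitter is usually handed to a scoring helper rather than looped over by hand. The following is a minimal sketch, assuming a synthetic linear-trend series and a plain LinearRegression as a stand-in model; neither comes from the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic data (an assumption for illustration): linear trend plus noise.
rng = np.random.default_rng(0)
t = np.arange(60)
X = t.reshape(-1, 1)                                # single feature: the time index
y = 0.5 * t + rng.normal(scale=1.0, size=t.size)

# Each of the 5 folds trains on everything before its test block.
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LinearRegression(), X, y, cv=tscv,
                         scoring="neg_mean_absolute_error")

mae_per_fold = -scores  # sklearn returns negated errors so that higher is better
print(mae_per_fold)
```

Any estimator with fit and predict can take the place of LinearRegression here.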

Fixed-window cross-validation Example

To use fixed-window cross-validation instead, set max_train_size.

>>> import numpy as np
>>> from sklearn.model_selection import TimeSeriesSplit
>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(max_train_size=3, n_splits=3)  # cap the training window at 3 samples
>>> for train_index, test_index in tscv.split(X):
...     print("TRAIN:", train_index, "TEST:", test_index)
...     X_train, X_test = X[train_index], X[test_index]
...     y_train, y_test = y[train_index], y[test_index]
TRAIN: [0 1 2] TEST: [3]
TRAIN: [1 2 3] TEST: [4]
TRAIN: [2 3 4] TEST: [5]
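One caveat worth checking: max_train_size is only an upper bound, so the window is truly fixed only once enough history is available. A small sketch on the same 6-row toy data, this time with n_splits=5:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)

# With 5 splits the first folds have fewer than 3 rows of history available.
tscv = TimeSeriesSplit(n_splits=5, max_train_size=3)
train_sizes = [len(train_index) for train_index, _ in tscv.split(X)]
print(train_sizes)  # [1, 2, 3, 3, 3]
```

The early folds still grow; the window only starts sliding once it has reached max_train_size.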

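Limitations of sklearn's TimeSeriesSplit:

Looking at the splits above, each test fold holds a single time step, so the splitter appears to only support using the past N time steps (N is configurable) to predict 1 future time step, not the "past N predicts future M" setup we want. (The test fold size actually defaults to n_samples // (n_splits + 1), and scikit-learn 0.24 added a test_size parameter to set it directly.)
It assumes the test period is independent of the training data before it.
Fix: put a gap between train and test, so that the independence assumption is no longer needed. My guess is that this relies on the Markov property. E.g. TimeSeriesSplit(gap=2)
See: 关于时间序列问题的交叉验证 (Cross-validation for time series problems)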
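As a hedged sketch of the "past N, future M, with a gap" setup: assuming scikit-learn 0.24 or later (where the test_size and gap parameters exist), the configuration below uses N = 4 training steps, M = 2 test steps and skips 1 step in between. The toy data and the specific numbers are illustrative choices only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

# Requires scikit-learn >= 0.24 for test_size and gap.
tscv = TimeSeriesSplit(n_splits=3, max_train_size=4, test_size=2, gap=1)

splits = list(tscv.split(X))
for train_index, test_index in splits:
    print("TRAIN:", train_index, "TEST:", test_index)
```

Each fold trains on 4 consecutive steps, drops the next one, and tests on the 2 steps after that.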

Other packages & rolling my own

The tscv package looks good, but it still cannot produce exactly the same train & test sets as prophet, so I wrote my own.

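Writing a prophet-style validation method myself

There is one article that looks very impressive but does not seem to be quite what I need: How to Convert a Time Series to a Supervised Learning Problem in Python

I wanted the same cross-validation scheme as prophet, so that other models can be compared with prophet on identical folds, so I wrote one myself.

For prophet Diagnostics, see the official prophet documentation, as well as the Chinese write-up 时间序列模型Prophet使用详细讲解 (A detailed guide to using the Prophet time series model).

Working from the basic concepts in prophet Diagnostics (cutoff / horizon / initial / period), I first wrote my own version. I did not end up using it, but it is interesting enough to include below.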
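Before the code, here is a small sketch of how those four terms fit together on a regular daily series. It is a simplified illustration that assumes no missing dates, not prophet's implementation: the last cutoff is the final date minus horizon, earlier cutoffs step back by period, and a cutoff is kept only while at least initial worth of history lies before it.

```python
import pandas as pd

ds = pd.date_range("2020-01-01", periods=30, freq="D")  # toy daily dates
horizon = pd.Timedelta("5 days")   # how far ahead each fold forecasts
period = pd.Timedelta("5 days")    # spacing between consecutive cutoffs
initial = pd.Timedelta("15 days")  # minimum training history

cutoffs = []
cutoff = ds.max() - horizon
while cutoff >= ds.min() + initial:
    cutoffs.append(cutoff)
    cutoff -= period
cutoffs = sorted(cutoffs)

for c in cutoffs:
    train = ds[ds <= c]                        # everything up to the cutoff
    test = ds[(ds > c) & (ds <= c + horizon)]  # the horizon right after it
    print(c.date(), "train rows:", len(train), "test rows:", len(test))
```

With these settings there are two cutoffs, 2020-01-20 and 2020-01-25, each followed by a 5-day test window.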

Approach and Python code:
Group the rows by index and build [train_test_index_list]; each element of the list is a dict: {'TRAIN': [train index list], 'TEST': [test index list]}.

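def cv_split(data, horizon, initial=None, period=None):
    # horizon, initial and period are all counted in rows, not in time units.
    if period is None:
        # At least 1, otherwise the loop below never terminates when horizon == 1.
        period = max(1, int(horizon * 0.5))
    if initial is None:
        initial = horizon * 3
    cv_list = []
    # Position of the last cutoff: everything before it is training data.
    train = data.shape[0] - horizon
    while train > initial:
        data_train = data.iloc[:train]
        data_test = data.iloc[train:train + horizon]
        # Insert at the front so the folds end up in chronological order.
        cv_list.insert(0, {'TRAIN': data_train.index, 'TEST': data_test.index})
        train -= period
    return cv_list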

However, the result differs from prophet's, because prophet splits by time rather than by row index.
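The difference shows up as soon as the dates are not evenly spaced. A minimal sketch, assuming a toy series of 10 business days (so weekends are missing): a horizon of 5 rows and a horizon of 5 calendar days select different test sets.

```python
import pandas as pd

# 10 business days starting Wed 2020-01-01; weekends are absent from the data.
ds = pd.Series(pd.bdate_range("2020-01-01", periods=10))

horizon_rows = 5                      # index-based horizon
horizon_time = pd.Timedelta("5 days")  # time-based horizon

index_test = ds.iloc[-horizon_rows:]   # last 5 rows
cutoff = ds.max() - horizon_time       # last date minus 5 calendar days
time_test = ds[ds > cutoff]            # rows inside the 5-day window

print(len(index_test), len(time_test))
```

The index-based split takes 5 rows, while the 5-day window contains only 3 of them.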
So I simply took prophet's open-source code and adapted it, as follows:

import logging

import pandas as pd

logger = logging.getLogger(__name__)


def generate_cutoffs(df, horizon, initial, period):
    """Generate cutoff dates

    Parameters
    ----------
    df: pd.DataFrame with historical data.
    horizon: pd.Timedelta forecast horizon.
    initial: pd.Timedelta window of the initial forecast period.
    period: pd.Timedelta simulated forecasts are done with this period.

    Returns
    -------
    list of pd.Timestamp
    """
    # Last cutoff is 'latest date in data - horizon' date
    cutoff = df['ds'].max() - horizon
    if cutoff < df['ds'].min():
        raise ValueError('Less data than horizon.')
    result = [cutoff]
    while result[-1] >= min(df['ds']) + initial:
        cutoff -= period
        # If data does not exist in data range (cutoff, cutoff + horizon]
        if not (((df['ds'] > cutoff) & (df['ds'] <= cutoff + horizon)).any()):
            # Next cutoff point is 'last date before cutoff in data - horizon'
            if cutoff > df['ds'].min():
                closest_date = df[df['ds'] <= cutoff].max()['ds']
                cutoff = closest_date - horizon
            # else no data left, leave cutoff as is, it will be dropped.
        result.append(cutoff)
    # The last cutoff appended falls before the initial window, so drop it.
    result = result[:-1]
    if len(result) == 0:
        raise ValueError(
            'Less data than horizon after initial window. '
            'Make horizon or initial shorter.'
        )
    logger.info('Making {} forecasts with cutoffs between {} and {}'.format(
        len(result), result[-1], result[0]
    ))
    return reversed(result)


def cv_split(df, horizon, period=None, initial=None):
    """modified by hlmandy"""
    cv_list = []
    horizon = pd.Timedelta(horizon)
    # Set period and initial (same defaults as prophet: 0.5 and 3 times horizon)
    period = 0.5 * horizon if period is None else pd.Timedelta(period)
    initial = 3 * horizon if initial is None else pd.Timedelta(initial)
    cutoffs = generate_cutoffs(df, horizon, initial, period)
    for cutoff in list(cutoffs):
        # TRAIN and TEST are boolean masks over df, not lists of index labels.
        train_index = (df['ds'] <= cutoff)
        test_index = (df['ds'] > cutoff) & (df['ds'] <= cutoff + horizon)
        cv_list.append({'CUTOFF': cutoff, 'TRAIN': train_index, 'TEST': test_index})
    return cv_list
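To show how folds in this shape can be consumed by a non-prophet model, here is a hedged usage sketch. To stay self-contained it builds cv_list by hand with two illustrative cutoffs instead of calling the function above, and it uses a naive "repeat the last training value" forecast as a stand-in model; the data and cutoffs are assumptions.

```python
import numpy as np
import pandas as pd

# Toy daily series in prophet's column layout ('ds' = date, 'y' = value).
df = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=30, freq="D"),
    "y": np.arange(30, dtype=float),
})
horizon = pd.Timedelta("5 days")

# Hand-written folds in the same dict shape (illustrative cutoffs).
cv_list = []
for cutoff in [pd.Timestamp("2020-01-20"), pd.Timestamp("2020-01-25")]:
    cv_list.append({
        "CUTOFF": cutoff,
        "TRAIN": df["ds"] <= cutoff,
        "TEST": (df["ds"] > cutoff) & (df["ds"] <= cutoff + horizon),
    })

maes = []
for fold in cv_list:
    train, test = df[fold["TRAIN"]], df[fold["TEST"]]
    # Naive baseline: repeat the last training value over the whole horizon.
    y_pred = np.full(len(test), train["y"].iloc[-1])
    maes.append(float(np.mean(np.abs(test["y"].to_numpy() - y_pred))))

print(maes)  # [3.0, 3.0]
```

Because TRAIN and TEST are boolean masks, df[fold["TRAIN"]] selects the rows directly, and the per-fold errors can then be set next to prophet's own diagnostics for the same cutoffs.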

Done! Now other forecasting models can be happily compared against prophet!

Ref:

sklearn.model_selection.TimeSeriesSplit
TSCV_ZHENGWENJIE
关于时间序列问题的交叉验证 (Cross-validation for time series problems)
How to Convert a Time Series to a Supervised Learning Problem in Python
prophet official documentation
时间序列模型Prophet使用详细讲解 (A detailed guide to using the Prophet time series model)