季节性ARIMA：时间序列预测

SARIMAX (seasonal autoregressive integrated moving average with exogenous regressor)是一种常见的时间序列预测方法，可以分为趋势部分和周期性部分；每个部分又可以分为自回归、差分和平滑部分。

趋势稳定性检测：Kwiatkowski–Phillips–Schmidt–Shin (KPSS) test

null-hypothesis: 时间序列趋势稳定。significance: 0.05. 选择KPSS和不是Dickey Fuller由于KPSS的null-hypothesis是趋势稳定，所以接受条件相对于Dickey Fuller更宽松，差分阶数更少。
y t = β ′ D t + μ t + u t y_t = \beta^{\prime} D_t + \mu_t + u_t yt=β′Dt+μt+ut
μ t = μ t − 1 + ϵ t , \mu_t = \mu_{t-1} + \epsilon_t, μt=μt−1+ϵt,
ϵ t ∼ W N ( 0 , σ ϵ 2 ) \epsilon_t \sim WN(0, \sigma^2_\epsilon) ϵt∼WN(0,σϵ2)

其中D为确定时间序列趋势/常数。u为随机漫步。epsilon为残差。

LM statistic:
K P S S = ( T − 2 ∑ t = 1 T S ^ t 2 ) / λ ^ 2 KPSS = (T^-2 \sum_{t=1}^{T} \hat S_t^2) / \hat\lambda^2 KPSS=(T−2∑t=1TS^t2)/λ^2
其中 S ^ t = ∑ j = 1 t u ^ j \hat S_t = \sum_{j=1}^{t} \hat u_j S^t=∑j=1tu^j
u ^ \hat{u} u^ 是yt 拟合Dt的残差，$ \hat \lambda^2$ 是var(ut)的预估值。问题归结为拉格朗日乘数法证明 σ ϵ 2 = 0 \sigma^2_\epsilon = 0 σϵ2=0

季节稳定性检测：Canova-Hansen方法
null-hypothesis: 时间序列季节稳定性。significance: 0.05

测试频率：给定待测最大频率m以下所有 2pi/m整数倍。

时间序列yi的拟合：
y i = μ + x i ′ β + S i + e i y_i = \mu + x^{\prime}_i \beta + S_i + e_i yi=μ+xi′β+Si+ei
S i = ∑ j = 1 m / 2 f j i ′ γ j S_i = \sum_{j=1}^{m/2} f^{\prime}_{ji} \gamma_j Si=∑j=1m/2fji′γj
其中
f j i ′ = ( c o s ( ( j / q ) π i ) , s i n ( ( j / q ) π i ) ) f^{\prime}_{ji} = (cos((j/q) \pi i), sin((j/q) \pi i)) fji′=(cos((j/q)πi),sin((j/q)πi))

LM statistic:
L M = 1 n 2 t r a c e ( ( Ω ^ f ) − 1 ∑ i = 1 n F ^ i F ^ i ′ ) LM = \frac{1}{n^2} trace( (\hat \Omega ^f)^-1 \sum_{i=1}^{n} \hat F_i \hat F^{\prime}_i) LM=n21trace((Ω^f)−1i=1∑nF^iF^i′)

其中
F ^ i = ∑ t = 1 i f t e ^ t \hat F_i = \sum_{t=1}^{i}f_t \hat e_t F^i=∑t=1ifte^t
Ω ^ f = ∑ k = − m m w ( k m ) 1 n ∑ i f i + k e ^ i + k f i ′ e ^ i \hat \Omega^f = \sum_{k=-m}^{m} w (\frac{k}{m}) \frac{1}{n} \sum_{i} f_{i+k} \hat e_{i+k} f^{\prime}_i \hat e_i Ω^f=∑k=−mmw(mk)n1∑ifi+ke^i+kfi′e^i

用根据待检测频率m计算f’ji向量：用sample data示例代码如下：

import numpy as np
import pandas as pddef seasonalDummy(tsArray, frequency):n = len(tsArray)m = frequency#if m == 1: tsArray.reshape([n, m])tt = np.arange(1, n + 1, 1)mat = np.zeros([n, 2 * m], dtype = float)for i in np.arange(0, m):mat[:, 2 * i] = np.cos(2.0 * np.pi * (i + 1) * tt / m)mat[:, 2 * i + 1] = np.sin(2.0 * np.pi * (i + 1) *tt / m)return mat[:, 0:(m-1)]tsArray = np.array([  4.00195672,   4.99944175,   5.99861146,   7.00000213,8.00124207,   9.00059539,  10.00029542,  10.99871969,11.99728933,  12.99965231,  14.00148869,  15.0006378 ,16.00117112,  17.00159081,  17.99848509,  18.99957668,20.00066721,  20.99932292,  21.99992471,  23.00099164,24.00127222,  25.00014385,  26.00014191,  27.00037435,27.9985619 ,  28.99949718,  29.99844772,  30.99911627,31.99965086,  33.00211019,  34.00240288,  34.99889972,36.00240406,  37.0002379 ,  37.99958145,  38.99825111,39.99932529,  40.9998374 ,  42.00034236,  43.00206289,43.9994545 ,  45.00141283,  46.00041818,  47.00132581,48.00216031,  48.99812424,  50.00060522,  51.00049892,51.99817633,  52.9997362 ])
frequency = 6
pd.DataFrame(seasonalDummy(tsArray, frequency)).head(2)

Canova-Hansen用sample data示例代码如下：

from numpy.linalg import lstsq as lsqn = len(tsArray)
frec = np.ones(int((frequency + 1) / 2), dtype = int)
ltrunc = int(np.round(frequency * np.power(n / 100.0, 0.25)))

y i = μ + x i ′ β + S i + e i y_i = \mu + x^{\prime}_i \beta + S_i + e_i yi=μ+xi′β+Si+ei
S i = ∑ j = 1 m / 2 f j i ′ γ j S_i = \sum_{j=1}^{m/2} f^{\prime}_{ji} \gamma_j Si=∑j=1m/2fji′γj

其中 f j i ′ = ( c o s ( ( j / q ) π i ) , s i n ( ( j / q ) π i ) ) f^{\prime}_{ji} = (cos((j/q) \pi i), sin((j/q) \pi i)) fji′=(cos((j/q)πi),sin((j/q)πi))

# create dummy column f'ji
r1 = seasonalDummy(tsArray, frequency)
#create intercept column for regression
r1wInterceptCol = np.column_stack([np.ones(r1.shape[0], dtype = float), r1])# residual ei:
result = lsq(a = r1wInterceptCol, b = tsArray)
residual = tsArray - np.matmul(r1wInterceptCol, result[0])

long-run covariance matrix： Ω = l i m n → ∞ 1 n E ( F n F n ′ ) \Omega = lim_{n \to \infty}\frac{1}{n}E(F_n F_n^{\prime}) Ω=limn→∞n1E(FnFn′)
在ei可能有serial correlation的情况下可以用estimate
Ω ^ = ∑ k = − m m w ( k m ) 1 n ∑ i F i + k F i ′ \hat{\Omega} = \sum_{k=-m}^{m}w(\frac{k}{m})\frac{1}{n}\sum_{i}F_{i+k}F_i^{\prime} Ω^=k=−m∑mw(mk)n1i∑Fi+kFi′

fhat = np.zeros([n, frequency - 1], dtype = float)
fhataux = np.zeros([n, frequency - 1], dtype = float)for i in np.arange(0, frequency - 1):fhataux[:, i] = r1[:, i] * residualfor i in np.arange(0, n):for j in np.arange(0, frequency - 1):mySum = sum(fhataux[0:(i + 1), j])fhat[i, j] = mySum

w ( ⋅ ) w(\cdot) w(⋅) 是kernel function，来自rob j. Hyndman的forecast包:

wnw = np.ones(ltrunc, dtype = float) - np.arange(1, ltrunc + 1, 1) / (ltrunc + 1)

Ω ^ f \hat{\Omega}^f Ω^f的计算：

Ne = fhataux.shape[0]
omnw = np.zeros([fhataux.shape[1], fhataux.shape[1]], dtype = float)
for k in np.arange(0, ltrunc):omnw = omnw + np.matmul(fhataux.T[:, (k+1):Ne], fhataux[0:(Ne-(k+1)), :]) * float(wnw[k])
cross = np.matmul(fhataux.T, fhataux)
omnwplusTranspose = omnw + omnw.T
omfhat = (cross + omnwplusTranspose) / float(Ne)

Generalized Hannan’s model：通过设置矩阵A选择需要测试的频率的子集。在程序中用的A = eye:

sq = np.arange(0, frequency - 1, 2)
frecob = np.zeros(frequency - 1, dtype = int)
for i in np.arange(0, len(frec)):if (frec[i] == 1) & (i + 1 == int(frequency / 2.0)):frecob[sq[i]] = 1if (frec[i] == 1) & (i + 1 < int(frequency / 2.0)):frecob[sq[i]] = 1frecob[sq[i] + 1] = 1a = frecob.tolist().count(1)  # find nr of 1's in frecob
A = np.zeros([frequency - 1, a], dtype = float)
j = 0
for i in np.arange(0, frequency - 1):if frecob[i] == 1:A[i, j] = 1j = j + 1

LM statistic的计算（Nyblom(1989), Hansen(1990)）：当LM statistic值超过对应自由度的critical value时，拒绝（null hypothesis = 没有单位根）：
L M s t a t i s t i c = 1 n 2 t r a c e ( ( A ′ Ω ^ f A ) − 1 A ′ ∑ i = 1 n F i ^ F i ^ ′ A ) LM statistic = \frac{1}{n^2} trace((A' \hat{\Omega}^f A)^{-1} A' \sum_{i=1}^{n}\hat{F_i} \hat{F_i}^{\prime}A) LMstatistic=n21trace((A′Ω^fA)−1A′∑i=1nFi^Fi^′A)

其中 ∑ i = 1 n F i ^ \sum_{i=1}^{n}\hat{F_i} ∑i=1nFi^ 和 ∑ i = 1 n F i ^ ′ \sum_{i=1}^{n} \hat{F_i}^{\prime} ∑i=1nFi^′由前面步骤中的 f ^ \hat{f} f^ 给出:

from numpy.linalg import svdaTomfhat = np.matmul(A.T, omfhat)
tmp = np.matmul(aTomfhat, A)machineDoubleEps = 2.220446e-16problems = min(svd(tmp)[1]) < machineDoubleEps # svd[1] are the singular values
if problems:stL = 0.0
else:solved = np.linalg.solve(tmp, np.eye(tmp.shape[1], dtype = float))step1 = np.matmul(solved, A.T)step2 = np.matmul(step1, fhat.T)step3 = np.matmul(step2, fhat)step4 = np.matmul(step3, A)stL = (1.0 / np.power(n, 2.0)) * sum(np.diag(step4))

在Canova-Hansen test接受alternative hypothesis的情况下对时间序列进行lag = 待测频率的差分（季节差分）。

季节性检测：
在进行季节性时间序列稳定性检测之前，首先判断a.时间序列是否有季节性，和b.时间序列在什么频率上有季节性。结果会作为时间序列稳定性检测的参数输入。

季节性检测根据离散傅里叶变换和自相关函数的“与”关系得出结论（只有两个method都返回真值，才会判定时间序列有季节性）。

离散傅里叶变换吧时间序列从时域变为频域。变换后频域的新序列为：

X k = ∑ n = 0 N − 1 x n ⋅ [ c o s ( 2 π k n / N ) − i ⋅ s i n ( 2 π k n / N ) ] X_k = \sum_{n=0}^{N-1}x_n \cdot [cos(2 \pi kn/N) - i \cdot sin(2 \pi kn/N)] Xk=n=0∑N−1xn⋅[cos(2πkn/N)−i⋅sin(2πkn/N)]

在待检测频率上如果能量为最大值，则返回真值。

自相关函数检测最大lag = 待检频率各阶上的correlation coefficient。时间点t和s之间的自相关性R的定义为：

R ( s , t ) = E [ ( X t − μ t ) ( X s − μ s ) σ t σ s R(s, t) = \frac{E[(X_t - \mu_t)(X_s - \mu_s)}{\sigma_t \sigma_s} R(s,t)=σtσsE[(Xt−μt)(Xs−μs)

如果在待检频率上的相关系数超过双边confidence interval在0.05的临界值

clim = norm.ppf((1 + ci) / 2) / np.sqrt(n)

则method返回真值。

DFT调用了numpy.fft.fft方法。ACF调用了statsmodels.tsa.stattools.acf方法。

测量函数：根据不同情况采用不同模型测量方法。算法使用了Rob J. Hyndman的 MASE (mean absolute scaled error)。与其他测量方法的优劣对比。

定义：
R-squared:
S S t o t = ∑ i ( y i − y ˉ ) 2 SS_{tot} = \sum_{i}(y_i - \bar{y})^2 SStot=∑i(yi−yˉ)2
S S r e s = ∑ i ( y i − y ^ i ) 2 SS_{res} = \sum_{i}(y_i - \hat{y}_i)^2 SSres=∑i(yi−y^i)2
R 2 = 1 − S S r e s S S t o t R^2 = 1 - \frac{SS_{res}}{SS_{tot}} R2=1−SStotSSres

RMSE:
R M S E = ∑ t = 1 n ( y ^ t − y t ) 2 n RMSE = \sqrt{\frac{\sum_{t=1}^{n} (\hat{y}_t - y_t)^2}{n}} RMSE=n∑t=1n(y^t−yt)2

MAE:
M A E = ∑ t = 1 n ∣ y ^ t − y t ∣ n MAE = \frac{\sum_{t=1}^{n} |\hat{y}_t - y_t|}{n} MAE=n∑t=1n∣y^t−yt∣

MAPE:
M A P E = 100 n ∑ t = 1 n ∣ y ^ t − y t y t ∣ MAPE = \frac{100}{n} \sum_{t=1}^{n} | \frac{\hat{y}_t - y_t}{y_t} | MAPE=n100∑t=1n∣yty^t−yt∣

sMAPE:
S M A P E = 100 n ∑ t = 1 n ∣ y ^ t − y t ∣ ( ∣ y ^ t ∣ + ∣ y t ∣ ) / 2 SMAPE = \frac{100}{n} \sum_{t=1}^{n} \frac{|\hat{y}_t - y_t|}{(|\hat{y}_t| + |y_t|)/2} SMAPE=n100∑t=1n(∣y^t∣+∣yt∣)/2∣y^t−yt∣

MASE without seasonality:
M A S E = ∑ t = 1 n ∣ y ^ t − y t ∣ n n − 1 ∑ t = 2 n ∣ y t − y t − 1 ∣ MASE = \frac{\sum_{t = 1}^{n} |\hat{y}_t - y_t|}{\frac{n}{n-1} \sum_{t=2}^{n} |y_t - y_{t-1}|} MASE=n−1n∑t=2n∣yt−yt−1∣∑t=1n∣y^t−yt∣

MASE with seasonality:
M A S E = ∑ t = 1 n ∣ y ^ t − y t ∣ n n − m ∑ t = m + 1 n ∣ y t − y t − m ∣ MASE = \frac{\sum_{t = 1}^{n} |\hat{y}_t - y_t|}{\frac{n}{n-m} \sum_{t=m+1}^{n} |y_t - y_{t-m}|} MASE=n−mn∑t=m+1n∣yt−yt−m∣∑t=1n∣y^t−yt∣

趋势平滑：在SARIMA模型中引入时间序列的趋势作为exogenous regressor(X)，有几种算法可以选择：

Lowess (locally weighted scatterplot smoothing): 基于KNN的非参数拟合方法。代码调用了 statsmodels.api.nonparametric.lowess

RANSAC
Random sample consensus，一种robust regression方法，可以探测异常值并使拟合对于异常值的敏感度降低。代码调用sklearn.linear_model.RANSACRegressor

Weighted moving average:

指数平滑递归表达：
W n = ( 1 − α ) ∗ W n − 1 + α ∗ y n W_n = (1-\alpha) * W_{n-1} + \alpha * y_n Wn=(1−α)∗Wn−1+α∗yn
W 0 = y 0 W_0 = y_0 W0=y0
α = 2 / ( s p a n + 1 ) \alpha = 2 / (span + 1) α=2/(span+1)

调用了 pandas.ewma

关于脉冲响应：

如果有另外的（多维）exogenous regressor Xi 影响预测模型，比如类似离散脉冲波形的机会点数据：假设各个脉冲regressor之间是独立的，并不受时间序列本身的影响：可以用多元线性回归首先发现Xi 和时间序列趋势yhat的关系。

考虑到历史上时间序列对于脉冲输入的响应不同，在算法中会测试三种不同脉冲响应与时间序列的相关性，并挑选相关性最强的脉冲响应作为Xi向量。这三种分别是a. 原始脉冲，b. 经指数平滑的脉冲，c. 经指数平滑并累加的脉冲（cumulative)。

示例：

%matplotlib inline
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')def myEwma(x, histPeriod = 6, fcstPeriod = 18):    from pandas import ewmaxfit = ewma(x, span = histPeriod)xpred = np.zeros(fcstPeriod)tmp = np.zeros(histPeriod + 1)tmp[:histPeriod] = x[-histPeriod:].copy()tmp[histPeriod] = xfit[-1]for i in np.arange(0, fcstPeriod):xpred[i] = ewma(tmp, span = histPeriod)[-1]tmp = shiftLeft(tmp)tmp[-1] = xpred[i]return xfit, xpreddef shiftLeft(_ar, step = 1):if step == 0:return _arar = _ar.copy()ar[0:-step] = ar[step:]ar[-step:] = 0return aroriginal = np.array([3000,0,0,0,0,0,0,0,0,0,1000,0,0,0,0,0,0,0,0,1000,0,0,0,0,0,0,0,0,0,0])
pd.DataFrame(original).plot(title = 'original')
movingAverage = myEwma(original)[0]
pd.DataFrame(movingAverage).plot(title = 'moving average')
cumulative = movingAverage.cumsum()
pd.DataFrame(cumulative).plot(title = 'cumulative')

检测相加性：时间序列拆分成趋势，季节性和残差的方式有相加和相乘两种。

y = Trend + Seasonality + Residual, 或者

y = Trend * Seasonality * Residual

如果是相乘的情况，残差的分布是不稳定的。所以如果时间序列没有通过相加性检测，会对时间序列做对数处理，变为相加：

y n e w = l o g ( y ) = l o g ( T r e n d ∗ S e a s o n a l i t y ∗ R e s i d u a l ) = l o g ( T r e n d ) + l o g ( S e a s o n a l i t y ) + l o g ( R e s i d u a l ) ynew = log(y) = log(Trend * Seasonality * Residual) = log(Trend) + log(Seasonality) + log(Residual) ynew=log(y)=log(Trend∗Seasonality∗Residual)=log(Trend)+log(Seasonality)+log(Residual)

自动搜索SARIMA参数空间 Auto ARIMA：搜索差分阶数，检测季节性的存在，并搜索能给出Akaike Information Criterion最小值的ARMA模型维度 ARMA(p, q, P, Q)

ARMA模型范式：

X t − α 1 X t − 1 − ⋯ − α p ′ X t − p ′ = ϵ t + θ 1 ϵ t − 1 + ⋯ + θ q ϵ t − q X_t - \alpha_1 X_{t-1} - \dots - \alpha_{p^{\prime}} X_{t-p^{\prime}} = \epsilon_t + \theta_1 \epsilon_{t-1} + \dots + \theta_q \epsilon_{t-q} Xt−α1Xt−1−⋯−αp′Xt−p′=ϵt+θ1ϵt−1+⋯+θqϵt−q

自动搜索参数空间返回的结果是p和q的个数。作为输入给SARIMAX核心算法

搜索参数空间的初始值来自Rob J. Hyndman的paper section3.2.

SARIMAX 带外部变量的季节性自回归差分平滑算法：预测是基于SARIMAX的state-space representation。核心算法调用了statsmodels.api.tsa.statespace.SARIMAX.

总结

以上是SARIMAX各个功能模块的详细数学方法和编程实现。由于一些语言（包括python）的时间序列预测开源算法包里缺少比如稳定性检测、季节性差分和自动搜索参数空间的功能，需要自己依照数学公式实现。

季节性ARIMA：时间序列预测相关推荐

时序预测 | MATLAB实现ARIMA时间序列预测(GDP预测)
时序预测 | MATLAB实现ARIMA时间序列预测(GDP预测) 目录时序预测 | MATLAB实现ARIMA时间序列预测(GDP预测) 预测效果基本介绍模型设计模型分析学习总结参考资料 ...
基于趋势和季节性的时间序列预测
时间序列预测是基于时间数据进行预测的任务.它包括建立模型来进行观测,并在诸如天气.工程.经济.金融或商业预测等应用中推动未来的决策. 本文主要介绍时间序列预测并描述任何时间序列的两种主要模式(趋势和季 ...
基于python的时间序列案例-案例-基于自动PDQ值的ARIMA时间序列预测应用
Python的科学计算和数据挖掘相关库中,pandas和statsmodels都提供了时间序列相关分析功能,本示例使用的是statsmodels做时间序列预测应用.有关时间序列算法的选择,实际场景中最 ...
时间序列预测，非季节性ARIMA及季节性SARIMA
Python 3中使用ARIMA进行时间序列预测的指南在本教程中,我们将提供可靠的时间序列预测.我们将首先介绍和讨论自相关,平稳性和季节性的概念,并继续应用最常用的时间序列预测方法之一,称为ARIM ...
python用ARIMA模型预测CO2浓度时间序列实现
全文下载链接:http://tecdat.cn/?p=20424 时间序列为预测未来数据提供了方法.根据先前的值,时间序列可用于预测经济,天气的趋势.时间序列数据的特定属性意味着通常需要专门的统计方法 ...
时间序列预测-入门概念
时间序列定义时间序列(英语:time series)是一组按照时间发生先后顺序进行排列的数据点序列.通常一组时间序列的时间间隔为一恒定值(如1秒,5分钟,12小时,7天,1年),因此时间序列可以作为 ...
SPSS软件实操——ARIMA时间序列预测模型
相关文章链接时间序列预测--ARIMA模型https://blog.csdn.net/beiye_/article/details/123317316?spm=1001.2014.3001.5501 ...
机器学习（MACHINE LEARNING）使用ARIMA进行时间序列预测
文章目录 1 引言 2 简介 3 python代码实现 4 代码解析 1 引言在本文章中,我们将提供可靠的时间序列预测.我们将首先介绍和讨论自相关,平稳性和季节性的概念,并继续应用最常用的时间序列预 ...
理论加实践，终于把时间序列预测ARIMA模型讲明白了
上篇我们一起学习了一些关于时间序列预测的知识.而本文将通过一段时间内电力负荷波动的数据集来实战演示完整的ARIMA模型的建模及参数选择过程,其中包括数据准备.随机性.稳定性检验.本文旨在实践中学习,在 ...

季节性ARIMA：时间序列预测

总结

季节性ARIMA：时间序列预测相关推荐

最新文章

热门文章