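Forecasting Sales of 50 Items at 10 Stores for the Next Three Months

1. Introduction

We have five years of sales history for 50 items across 10 stores, and we try ARIMA, regression, and GBDT models to forecast sales for the next three months.

Time series make it possible to forecast future values. Based on past values, a time series can be used to forecast trends in the economy, the weather, and capacity planning. The specific properties of time-series data mean that specialized statistical methods are usually required.

Questions to answer in the analysis

How did the 50 items sell over the past five years?
How did the 10 stores sell over the past five years?
How do sales of the 50 items relate to time?
How well can five years of sales for 50 items at 10 stores predict sales of the 50 items over the next three months?

2. Data Analysis and Exploration

Import the libraries and data, and take a first look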

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import gc
import lightgbm as lgb
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, KFold, GridSearchCV, train_test_split
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller
import warnings
warnings.filterwarnings('always')
warnings.filterwarnings('ignore')
%matplotlib inline

df = pd.read_csv("/home/kesci/input/data9269/train.csv")
df_pred = pd.read_csv("/home/kesci/input/data9269/test.csv")
df["year_month"] = df["date"].str[: -3]
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["week"] = df["date"].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
df["store"] = df["store"].apply(lambda x: "store {}".format(x))
df["item"] = df["item"].apply(lambda x: "item {}".format(x))
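
The date handling above can be checked on a tiny frame before running it on the full file. The sketch below is only an illustration: the three rows are made up, and `sample` is a hypothetical name that does not appear in the original notebook.

```python
import pandas as pd

# Hypothetical three-row sample in the same shape as train.csv (date, store, item, sales).
sample = pd.DataFrame({
    "date": ["2013-01-01", "2013-01-05", "2013-02-03"],
    "store": [1, 1, 2],
    "item": [1, 1, 7],
    "sales": [13, 11, 20],
})

sample["year_month"] = sample["date"].str[:-3]   # "2013-01-01" -> "2013-01"
sample["date"] = pd.to_datetime(sample["date"])
sample["year"] = sample["date"].dt.year
sample["month"] = sample["date"].dt.month
sample["week"] = sample["date"].dt.day_name()    # weekday name, e.g. "Tuesday"

print(sample[["year_month", "year", "month", "week"]])
```

The string slice has to run before `pd.to_datetime`, because the `.str` accessor only works while the column still holds text.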

Sales of the 50 items over the past five years

plt.figure(figsize=(16, 6))
plt.title("Product Sales Volume")
sns.set(style="whitegrid")
df.groupby("item")["sales"].sum().plot(kind="bar");

Sales of the 10 stores over the past five years

plt.figure(figsize=(16, 6))
plt.title("Store Sales Volume")
df.groupby("store")["sales"].sum().plot(kind="bar");

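How sales of the 50 items relate to time

def plot_salesvolume(store, item, col):
    # store and item are integer ids; the columns hold labels such as "store 4" and "item 7"
    data = df[(df["store"] == "store {}".format(store)) & (df["item"] == "item {}".format(item))]
    y = data.groupby(col)["sales"].sum()
    plt.figure(figsize=(14, 4))
    plt.xticks(rotation=90)
    plt.title("Product {} Sales Volume at Store {}".format(item, store))
    sns.lineplot(x=y.index, y=y.values)

Sales by year and month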

plot_salesvolume(4, 7, "year_month")

Sales by year

plot_salesvolume(np.random.randint(1, 11), np.random.randint(1, 11), "year")

Sales by month

plot_salesvolume(np.random.randint(1, 11), np.random.randint(1, 11), "month")

Sales by day of the week

plot_salesvolume(np.random.randint(1, 11), np.random.randint(1, 11), "week")
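
When sales are grouped by weekday name, pandas sorts the groups alphabetically (Friday, Monday, Saturday, ...), which makes the weekly pattern harder to read. One possible fix, sketched here on made-up data, is an ordered categorical. `WEEKDAYS`, `toy`, and `by_weekday` are hypothetical names used only for this illustration.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical daily sales for two full weeks; 2013-01-07 is a Monday.
dates = pd.date_range("2013-01-07", periods=14, freq="D")
toy = pd.DataFrame({"date": dates, "sales": range(14)})

# An ordered categorical makes groupby return the days in calendar order.
toy["week"] = pd.Categorical(toy["date"].dt.day_name(),
                             categories=WEEKDAYS, ordered=True)

by_weekday = toy.groupby("week", observed=False)["sales"].sum()
print(by_weekday)
```

The same column could be passed to `plot_salesvolume` as `col`, assuming the `week` column of `df` is converted in the same way.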

3. Modeling and Forecasting

Data preparation

week = pd.get_dummies(df.week)
df = pd.concat([df, week], axis=1)
feature_column = ['store', 'item', 'month', 'week']
X = pd.DataFrame()
y = df['sales']
for i in feature_column:
    append_list = pd.get_dummies(df[i])
    X = pd.concat([X, append_list], axis=1)
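
The loop above one-hot encodes each feature column and concatenates the results. As a rough sketch of an equivalent single call, `pd.get_dummies` also accepts a `columns` argument. The frame below is made up and `toy` / `X_toy` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows with the same four feature columns as the article.
toy = pd.DataFrame({
    "store": ["store 1", "store 2", "store 1"],
    "item": ["item 1", "item 1", "item 2"],
    "month": [1, 1, 2],
    "week": ["Monday", "Sunday", "Monday"],
})
feature_column = ["store", "item", "month", "week"]

# columns= forces the integer month column to be encoded as categories too,
# and each new column is prefixed with the name of the feature it came from.
X_toy = pd.get_dummies(toy[feature_column], columns=feature_column)
print(X_toy.columns.tolist())
```

The prefixes (`month_1`, `week_Monday`, ...) keep the column names self-describing, which helps when reading a feature-importance plot later.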

Time series

y_r = y.copy()  # y_r is not defined in the original; assumed here to be a copy of the sales series
y_r.index = df["date"]

def test_stationarity(timeseries):
    # Determine rolling statistics
    rolmean = timeseries.rolling(12).mean().bfill()
    rolstd = timeseries.rolling(12).std().bfill()
    # Plot rolling statistics
    plt.figure(figsize=(16, 6))
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean & Standard Deviation')
    plt.show(block=False)

test_stationarity(y_r)

y_r.rolling(3).mean().bfill().plot(figsize=(16,6));

First, try a random forest classifier to predict sales bins and inspect the features

plt.figure(figsize=(16, 6))
sns.histplot(df['sales'], kde=True);  # distplot is deprecated in recent seaborn releases

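cut_list = []
for i in range(5):
    cut_list.append(np.percentile(df['sales'], i * 25))
print(cut_list)
# include_lowest=True keeps the minimum value in the first bin instead of turning it into NaN
y_ = pd.cut(df['sales'], cut_list, include_lowest=True).astype(str)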
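
The cell above builds quartile bins by hand from percentiles. A shorter route to the same kind of target, shown here as a sketch on made-up numbers, is `pd.qcut`, which computes the quantile edges itself. `sales` and `bins` are hypothetical names.

```python
import pandas as pd

# Hypothetical sales values, eight observations.
sales = pd.Series([1, 5, 9, 12, 20, 33, 47, 58])

# q=4 splits at the 25th, 50th and 75th percentiles; the minimum is included.
bins = pd.qcut(sales, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
print(bins.value_counts().sort_index())
```

With real sales counts, repeated values at a quantile edge can make `qcut` raise an error about duplicate bin edges; passing `duplicates="drop"` is one way to handle that, at the cost of fewer bins.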

time: 341 µs

X_train, X_test, y_train, y_test = train_test_split(X, y_, test_size=0.2, random_state=40)
clf = RandomForestClassifier().fit(X_train, y_train)
print("train score: {:.4f}".format(clf.score(X_train, y_train)))
print("test score: {:.4f}".format(clf.score(X_test, y_test)))

train score: 0.8424
test score: 0.7471
time: 43.3 s

best_rf = pd.DataFrame({})
i = 0
for n_estimator in np.arange(1, 52, 10):
    for min_samples_leaf in np.arange(1, 10, 2):
        clf = RandomForestClassifier(n_jobs=-1, n_estimators=n_estimator, min_samples_leaf=min_samples_leaf)
        clf.fit(X_train, y_train)
        train_score = clf.score(X_train, y_train)
        test_score = clf.score(X_test, y_test)
        print("train score: {:.4f}".format(train_score))
        print("test score: {:.4f}".format(test_score))
        best_rf.loc[i, 'n_estimator'] = n_estimator
        best_rf.loc[i, 'min_samples_leaf'] = min_samples_leaf
        best_rf.loc[i, 'train_score'] = train_score
        best_rf.loc[i, 'test_score'] = test_score
        i += 1

time: 439 µs

best_rf[['train_score', 'test_score']].plot(figsize=(16, 6));

time: 399 ms

model_rf = RandomForestClassifier(n_jobs=-1, n_estimators=51, min_samples_leaf=3)  # build the RandomForestClassifier
model_rf.fit(X_train, y_train)  # train the model
print("train score: {:.4f}".format(model_rf.score(X_train, y_train)))
print("test score: {:.4f}".format(model_rf.score(X_test, y_test)))

train score: 0.8316
test score: 0.7854
time: 3min 8s

Features that influence sales

pd.Series(model_rf.feature_importances_, X.columns).sort_values(ascending=False).plot.bar(figsize=(16, 6));

time: 1.58 s

Modeling sales with DecisionTreeRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
dt_Regressor = DecisionTreeRegressor().fit(X_train, y_train)
print("train score: {:.4f}".format(dt_Regressor.score(X_train, y_train)))
print("test score: {:.4f}".format(dt_Regressor.score(X_test, y_test)))

# Set up the parameter grid:
param_grid = [{'min_samples_split': np.arange(2, 5)}]
dt_sv = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5)
dt_sv.fit(X, y)

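# Note: this is an in-sample R^2 on all of X, not the cross-validated score (dt_sv.best_score_).
# r2_score expects (y_true, y_pred); the arguments are kept in the original order that produced the output below.
print("best param:{0}\nbest score:{1}".format(dt_sv.best_params_, r2_score(dt_sv.predict(X), y)))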
dtr_best = dt_sv.best_estimator_

best param:{'min_samples_split': 4}
best score:0.8842986156506648
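
Both `train_test_split` and the default `cv=5` split ignore time order, so a model can be trained on dates that come after the ones it is scored on. For a forecasting task, a chronological split is usually a fairer check. The notebook imports `TimeSeriesSplit` without using it; the sketch below shows one way it could be applied, on synthetic data with hypothetical names (`X_toy`, `y_toy`, `scores`).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 200
t = np.arange(n)                            # rows are already in time order
X_toy = np.column_stack([t % 7, t % 12])    # stand-ins for weekday and month
y_toy = 10 + 3 * (t % 7) + rng.normal(scale=0.5, size=n)

tscv = TimeSeriesSplit(n_splits=4)
scores = []
for train_idx, test_idx in tscv.split(X_toy):
    # Every training row precedes every test row in time.
    assert train_idx.max() < test_idx.min()
    model = DecisionTreeRegressor(min_samples_split=4)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    scores.append(model.score(X_toy[test_idx], y_toy[test_idx]))

print(np.round(scores, 3))
```

On the real data the rows would first need to be sorted by `date`; the same splitter can also be passed to `GridSearchCV` through its `cv` argument.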

plt.style.use("ggplot")  # use the built-in ggplot style
plt.figure(figsize=(16, 6))  # create the figure
plt.plot(np.arange(X.shape[0]), y, label='True')  # plot the actual values
plt.plot(np.arange(X.shape[0]), dt_Regressor.predict(X), label='Predicted.')  # plot the predicted values
plt.legend(loc=0)  # place the legend
plt.show()  # show the figure

time: 3.97 s

Modeling sales with linear regression

plt.scatter(x= df["year"].values, y = y, alpha = 0.2)

lr = LinearRegression().fit(pd.DataFrame(df["year"]), y)

Combining the two models to predict sales

X_dt = pd.Series(dtr_best.predict(X))
X_lr = pd.Series(lr.predict(pd.DataFrame(df["year"])))
XX = pd.concat([X_lr, X_dt], axis=1)
XX_train, XX_test, yy_train, yy_test = train_test_split(XX, y, test_size=0.2, random_state=1)

Blending the two models directly with GradientBoostingRegressor

final_model = GradientBoostingRegressor().fit(XX_train, yy_train)
print(r2_score(y, final_model.predict(XX)))

0.9394880975999979
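
One caveat about the score above: the two base models were fitted on all rows, and the R² is computed on all rows as well, including the ones the blender was trained on, so it is likely optimistic. A common alternative is to feed the blender out-of-fold predictions and score it on rows no model has seen. The following is a minimal sketch of that idea on synthetic data, not the author's method; every variable name in it is hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
n = 600
cat = rng.integers(0, 7, size=n)         # stands in for the one-hot features
year = rng.integers(2013, 2018, size=n)  # stands in for df["year"]
y_toy = 20 + 4 * cat + 2 * (year - 2013) + rng.normal(scale=1.0, size=n)

X_cat = cat.reshape(-1, 1)
X_year = year.reshape(-1, 1)
idx_train, idx_test = train_test_split(np.arange(n), test_size=0.2, random_state=1)

tree = DecisionTreeRegressor(min_samples_split=4)
lin = LinearRegression()

# Out-of-fold predictions: each training row is predicted by a model
# that did not see that row during fitting.
meta_train = np.column_stack([
    cross_val_predict(lin, X_year[idx_train], y_toy[idx_train], cv=5),
    cross_val_predict(tree, X_cat[idx_train], y_toy[idx_train], cv=5),
])

# Refit the base models on the full training part to build hold-out features.
lin.fit(X_year[idx_train], y_toy[idx_train])
tree.fit(X_cat[idx_train], y_toy[idx_train])
meta_test = np.column_stack([
    lin.predict(X_year[idx_test]),
    tree.predict(X_cat[idx_test]),
])

blender = GradientBoostingRegressor(random_state=0).fit(meta_train, y_toy[idx_train])
holdout_r2 = r2_score(y_toy[idx_test], blender.predict(meta_test))
print("hold-out R^2: {:.3f}".format(holdout_r2))
```

scikit-learn's `StackingRegressor` wraps the same pattern when all base models share one feature matrix; here the two base models use different inputs, so the columns are assembled by hand.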

model_gbr = GradientBoostingRegressor()  # build the GradientBoostingRegressor
# 'squared_error' and 'absolute_error' were named 'ls' and 'lad' before scikit-learn 1.0
parameters = {'loss': ['squared_error', 'absolute_error'], 'min_samples_leaf': [2, 3, 4], 'alpha': [0.3, 0.6, 0.9]}  # parameters to tune
model_final = GridSearchCV(estimator=model_gbr, param_grid=parameters, cv=5)  # build the cross-validated search
model_final.fit(XX, y)  # run the cross-validated search
print ('Best score is:', model_final.best_score_)  # best cross-validated score
print ('Best parameter is:', model_final.best_params_)  # best parameters found

plt.style.use("ggplot")  # use the built-in ggplot style
plt.figure(figsize=(16, 6))  # create the figure
plt.plot(np.arange(XX.shape[0]), y, label='true y')  # plot the actual values
plt.plot(np.arange(XX.shape[0]), final_model.predict(XX), label='predicted y')  # plot the predicted values
plt.legend(loc=0)  # place the legend
plt.show()  # show the figure

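4. Summary and Evaluation

Over the past five years, total sales per item ranged from about 300,000 to 1.6 million units; item 15 sold best and item 5 sold worst.
Over the past five years, total sales per store ranged from about 3 million to 6.2 million units; store 2 sold best and store 7 sold worst.
Sales grew year over year with periodic fluctuations: July was the strongest month and January the weakest, and weekends were the strongest days while Monday was the weakest.
To forecast the next three months of sales for the 50 items from five years of history at 10 stores, we first fitted a decision tree on the discrete features and a regression on the continuous feature. Trained on their own, these models were slow and showed high variance; after blending the two models, r2_score rose from 80% to 93.9%.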
