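Machine Learning Training Camp (to join the group, contact QQ: 2279055353): a live, interactive platform for walking through machine learning case studies
Upcoming live case study: forecasting fluctuations in product sales with big data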

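Case overview

Supermarket retailers such as Walmart and Carrefour rely on sales forecasting systems and tools to replenish their stock. A sound forecasting system also helps a retailer gain an edge across the rest of its supply chain. If you can forecast how a product will sell over the coming period, you can plan inventory counts and product assortment sensibly.

The task is to take the store-item sales data provided for 2013 through 2017 and use time series techniques to forecast, for the next 3 months, the sales of 50 different items at 10 different stores.

Data source: Kaggle competition
Implementation: R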

Data files

The training set train.csv and the test set test.csv can be downloaded from Baidu Netdisk:

Link: https://pan.baidu.com/s/1j2FCvikgR2aPkfW2bDtLsA Password: mmby

Loading the R packages

First, load the required R packages.

rm(list=ls())
suppressMessages(library(data.table))
suppressMessages(library(DT))
suppressMessages(library(timeSeries))
suppressMessages(library(tidyverse))
suppressMessages(library(reshape))
suppressMessages(library(stringr))
suppressMessages(library(doBy))
suppressMessages(library(formattable))
suppressMessages(library(gridExtra))
suppressMessages(library(ggplot2))
suppressMessages(library(plotly))
suppressMessages(library(corrplot))
suppressMessages(library(wesanderson))
suppressMessages(library(RColorBrewer))
suppressMessages(library(zoo))
suppressMessages(library(forecast))
suppressMessages(library(prophet))
set.seed(2018)

Loading the data sets

train=fread("e:/kaggle_exercises/forecast/input/train.csv")
sprintf("The train data set has %d rows and %d columns", nrow(train), ncol(train) )
str(train)
test  <- fread("e:/kaggle_exercises/forecast/input/test.csv")
sprintf("The test data set has %d rows and %d columns", nrow(test), ncol(test) )
str(test)
print("the summary of train sales is:")
summary(train$sales)

Extracting the year and month variables from date

# Extract the year and the month of the year
train$Year=year(train$date)        # calendar year of each date, e.g. 2013, 2014
train$Month=as.yearmon(train$date) # as.yearmon() from the zoo package gives the year-month, e.g. Jan 2013, Feb 2015
head(train)
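
As a small, self-contained illustration of the extraction step, the sketch below derives the same two variables with base R only. The `dates` vector is made up for the example, and the month is kept as a plain "YYYY-MM" string rather than the `yearmon` object that zoo returns, so treat it as an approximation of the step above rather than a drop-in replacement.

```r
# Toy dates standing in for train$date (hypothetical values)
dates <- as.Date(c("2013-01-01", "2013-02-15", "2017-12-31"))

# Calendar year as an integer, e.g. 2013
year_of_date <- as.integer(format(dates, "%Y"))

# Year-month label as a string, e.g. "2013-01"
month_of_year <- format(dates, "%Y-%m")

print(data.frame(date = dates, Year = year_of_date, Month = month_of_year))
```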

Missing value check

colSums(is.na(train))
# Function 1: plot the missing-value profile of a data set
plot_missing <- function(data, title = NULL, ggtheme = theme_gray(),
                         theme_config = list("legend.position" = c("bottom"))) {
  ## Declare variables first to pass R CMD check
  feature <- num_missing <- pct_missing <- group <- NULL
  ## Check if input is data.table
  is_data_table <- is.data.table(data)
  ## Detect input data class
  data_class <- class(data)
  ## Set data to data.table
  if (!is_data_table) data <- data.table(data)
  ## Extract missing value distribution
  missing_value <- data.table(
    "feature" = names(data),
    "num_missing" = sapply(data, function(x) {sum(is.na(x))})
  )
  missing_value[, feature := factor(feature, levels = feature[order(-rank(num_missing))])]
  missing_value[, pct_missing := num_missing / nrow(data)]
  missing_value[pct_missing < 0.05, group := "Good"]
  missing_value[pct_missing >= 0.05 & pct_missing < 0.4, group := "OK"]
  missing_value[pct_missing >= 0.4 & pct_missing < 0.8, group := "Bad"]
  missing_value[pct_missing >= 0.8, group := "Remove"][]
  ## Set data class back to original
  if (!is_data_table) class(missing_value) <- data_class
  ## Create ggplot object (scales::comma is named explicitly so that it is not
  ## confused with formattable::comma, which is also attached)
  output <- ggplot(missing_value, aes_string(x = "feature", y = "num_missing", fill = "group")) +
    geom_bar(stat = "identity") +
    geom_text(aes(label = paste0(round(100 * pct_missing, 2), "%"))) +
    scale_fill_manual("Group",
                      values = c("Good" = "#1a9641", "OK" = "#a6d96a", "Bad" = "#fdae61", "Remove" = "#d7191c"),
                      breaks = c("Good", "OK", "Bad", "Remove")) +
    scale_y_continuous(labels = scales::comma) +
    coord_flip() +
    xlab("Features") + ylab("Number of missing rows") +
    ggtitle(title) +
    ggtheme + theme_linedraw() +
    do.call(theme, theme_config)
  ## Print plot
  print(output)
  ## Set return object
  return(invisible(missing_value))
}
plot_missing(train)

As the output shows, the data set contains no missing values.

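Feature visualization

We explore graphically how the features in the training set are distributed and how they change.

Histogram of sales

gbp1 <- wes_palette("GrandBudapest2")[1]
ggplot(train, aes(x=sales)) +
  geom_histogram(fill="#a6d96a", alpha=.9) +
  labs(x=NULL, y=NULL, title = "Histogram of Sale Price") +
  # scale_x_continuous(breaks= seq(0,600000, by=100000)) +
  theme_minimal() + theme(plot.title=element_text(vjust=3, size=15))

The histogram shows that the distribution of sales is positively skewed, with a long right tail.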
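
A histogram gives a visual impression of skew; a numeric check can back it up. The sketch below is only an illustration: `skewness()` is a helper written for this example (the moment-based sample skewness), and `toy_sales` is made-up data, not the Kaggle series. It also shows why a `log1p` transform, which the Prophet section applies before fitting, tends to reduce right skew.

```r
# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Hypothetical right-skewed sample: many small values, a few large ones
toy_sales <- c(10, 12, 13, 15, 15, 18, 20, 25, 40, 90)

skew_raw <- skewness(toy_sales)         # positive for a right-skewed sample
skew_log <- skewness(log1p(toy_sales))  # smaller after the log1p transform
print(c(raw = skew_raw, log1p = skew_log))
```

On the real data the same helper could be applied as `skewness(train$sales)`.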

Daily mean sales and rate of change

The rate of change is defined as

$$rate=\dfrac{sales_{i+1}-sales_i}{sales_i},\quad i=0,1,\dots,t-1$$

The code reports it as a percentage, that is, multiplied by 100.

MSP <- aggregate(sales ~date, train, mean)
# MSP <-na.omit(ddply(data, 'date', summarise, mean(Sale_Prices, na.rm=T)))
sl1 <- ggplot(MSP, aes(x=as.factor(date), y=sales)) +
  geom_line(color=gbp1, aes(group=1), size=1.5) +
  geom_point(colour=gbp1, size = 3.5, alpha=0.5) +
  labs(title="The Growth of Sale Prices by date", x=NULL, y="Sale Price") +
  theme_minimal() + theme(plot.title=element_text(vjust=3, size=15))
# Percentage change from one date to the next; the first date is padded with 0
MSP$rate = c(0, 100*diff(MSP$sales)/MSP[-nrow(MSP),]$sales)
sl2 <- ggplot(MSP, aes(x=as.factor(date), y=rate)) +
  geom_line(color= "gray50", aes(group=1), size=1) +
  # geom_point(colour=gbp1, size = 3.5, alpha=0.5) +
  labs(title="Change rate of Sale Price", x="date", y="rate of change") +
  geom_hline(yintercept = 0, color = gbp1 ) +
  theme_minimal() + theme(plot.title=element_text(size=15))
grid.arrange(sl1,sl2)
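
To make the definition concrete, here is a minimal worked example on four made-up daily means (the numbers are hypothetical and chosen so the arithmetic is easy to follow). It uses the same `diff()` construction as the code above, including the leading 0 for the first observation, which has no predecessor.

```r
# Hypothetical daily mean sales for four consecutive days
sales <- c(50, 55, 44, 44)

# Percentage change relative to the previous day; the first day has no
# predecessor, so it is padded with 0
rate <- c(0, 100 * diff(sales) / sales[-length(sales)])
print(rate)
```

Day 2 is (55 - 50) / 50 = 10%, day 3 is (44 - 55) / 55 = -20%, and day 4 is unchanged.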

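From one date to the next, the rate of change of sales stays roughly stable. Next we look at how it varies by year and by month, starting with month.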

MSP <- aggregate(sales ~Month, train, mean)
# MSP <-na.omit(ddply(data, 'date', summarise, mean(Sale_Prices, na.rm=T)))
sl1 <- ggplot(MSP, aes(x=as.factor(Month), y=sales)) +
  geom_line(color=gbp1, aes(group=1), size=1.5) +
  geom_point(colour=gbp1, size = 3.5, alpha=0.5) +
  labs(title="The Growth of Sale Prices by Month of Year", x=NULL, y="Sale Price") +
  theme_minimal() + theme(plot.title=element_text(vjust=3, size=15))
MSP$rate = c(0, 100*diff(MSP$sales)/MSP[-nrow(MSP),]$sales)
sl2 <- ggplot(MSP, aes(x=as.factor(Month), y=rate)) +
  geom_line(color= "gray50", aes(group=1), size=1) +
  # geom_point(colour=gbp1, size = 3.5, alpha=0.5) +
  labs(title="Change rate of Sale Price", x="Month", y="rate of change") +
  geom_hline(yintercept = 0, color = gbp1 ) +
  theme_minimal() + theme(plot.title=element_text(size=15))
grid.arrange(sl1,sl2)

By month, the rate of change fluctuates periodically. Now let's look at how it changes by year.

MSP <- aggregate(sales ~Year, train, mean)
# MSP <-na.omit(ddply(data, 'date', summarise, mean(Sale_Prices, na.rm=T)))
sl1 <- ggplot(MSP, aes(x=as.factor(Year), y=sales)) +
  geom_line(color=gbp1, aes(group=1), size=1.5) +
  geom_point(colour=gbp1, size = 3.5, alpha=0.5) +
  labs(title="The Growth of Sale Prices by Year", x=NULL, y="Sale Price") +
  theme_minimal() + theme(plot.title=element_text(vjust=3, size=15))
MSP$rate = c(0, 100*diff(MSP$sales)/MSP[-nrow(MSP),]$sales)
sl2 <- ggplot(MSP, aes(x=as.factor(Year), y=rate)) +
  geom_line(color= "gray50", aes(group=1), size=1) +
  # geom_point(colour=gbp1, size = 3.5, alpha=0.5) +
  labs(title="Change rate of Sale Price", x="Year", y="rate of change") +
  geom_hline(yintercept = 0, color = gbp1 ) +
  theme_minimal() + theme(plot.title=element_text(size=15))
grid.arrange(sl1,sl2)

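We can draw four conclusions:

Sales grow year over year;
The yearly rate of change fluctuates in a "high-low" pattern;
The rate of change rises from 2013 to 2014 and then falls from 2014 to 2015;
The highest growth rate occurs in 2014.

If this pattern continues, we would expect the rate of change to rise in 2018.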

Sales by store

Let's look at sales store by store. How many stores does the training set contain?

unique(train$store)

Next we plot the sales of these 10 stores from 2013 to 2017.

Year_state <- aggregate(sales ~store+Year, train,mean)
pal <- rep(brewer.pal(10, "BrBG"),5)
ggplot(Year_state, aes(group = store )) +
  geom_line(aes(x=Year,y=sales,color=store), alpha=0.5, show.legend=F) +
  labs(title="The Growth of Sales Price by Store from 2013 - 2017", x=NULL) +
  theme(panel.background=element_rect(fill = "Black"),
        plot.title=element_text(vjust=3, size=15),
        panel.grid.major=element_line(color = pal))

Store 3 shows the highest yearly sales trajectory, while store 7 shows the lowest.
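
Reading the highest and lowest line off a dark, legend-free plot is error-prone, so a tabular ranking is a useful cross-check. The sketch below shows one way to do it with the same `aggregate()` formula interface, on a made-up three-store data frame (`toy` and its values are hypothetical).

```r
# Hypothetical long-format sales records (store, Year, sales)
toy <- data.frame(
  store = rep(1:3, each = 4),
  Year  = rep(c(2013, 2013, 2014, 2014), times = 3),
  sales = c(20, 22, 24, 26,   35, 37, 40, 42,   10, 11, 12, 13)
)

# Mean sales per store and year, as in the plot data
by_store_year <- aggregate(sales ~ store + Year, toy, mean)

# Average the yearly means per store, then sort from highest to lowest
by_store <- aggregate(sales ~ store, by_store_year, mean)
ranking <- by_store[order(-by_store$sales), ]
print(ranking)
```

Applied to the real data, replacing `toy` with `train` would list the stores from highest to lowest mean sales.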

Yearly sales by item

How many items does the training set train contain?

unique(train$item)

Let's look at the yearly sales trends of these 50 items.

Year_state <- aggregate(sales ~item+Year, train,mean)
pal <- rep(brewer.pal(10, "BrBG"),5)
ggplot(Year_state, aes(group = item )) +
  geom_line(aes(x=Year,y=sales,color=item), alpha=0.5, show.legend=F) +
  labs(title="The Growth of Sales Price by Item from 2013 - 2017", x=NULL) +
  theme(panel.background=element_rect(fill = "Black"),
        plot.title=element_text(vjust=3, size=15),
        panel.grid.major=element_line(color = pal))

Item 17 has the highest yearly sales, and item 26 the lowest.

Prophet model

Introduction to Prophet

Prophet forecasts time series data on the basis of an additive model. The model fits a trend together with yearly, weekly, and daily seasonality and holiday effects. It is particularly effective for time series with strong seasonal effects. Prophet is also robust to missing data and to shifts in the trend. A Prophet model is defined as

$$\hat{y}=trend\times(1+multiplicative\,terms)+additive\,terms$$
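
For orientation, when every seasonal and holiday component is additive (Prophet's default mode), the multiplicative terms vanish and the formula reduces to the decomposition given in the Prophet paper by Taylor and Letham. The symbols $g$, $s$, $h$ and $\epsilon_t$ below follow that paper and do not appear elsewhere in this post.

```latex
% Additive special case: multiplicative terms = 0
y(t) = g(t) + s(t) + h(t) + \epsilon_t
% g(t): trend (piecewise linear or logistic, with changepoints)
% s(t): periodic seasonality (yearly, weekly, daily)
% h(t): holiday effects
% \epsilon_t: error term
```

Here $g(t)$ plays the role of $trend$ and $s(t)+h(t)$ the role of the additive terms.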

Baseline Prophet model

We now build a baseline Prophet model for store = 1, item = 1, that is, we forecast the sales of item 1 at store 1. Before fitting, the sales observations are log-transformed with log1p.

train_final_store1_item1=subset(train,train$store==1 & train$item==1)
stats=data.frame(y=log1p(train_final_store1_item1$sales),ds=train_final_store1_item1$date)
stats=aggregate(stats$y,by=list(stats$ds),FUN=sum)
head(stats)

colnames(stats)<- c("ds","y")
model_prophet = prophet(stats)
summary(model_prophet)
future = make_future_dataframe(model_prophet, periods = 90)
forecast = predict(model_prophet, future)
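
Because the model is fitted to `log1p(sales)`, its predictions are on the log scale and need to be mapped back with `expm1()` before they can be read as sales. The sketch below demonstrates the round trip on made-up counts; `yhat_log` is a stand-in for model output, not an actual Prophet forecast.

```r
# Hypothetical sales counts, including a zero-sales day
sales <- c(0, 13, 11, 14, 20)

# Forward transform used when building the training frame
y_log <- log1p(sales)

# Stand-in for model output on the log scale (here simply the inputs)
yhat_log <- y_log

# Back-transform to the original unit of sales
yhat_sales <- expm1(yhat_log)
print(yhat_sales)
```

With the objects from this post, the same step would be `expm1(forecast$yhat)`, and likewise for `forecast$yhat_lower` and `forecast$yhat_upper`.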

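Next, we visualize the changepoints of our Prophet model.

add_changepoints_to_plot <- function(m, threshold = 0.01, cp_color = "red",
                                     cp_linetype = "dashed", trend = TRUE, ...) {
  layers <- list()
  if (trend) {
    # Overlay the fitted trend line
    trend_layer <- ggplot2::geom_line(ggplot2::aes_string("ds", "trend"), color = cp_color, ...)
    layers <- append(layers, trend_layer)
  }
  # Keep only changepoints whose rate change exceeds the threshold
  signif_changepoints <- m$changepoints[abs(m$params$delta) >= threshold]
  cp_layer <- ggplot2::geom_vline(xintercept = as.integer(signif_changepoints), color = cp_color,
                                  linetype = cp_linetype, ...)
  layers <- append(layers, cp_layer)
  return(layers)
}
plot(model_prophet, forecast)+ add_changepoints_to_plot(model_prophet)

The red lines in the plot mark the changepoints. Several of them should be removed, since they are a result of the model overfitting.

Next we inspect the model components:

prophet_plot_components(model_prophet, forecast)

Sales drop markedly from Sunday to Monday, which suggests a holiday-type effect in the sales data. There is also a sales peak in July, which points to a holiday period or to discount promotions. Taken together, these observations suggest tuning the Prophet parameters: try to eliminate spurious changepoints and include holiday effects.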
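
One possible way to act on this is sketched below. It is an illustration under stated assumptions, not the tuned model: the `holidays` table (a made-up "july_promo" event on 4 July) and the synthetic `toy_stats` series are hypothetical, and `changepoint.prior.scale = 0.01` is an arbitrary value below Prophet's default of 0.05, chosen only to show the direction of the change (a stiffer trend with fewer effective changepoints). The fit runs only if the prophet package is installed.

```r
# Hypothetical holiday table: Prophet expects columns `holiday` and `ds`;
# lower_window / upper_window extend the effect around each date
holidays <- data.frame(
  holiday = "july_promo",
  ds = as.Date(c("2013-07-04", "2014-07-04", "2015-07-04",
                 "2016-07-04", "2017-07-04")),
  lower_window = -1,
  upper_window = 1
)

# Synthetic daily series standing in for `stats` (columns ds, y)
set.seed(2018)
ds <- seq(as.Date("2013-01-01"), as.Date("2017-12-31"), by = "day")
toy_stats <- data.frame(ds = ds, y = log1p(rpois(length(ds), lambda = 20)))

if (requireNamespace("prophet", quietly = TRUE)) {
  suppressMessages(library(prophet))
  # Smaller changepoint.prior.scale => less flexible trend
  m <- prophet(toy_stats,
               holidays = holidays,
               changepoint.prior.scale = 0.01,
               daily.seasonality = FALSE)
  future <- make_future_dataframe(m, periods = 90)
  fc <- predict(m, future)
  print(tail(fc[, c("ds", "yhat", "yhat_lower", "yhat_upper")]))
}
```

Whether a smaller prior scale or an explicit holiday table actually improves the forecast would need to be checked on held-out data.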
