Adventure Works Cycles Case Study

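Project Background

Adventure Works Cycles is the fictitious large multinational manufacturing company on which the AdventureWorks sample database is based. The company manufactures metal and composite bicycles and sells them in the North American, European, and Asian markets. It is headquartered in Bothell, Washington, employs 290 people, and has several regional sales teams active around the world.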

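Sales Report Showcase

1. Products

As a bicycle manufacturer, Adventure Works Cycles offers four categories of products:

Bicycles manufactured by Adventure Works Cycles.
Bicycle components (replacement parts), for example wheels, pedals, or brake assemblies.
Bicycle apparel purchased from vendors and resold to Adventure Works Cycles customers.
Bicycle accessories purchased from vendors and resold to Adventure Works Cycles customers.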

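2. Sales channels

Adventure Works Cycles has two types of customers, corresponding to two sales channels:

Online channel. Individuals buy products from the Adventure Works Cycles online store.
Offline reseller channel. Retail or wholesale stores buy products from Adventure Works Cycles sales representatives and resell them.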

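Requirements Analysis

Use the existing data to monitor product sales and to obtain sales trends and regional distribution, so as to give the sales and marketing departments and the production department actionable guidance, with the aim of raising revenue by increasing sales share and reducing production and inventory costs.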

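Implementation

1. Data processing

Data source

The data comes from the Adventure Works Cycles sample database and consists of 30 csv files and 1 sql file. The csv files have no header row and use "|" as the delimiter; the sql file contains the CREATE TABLE statements for all of the csv files.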

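Converting the data format

The Hive tables created later declare "," as the field delimiter (Hive's own default field delimiter is the \001 control character), so Python is used to change the delimiter of the csv files.

os is Python's interface module to the operating system (such as Windows), and os.walk() is a simple, easy-to-use traverser for files and directories.

root is the path of the folder currently being traversed
dirs is a list of the names of the directories directly inside that folder (not those in subdirectories)
files is likewise a list, containing the names of the files directly inside that folder (not those in subdirectories)
os.path.join(root, file) joins the folder path and the file name into the file's full path
pd.read_csv reads a file, and DataFrame.to_csv writes it out in the converted format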

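import os

import pandas as pd

from_path = r'G:\python_learning\adventure_data'
to_path = r'.\data'
if not os.path.exists(to_path):
    os.mkdir(to_path)

# Walk the source folder to get the path of every file
for root, dirs, files in os.walk(from_path):
    for file in files:
        file_path = os.path.join(root, file)
        # Read everything as strings (dtype=str): otherwise integer columns that
        # contain missing values are read as floats. 'null' is treated as missing.
        try:
            df = pd.read_csv(file_path, sep='|', encoding='utf-16LE',
                             header=None, na_values='null', dtype=str)
        except Exception as e:
            # Skip files that cannot be parsed as '|'-delimited UTF-16LE text
            print('skipped %s: %s' % (file_path, e))
            continue
        # to_csv writes ',' as the separator by default; na_rep='null' writes
        # missing values as the string 'null' instead of an empty field
        tofile_path = os.path.join(to_path, file)
        df.to_csv(tofile_path, index=False, header=False,
                  float_format=None, na_rep='null')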
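
As a cross-check on the conversion above, the same "|" to "," rewrite can be sketched with only the standard library. This is a minimal illustration, not the script used in the project: the function name convert_delimiter and the sample rows are hypothetical, and empty fields are assumed to stand for missing values.

```python
import csv
import io


def convert_delimiter(text, src="|", dst=",", null_token="null"):
    """Rewrite delimited text from src to dst, writing null_token for empty fields."""
    reader = csv.reader(io.StringIO(text), delimiter=src)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=dst, lineterminator="\n")
    for row in reader:
        writer.writerow([field if field != "" else null_token for field in row])
    return out.getvalue()


# Hypothetical sample rows in the source layout: no header, '|' as delimiter
sample = "1|USD|US Dollar\n2||Euro\n"
converted = convert_delimiter(sample)
print(converted)
```

A field that itself contains a comma is quoted by the csv writer. A plain delimited-text Hive table would most likely not interpret those quotes, so such columns are worth checking before loading.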

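2. Importing the data into Hive

Create a create_table.txt file and extract the CREATE TABLE statements from the sql file into it
Use Python regular expressions to parse create_table.txt and store the table names and column names of the 30 tables in a dictionary
Create create_table.sh by walking the dictionary and writing a Hive CREATE TABLE statement for each table
Create the data-loading script load_adventure_data.sh
Run the shell scripts on Linux to create the Hive tables and load the data

The CREATE TABLE statements in create_table.txt look like the following: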

CREATE TABLE [dbo].[DimCurrency](
    [CurrencyKey] [int] IDENTITY(1,1) NOT NULL,
    [CurrencyAlternateKey] [nchar](3) NOT NULL,
    [CurrencyName] [nvarchar](50) NOT NULL
) ON [PRIMARY];
GO

Parsing the CREATE TABLE statements with regular expressions

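import re

table_info = {}
with open(r"./create_table.txt", encoding='utf-16LE') as create_file:
    content = create_file.readline()
    # readline() returns '' only at end of file (a blank line is '\n')
    while len(content) != 0:
        table_name = ''
        table_columns = []
        # Read until the "GO" that ends the current CREATE TABLE statement
        # (or until end of file, so a missing final "GO" cannot loop forever)
        while content and "GO" not in content:
            if "CREATE TABLE" in content.upper():
                # Table name: the second bracketed group in [dbo].[TableName]
                table_match = re.search(r"\[(.*?)\]\.\[(.*?)\]", content, re.I)
                if table_match:
                    table_name = table_match.group(2)
            # Column name and type: [ColumnName] [type] ...; lstrip() drops leading whitespace
            column_match = re.search(r"\[(.*?)\] \[(.*?)\].*", content.lstrip(), re.I)
            if column_match:
                column = column_match.group(1)
                # A column named "date" is renamed to "date_time"
                if column.upper() == "DATE":
                    column = "date_time"
                column_type = column_match.group(2)
                table_columns.append([column, column_type])
            # Next line of the same statement
            content = create_file.readline()
        # One statement finished: record its columns under the table name
        if table_name:
            table_info[table_name] = table_columns
        # Move on to the line after "GO"
        content = create_file.readline()
table_info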

{'DimAccount': [['AccountKey', 'int'],['ParentAccountKey', 'int'],['AccountCodeAlternateKey', 'int'],['ParentAccountCodeAlternateKey', 'int']...
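
To make the parsing step easier to follow, here is a minimal sketch that applies the same two regular expressions to an in-memory copy of the DimCurrency statement instead of reading create_table.txt. The function name parse_ddl is hypothetical, and the sketch assumes each column sits on its own line and each statement ends with a GO line.

```python
import re

SAMPLE_DDL = """CREATE TABLE [dbo].[DimCurrency](
    [CurrencyKey] [int] IDENTITY(1,1) NOT NULL,
    [CurrencyAlternateKey] [nchar](3) NOT NULL,
    [CurrencyName] [nvarchar](50) NOT NULL
) ON [PRIMARY];
GO
"""


def parse_ddl(text):
    """Return {table_name: [[column, type], ...]} for GO-terminated statements."""
    tables = {}
    name, columns = "", []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper() == "GO":
            # End of one statement: store it and reset the accumulators
            if name:
                tables[name] = columns
            name, columns = "", []
            continue
        if "CREATE TABLE" in stripped.upper():
            table_match = re.search(r"\[(.*?)\]\.\[(.*?)\]", stripped)
            if table_match:
                name = table_match.group(2)
        # "[Column] [type]" only matches column lines, not the CREATE TABLE line
        column_match = re.search(r"\[(.*?)\] \[(.*?)\]", stripped)
        if column_match:
            columns.append([column_match.group(1), column_match.group(2)])
    return tables


parsed = parse_ddl(SAMPLE_DDL)
print(parsed)
```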

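Creating the Hive table-creation script
1. hive -e "sql statement" runs the SQL statement without entering Hive's interactive shell; hive -v is verbose mode, which additionally prints the SQL statements being executed.
2. #!/bin/sh is the shebang line of a shell script: the first line of the script file uses a comment-like form (#!) to name the program that runs the script.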

# Walk the table_info dictionary and write the table-creation script
shell_file = open(r"create_table.sh", "w")
shell_file.write("#!/bin/sh\n\nhive -v -e \"\nuse adventure_ods;\n\n")
for key in table_info.keys():
    # Opening line of the statement
    shell_file.write("create table if not exists  " + key + "(\n")
    # Column lines: every column is created as string, with a comma after all but the last
    columns = table_info[key]
    for i, column in enumerate(columns):
        separator = "\n" if i == len(columns) - 1 else ",\n"
        shell_file.write("  " + column[0] + "      string" + separator)
    # Storage format
    shell_file.write(")row format delimited fields terminated by ',' stored as textfile;\n\n")
shell_file.write("\";")
shell_file.close()
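
The statement that the generator writes for each table can be previewed without touching the file system. The sketch below is only an illustration under that assumption: build_create_statement and sample_info are hypothetical names, and the two-column table is a cut-down stand-in for a real table_info entry.

```python
def build_create_statement(table, columns):
    """Return a Hive CREATE TABLE statement with every column typed as string."""
    column_lines = ",\n".join("  " + name + " string" for name, _sql_type in columns)
    return (
        "create table if not exists " + table + "(\n"
        + column_lines + "\n"
        + ")\nrow format delimited fields terminated by ',' stored as textfile;"
    )


# Hypothetical, shortened table_info entry
sample_info = {"DimCurrency": [["CurrencyKey", "int"], ["CurrencyName", "nvarchar"]]}
statement = build_create_statement("DimCurrency", sample_info["DimCurrency"])
print(statement)
```

Joining the column lines with ",\n" puts a comma after every column except the last, so no comparison against the last element is needed.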

Creating the Hive data-loading script

# Write the data-loading statements
load_file = open("load_adventure_data.sh", "w")
load_file.write("#!/bin/sh\n\nhive -v -e \"\nuse adventure_ods;\n\n")
for key in table_info.keys():
    load_file.write("load data local inpath '/opt/module/adventure_data/%s.csv' overwrite into table %s;\n" % (key, key))
load_file.write("\n\";")
load_file.close()

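Creating the Hive tables and loading the data

Log in to Linux, upload the csv data files, and create a Hive database named adventure_ods to hold the base-layer data

hive -e "create database adventure_ods;"

Run the .sh files to create the tables and load the data

# Run the shell scripts on Linux with: sh <script>.sh
sh create_table.sh
sh load_adventure_data.sh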

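3. Data exploration and building the data warehouse

Data exploration

The adventure_ods database contains 29 base tables, which fall broadly into dimension tables and fact tables. Each dimension table has one attribute as its primary key, and these dimension keys together form the key of the fact table. The two kinds of tables are joined through these keys and form a star schema.

Dimension tables: each dimension table has one attribute as its primary key and describes the properties of that dimension. For example, the geography dimension table holds descriptive information such as geography id, city, state/province code, state/province name, and country/region code; the product dimension table holds product id, product name, color, size, weight, and so on.

Fact tables: a fact table generally holds numeric values or other data that can be calculated and aggregated. For example, the online sales fact table holds customer id, order date, order number, order amount, order quantity, and so on.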
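
The relationship between the two kinds of tables can be sketched in a few lines of Python. The rows below are invented for illustration and are not taken from the sample database; the point is only that fact rows carry a dimension key plus measures, and the descriptive attributes are looked up through that key.

```python
# Dimension table: one row per product, keyed by productkey (invented rows)
dim_product = {
    1: {"name": "Road Bike", "color": "Red"},
    2: {"name": "Helmet", "color": "Black"},
}

# Fact table: each row references a dimension key and carries numeric measures
fact_sales = [
    {"productkey": 1, "orderquantity": 1, "salesamount": 1200.0},
    {"productkey": 2, "orderquantity": 2, "salesamount": 70.0},
    {"productkey": 1, "orderquantity": 1, "salesamount": 1150.0},
]


def sales_by_product(facts, products):
    """Sum salesamount per product name by looking each key up in the dimension."""
    totals = {}
    for row in facts:
        name = products[row["productkey"]]["name"]
        totals[name] = totals.get(name, 0.0) + row["salesamount"]
    return totals


totals = sales_by_product(fact_sales, dim_product)
print(totals)
```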

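Table information

Defining the analysis goals

1) Based on the project requirements and the available data, the goal of the analysis is to present product sales;

2) Consolidate the data in the warehouse, build an E-R diagram, and work out how the sales fact tables relate to the dimension tables;

3) Build a system of sales-related metrics.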

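Table relationships

E-R diagram

The E-R diagram makes the relationships between the fact tables and the dimension tables easy to see. In the online channel's fact table factinternetsales, fields such as productkey, customerkey, and salesterritorykey link to dimension tables; the three product-related dimension tables are linked to one another, as are the two region-related dimension tables.

The factresellersales fact table differs from factinternetsales in that every online order is sold directly to the end customer, so it links to the dimcustomer dimension table through customerkey. Offline sales go through resellers, and every order records the reseller and the salesperson, so it links to dimreseller through resellerkey and to dimemployee through employeekey.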

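Metric system

1) Analysis dimensions:

Time: year, quarter, month, week, day
Region: sales territory group, country, state/province, city
Product: product category, product subcategory
Customer
Reseller

2) Analysis metrics:

Total sales
Total orders
Total cost
Total profit = total sales - total cost
Profit margin = total profit / total sales
Average sales per customer = total sales / number of customers
Sales and order counts along each dimension (time, region, product)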
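
The metric definitions above translate directly into code. The following is a minimal sketch on invented rows; the field names customer, order, sales, and cost are placeholders rather than columns of the warehouse tables.

```python
# Invented order rows: two customers, three orders
orders = [
    {"customer": "C1", "order": "SO1", "sales": 100.0, "cost": 60.0},
    {"customer": "C1", "order": "SO2", "sales": 50.0, "cost": 30.0},
    {"customer": "C2", "order": "SO3", "sales": 150.0, "cost": 90.0},
]


def sales_metrics(rows):
    """Compute the headline metrics from a list of order rows."""
    total_sales = sum(r["sales"] for r in rows)
    total_cost = sum(r["cost"] for r in rows)
    total_profit = total_sales - total_cost
    customers = {r["customer"] for r in rows}
    return {
        "total_sales": total_sales,
        "total_orders": len({r["order"] for r in rows}),
        "total_cost": total_cost,
        "total_profit": total_profit,
        "profit_margin": total_profit / total_sales,
        "sales_per_customer": total_sales / len(customers),
    }


metrics = sales_metrics(orders)
print(metrics)
```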

Building the data warehouse

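The data warehouse is designed in two layers. The ODS base layer holds the base data, that is, the data loaded earlier with the shell scripts into the adventure_ods database. The DW summary layer holds the data produced by processing the base layer.

Starting from the actual business, the sections above analyzed how the internet sales fact table (factinternetsales) and the reseller sales fact table (factresellersales) relate to the dimension tables, and listed the key sales metrics. The next step is to build a summary layer to hold the processed dimension tables and the new sales summary table.

# Create the summary-layer database adventure_dw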
hive -e "create database adventure_dw;"

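Summary tables

The base data contains a lot of redundant data, which slows down loading and processing. The processing here therefore works on two levels: joining dimension tables of the same kind to reduce the number of tables, and filtering to keep only the key fields needed for analysis.

In addition, the internet sales fact table (factinternetsales) and the reseller sales fact table (factresellersales) are consolidated: the fields needed for analysis (sales amount, product standard cost, freight, tax, and so on) are extracted and new fields (cost, profit, and so on) are created, so that online and offline sales can be analyzed together.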

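Combining the product dimension tables

Join the three product-related dimension tables: the product dimension table (dimproduct), the product subcategory dimension table (dimproductsubcategory), and the product category dimension table (dimproductcategory). Keep the fields that are needed: product id, product name, product category id, product category name, product subcategory id, and product subcategory name.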

create table product_dw as
select
a.productkey,a.englishproductname,
b.productcategorykey,c.englishproductcategoryname,
a.productsubcategorykey,b.englishproductsubcategoryname
from
adventure_ods.dimproduct a
left join
adventure_ods.dimproductsubcategory b
on a.productsubcategorykey=b.productsubcategorykey
left join
adventure_ods.dimproductcategory c
on b.productcategorykey= c.productcategorykey;

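Combining the region dimension tables

Join the two region-related dimension tables: the sales territory dimension table (dimsalesterritory) and the geography dimension table (dimgeography). Keep the fields that are needed: territory id, sales territory group, sales country, sales region, state/province, geography id, and city.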

create table territory_dw as
select a.salesterritorykey,a.salesterritorygroup,a.salesterritorycountry,a.salesterritoryregion,b.stateprovincename,b.geographykey,b.city
from
adventure_ods.dimsalesterritory a
left join
adventure_ods.dimgeography b
on a.salesterritorykey=b.salesterritorykey;

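Processing the customer dimension table

Customer dimension table (dimcustomer): customer id, geography id, gender, and marital status. The birth date (birthdate) field is also processed to produce a new field, age.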

create table customer_dw as
select
customerkey,geographykey,gender,maritalstatus,
year(current_date())-year(birthdate) as age
from adventure_ods.dimcustomer a;
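
The age expression in the query subtracts calendar years only, so it counts a customer as one year older than their age in full years until their birthday has passed in the current year. The sketch below contrasts the two calculations; the function names and the dates are hypothetical, chosen only to show the difference.

```python
from datetime import date


def age_by_year_difference(birthdate, today):
    """Same arithmetic as year(current_date()) - year(birthdate) in the query."""
    return today.year - birthdate.year


def age_in_full_years(birthdate, today):
    """Subtract one year if this year's birthday has not happened yet."""
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)


# Hypothetical dates: the birthday (15 September) falls after "today" (1 March)
today = date(2020, 3, 1)
birth = date(1990, 9, 15)
print(age_by_year_difference(birth, today), age_in_full_years(birth, today))
```

For coarse age bands the year difference is usually good enough; the second form matters only when exact age is needed.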

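Processing the reseller dimension table

Reseller dimension table (dimreseller): reseller id, geography id, reseller name, annual sales, annual revenue, and year opened.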

create table reseller_dw as
select resellerkey,resellername,geographykey,
annualsales,annualrevenue,yearopened
from adventure_ods.dimreseller;

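Combining the fact tables

The sales fact tables link to the region dimension through salesterritorykey. Joining the two directly on that key is a one-to-many relationship, because the combined region table holds many geography rows per sales territory, and it produces a partial Cartesian product. The customer dimension table, by contrast, links to the region table through geographykey, and each customer matches exactly one row. The customer dimension table is therefore joined with the combined region table and the online sales fact table first, giving the sales table customer_sales. In the same way, the reseller dimension table is joined with the combined region table and the reseller sales fact table, giving the sales table reseller_sales.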
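
The row multiplication described above is easy to reproduce on a toy example. In the sketch below all rows are invented and the two function names are hypothetical; it joins one order to a three-row region table first on salesterritorykey and then through the customer's geographykey.

```python
# Invented rows: sales territory 1 covers two geography rows
territory_dw = [
    {"salesterritorykey": 1, "geographykey": 10, "city": "Seattle"},
    {"salesterritorykey": 1, "geographykey": 11, "city": "Bothell"},
    {"salesterritorykey": 2, "geographykey": 20, "city": "Toronto"},
]
customers = {100: {"geographykey": 11}}
orders = [{"salesordernumber": "SO1", "customerkey": 100, "salesterritorykey": 1}]


def join_on_territory(order_rows, territory_rows):
    """Join on salesterritorykey: one output row per matching geography row."""
    return [
        (o["salesordernumber"], t["city"])
        for o in order_rows
        for t in territory_rows
        if o["salesterritorykey"] == t["salesterritorykey"]
    ]


def join_via_geography(order_rows, customer_rows, territory_rows):
    """Join through the customer's geographykey: exactly one row per order."""
    by_geography = {t["geographykey"]: t for t in territory_rows}
    return [
        (o["salesordernumber"],
         by_geography[customer_rows[o["customerkey"]]["geographykey"]]["city"])
        for o in order_rows
    ]


fanned_out = join_on_territory(orders, territory_dw)
one_to_one = join_via_geography(orders, customers, territory_dw)
print(fanned_out)
print(one_to_one)
```

The first join returns the single order twice, which would double its sales amount in any sum; the second returns it once.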

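Create the sales summary table sales_total_dw by using union all to combine the online sales table customer_sales and the reseller sales table reseller_sales. (Note that the two tables combined with union all must agree exactly in column names, column count, and column order; otherwise the query fails or the columns end up misaligned.) A new label, sales_channel, is added, set to "internet" for online sales and "reseller" for offline sales.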

-- Create the online sales table customer_sales
create table customer_sales as select
a.productkey,substr(a.orderdate,1,10) orderdate,
a.salesordernumber,a.orderquantity,a.unitprice,a.extendedamount,a.unitpricediscountpct,a.discountamount,a.productstandardcost,a.totalproductcost,a.salesamount,a.taxamt,a.freight,
b.customerkey,
c.*
from adventure_ods.factinternetsales a join customer_dw b
on a.customerkey=b.customerkey
join territory_dw c
on b.geographykey = c.geographykey;

-- Create the reseller sales table reseller_sales
create table reseller_sales as select
a.productkey,substr(a.orderdate,1,10) orderdate,
a.salesordernumber,a.orderquantity,a.unitprice,a.extendedamount,a.unitpricediscountpct,a.discountamount,a.productstandardcost,a.totalproductcost,a.salesamount,a.taxamt,a.freight,
b.resellerkey,b.resellername,
c.*
from adventure_ods.factresellersales a join reseller_dw b
on a.resellerkey=b.resellerkey
join territory_dw c
on b.geographykey = c.geographykey;

-- Combine the online sales table customer_sales and the reseller sales table reseller_sales
create table sales_total_dw
as select
a.productkey,...,
'null' as resellerkey,'null' as resellername,
'internet' as sales_channel,
...
from customer_sales a
union all
select
b.productkey,...,
'null' as customerkey,
b.resellerkey,b.resellername,
'reseller' as sales_channel,
...
from reseller_sales b;
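
The column alignment that union all requires can be illustrated outside Hive. This is a minimal sketch on invented rows, with a deliberately shortened column list; COLUMNS and to_common_row are hypothetical names. Each source row is mapped onto the same ordered column list, the missing key is filled with the string 'null' as in the query, and the channel label is attached.

```python
# Shortened, hypothetical column list shared by both halves of the union
COLUMNS = ["productkey", "salesamount", "customerkey", "resellerkey", "sales_channel"]


def to_common_row(row, channel):
    """Map a source row onto COLUMNS, filling absent columns with 'null'."""
    tagged = dict(row, sales_channel=channel)
    return tuple(tagged.get(col, "null") for col in COLUMNS)


# Invented rows standing in for customer_sales and reseller_sales
customer_sales = [{"productkey": 1, "salesamount": 100.0, "customerkey": 11000}]
reseller_sales = [{"productkey": 2, "salesamount": 250.0, "resellerkey": 7}]

sales_total = (
    [to_common_row(r, "internet") for r in customer_sales]
    + [to_common_row(r, "reseller") for r in reseller_sales]
)
print(sales_total)
```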

Data cleaning

Change the data types of the date and amount columns in the combined sales table sales_total_dw from string to the appropriate date and float types.
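
The type conversion can be sketched as follows. This is an illustration rather than the statement run in Hive: the helper names are hypothetical, and it assumes that orderdate has the form YYYY-MM-DD after the substr(orderdate,1,10) step and that missing values are stored as the string 'null'.

```python
from datetime import date, datetime


def parse_date(value):
    """Convert a 'YYYY-MM-DD' string to a date; 'null' becomes None."""
    if value is None or value == "null":
        return None
    return datetime.strptime(value, "%Y-%m-%d").date()


def parse_amount(value):
    """Convert a numeric string to float; 'null' becomes None."""
    if value is None or value == "null":
        return None
    return float(value)


# Invented string-typed row, as it would look before cleaning
raw = {"orderdate": "2013-07-01", "salesamount": "3578.27", "freight": "null"}
clean = {
    "orderdate": parse_date(raw["orderdate"]),
    "salesamount": parse_amount(raw["salesamount"]),
    "freight": parse_amount(raw["freight"]),
}
print(clean)
```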

4. Visualization

Connecting to the Hive database

Start Hadoop

# From the Hadoop installation directory, start Hadoop with this Linux command
sbin/start-dfs.sh

Start hiveserver2

# From the Hive installation directory, start hiveserver2 with this Linux command
bin/hive --service hiveserver2

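Open Tableau and use the Cloudera Hadoop connector to connect remotely to hiveserver2

Building the visual reports

1). Joining the tables

Join the processed sales summary table sales_total_dw with the combined product dimension table, as follows:

2). Extracting the data

Once Tableau has fetched the data over the remote Hive connection, the joined tables can be saved as a data extract. After that no remote connection is needed, and the workbook can be opened and used locally. This avoids the tedious remote-connection procedure and also speeds up data processing: the remote database is a shared one that handles a large volume of computation and queries at the same time, which greatly reduces processing speed, and remote connections also add latency.

3). Building the reports

Visualization tools: the chart types used here include line charts, bar charts, combined line-and-bar charts, donut charts, stacked charts, maps, and dashboards. Legends, axes, and columns can be chosen as needed, as can the way the data is aggregated.
Calculated fields and quick table calculations are used to aggregate the data.
Filters: filters are set up as needed, here on fields such as date, region, and product category; context filters are added to set the order of precedence among filters; advanced filtering combines buttons with dashboard actions to build a navigation bar and linked, dynamic views.
Dashboards: the layout features are used to arrange the dashboards sensibly according to actual needs, and the action features are used to make them dynamic.

4). Report presentation

The report has 3 pages: the home page, the time trend chart, and the regional distribution chart, as shown below: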

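Reflections and Improvements

The project above is a fictitious business scenario based on the Adventure Works sample database, so the data is stored locally rather than in a database on a server. In a real business setting, the data is updated continuously and stored in the company's base-layer database so that it can be monitored in real time. I therefore made some improvements to the project, starting from a real business scenario, to meet a company's actual needs.

Scheduled deployment on Linux

After each day's update the data sits in the base-layer database, so the statements that build the combined tables can be written into shell scripts and run on a schedule. The tables are then refreshed automatically, which in turn makes it possible to generate daily and weekly reports automatically and keep the data under regular monitoring.

Table-building scripts

Create the scripts create_customer_sales.sh, create_reseller_sales.sh, and create_sales_total_dw.sh, which hold the statements that build the combined tables. For example: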

#!/bin/sh
hive -v -e "
use adventure_dw;
drop table if exists customer_dw;
create table customer_dw as
select
customerkey,geographykey,gender,maritalstatus,
year(current_date())-year(birthdate) as age
from adventure_ods.dimcustomer a;
drop table if exists customer_sales;
create table customer_sales as select
a.productkey,substr(a.orderdate,1,10) orderdate,
a.salesordernumber,a.orderquantity,a.unitprice,a.extendedamount,a.unitpricediscountpct,a.discountamount,a.productstandardcost,a.totalproductcost,a.salesamount,a.taxamt,a.freight,
b.customerkey,
c.*
from adventure_ods.factinternetsales a join customer_dw b
on a.customerkey=b.customerkey
join territory_dw c
on b.geographykey = c.geographykey;
"

…

Writing the runner script

Write a schedule.sh file that runs the scripts above in order:

#!/bin/sh
sh /opt/module/adventure_data/create_customer_sales.sh
sh /opt/module/adventure_data/create_reseller_sales.sh
sh /opt/module/adventure_data/create_sales_total_dw.sh
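
Because create_sales_total_dw.sh depends on the tables built by the two scripts before it, one possible refinement is to stop at the first script that fails. The sketch below shows that idea in Python; it is not part of the original project, run_in_order and its runner parameter are hypothetical, and the example call uses a stand-in runner so that nothing is actually executed.

```python
import subprocess

SCRIPTS = [
    "/opt/module/adventure_data/create_customer_sales.sh",
    "/opt/module/adventure_data/create_reseller_sales.sh",
    "/opt/module/adventure_data/create_sales_total_dw.sh",
]


def run_with_sh(path):
    """Run one script with sh and return its exit code."""
    return subprocess.run(["sh", path]).returncode


def run_in_order(scripts, runner=run_with_sh):
    """Run scripts in order, stopping at the first non-zero exit code.

    Returns the list of scripts that completed successfully.
    """
    completed = []
    for path in scripts:
        if runner(path) != 0:
            break
        completed.append(path)
    return completed


def fake_runner(path):
    """Stand-in runner for illustration: pretends the reseller script fails."""
    return 1 if "reseller" in path else 0


done = run_in_order(SCRIPTS, runner=fake_runner)
print(done)
```

With the stand-in runner only the first script is reported as completed, and the summary script is never started on stale inputs.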

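Scheduled deployment

Set up the schedule on the Linux server so that the code runs automatically at the end of each day (23:59), rebuilding the combined tables and refreshing the data without manual steps. cron is Linux's built-in scheduling tool; the cron daemon runs scheduled tasks, so jobs can run without human intervention.

# Write the scheduled command into the crontab file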
vi /etc/crontab

SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root

# For details see man 4 crontabs

# Example of job definition:
# .---------------- minute (0 - 59)
# |  .------------- hour (0 - 23)
# |  |  .---------- day of month (1 - 31)
# |  |  |  .------- month (1 - 12) OR jan,feb,mar,apr ...
# |  |  |  |  .---- day of week (0 - 6) (Sunday=0 or 7) OR sun,mon,tue,wed,thu,fri,sat
# |  |  |  |  |
# *  *  *  *  * user-name  command to be executed

59 23 * * * root /opt/module/adventure_data/schedule.sh
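
To make the field layout of the job line concrete, here is a minimal sketch that splits a system crontab entry into its named parts. The function name parse_system_crontab_line is hypothetical, and the sketch only separates the fields; it does not validate ranges or expand lists and steps.

```python
FIELD_NAMES = ["minute", "hour", "day_of_month", "month", "day_of_week", "user", "command"]


def parse_system_crontab_line(line):
    """Split an /etc/crontab job line into its seven fields.

    The command is kept whole even if it contains spaces.
    """
    parts = line.split(None, len(FIELD_NAMES) - 1)
    if len(parts) != len(FIELD_NAMES):
        raise ValueError("expected 5 time fields, a user name, and a command")
    return dict(zip(FIELD_NAMES, parts))


entry = parse_system_crontab_line(
    "59 23 * * * root /opt/module/adventure_data/schedule.sh"
)
print(entry)
```

Read this way, the entry runs the command as root at minute 59 of hour 23, on every day of every month.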