CC00053.pbpositions——|HadoopPB级数仓.V11|——PB数仓.v11|核心交易分析|DIM层建表|加载数据|

一、DIM层建表加载数据：DIM层建表加载数据概述

### --- 首先要确定哪些是事实表、哪些是维表。绿色的是事实表，灰色的维表~~~     用什么方式处理维表，每日快照、拉链表？
~~~     小表使用每日快照：产品分类表、商家店铺表、商家地域组织表、支付方式表
~~~     大表使用拉链表：产品信息表

二、DIM层建表：商品分类表

### --- DIM层建表：商品分类表~~~     数据库中的数据是规范的(满足三范式)，但是规范化的数据给查询带来不便。
~~~     备注：这里对商品分类维度表做了逆规范化
~~~     省略了无关信息，做成了宽表

### --- 商品分类表DROP TABLE IF EXISTS dim.dim_trade_product_cat;create table if not exists dim.dim_trade_product_cat(
firstId int, -- 一级商品分类id
firstName string, -- 一级商品分类名称
secondId int, -- 二级商品分类Id
secondName string, -- 二级商品分类名称
thirdId int, -- 三级商品分类id
thirdName string -- 三级商品分类名称
)
partitioned by (dt string)
STORED AS PARQUET;

### --- 实现：hive (default)> select T1.catid, T1.catname, T2.catid, T2.catname, T3.catid, T3.catname
from (select catid, catname, parentid
from ods.ods_trade_product_category
where level=3 and dt='2020-07-01') T3 left join
(select catid, catname, parentid from ods.ods_trade_product_category
where level=2 and dt='2020-07-01') T2 on T3.parentid=T2.catid left join
(select catid, catname, parentid from ods.ods_trade_product_category
where level=1 and dt='2020-07-01')
T1 on T2.parentid=T1.catid;
~~~输出参数
t1.catid    t1.catname  t2.catid    t2.catname  t3.catid    t3.catname
47  时蔬水果、网上菜场   61  进口水果    72  橙柚
47  时蔬水果、网上菜场   61  进口水果    73  苹果
48  厨卫清洁、纸制用品   74  纸制品 75  软包抽纸
47  时蔬水果、网上菜场   61  进口水果    76  凤梨
47  时蔬水果、网上菜场   61  进口水果    77  火龙果
56  虚拟服务、优惠票券   87  教育培训    88  早教幼教

### --- 数据加载：[root@hadoop02 ~]# vim /data/yanqidw/script/trade/dim_load_product_cat.sh

#！/bin/bashsource /etc/profileif [ -n "$1" ]
thendo_date=$1
elsedo_date=`date -d "-1 day" +%F`
fisql="
insert overwrite table dim.dim_trade_product_cat partition(dt='$do_date')
selectt1.catid, -- 一级分类idt1.catname, -- 一级分类名称t2.catid, -- 二级分类idt2.catname, -- 二级分类名称t3.catid, -- 三级分类idt3.catname -- 三级分类名称
from
-- 商品三级分类数据
(select catid, catname, parentid from ods.ods_trade_product_category
where level=3 and dt='$do_date') t3 left join
-- 商品二级分类数据
(select catid, catname, parentid from ods.ods_trade_product_category
where level=2 and dt='$do_date') t2 on t3.parentid = t2.catid left join
-- 商品一级分类数据
(select catid, catname, parentid from ods.ods_trade_product_category
where level=1 and dt='$do_date') t1 on t2.parentid = t1.catid;
"hive -e "$sql"

~~~     # 加载数据到DIM商品分类表[root@hadoop02 ~]# sh /data/yanqidw/script/trade/dim_load_product_cat.sh  2020-07-01

~~~     # 查看DIM商品分类表是否加载到数据hive (default)> show partitions dim.dim_trade_product_cat;
partition
dt=2020-07-01hive (default)> select count(*) from dim.dim_trade_product_cat where dt='2020-07-01' limit 5;
415

三、DIM层建表：商品地域组织表

~~~     DIM层建表：商品地域组织表
~~~     商家店铺表、商家地域组织表 => 一张维表
~~~     这里也是逆规范化的设计，将商家店铺表、商家地域组织表组织成一张表，并拉宽。
~~~     在一行数据中体现：商家信息、城市信息、地域信息。信息中包括 id 和 name ；

### --- 商品地域组织表drop table if exists dim.dim_trade_shops_org;create table dim.dim_trade_shops_org(
shopid int,
shopName string,
cityId int,
cityName string ,
regionId int ,
regionName string
)
partitioned by (dt string)
STORED AS PARQUET;

### --- 实现hive (default)> select T1.shopid, T1.shopname, T2.id cityid, T2.orgname cityname, T3.id regionid, T3.orgname regionname
from
(select shopid, shopname, areaid
from ods.ods_trade_shops
where dt='2020-07-01') T1
left join
(select id, parentid, orgname, orglevel
from ods.ods_trade_shop_admin_org
where orglevel=2 and dt='2020-07-01') T2
on T1.areaid=T2.id
left join
(select id, orgname, orglevel
from ods.ods_trade_shop_admin_org
where orglevel=1 and dt='2020-07-01') T3
on T2.parentid=T3.id
limit 10;
~~~输出参数
t1.shopid   t1.shopname cityid  cityname    regionid    regionname
100050  WSxxx营超市    100225  景德镇市分公司 100006  华北大区
100052  新鲜xxx旗舰店    100236  青岛市分公司  100006  华北大区
100053  华为xxx旗舰店    100011  石家庄市分公司 100006  华北大区
100054  小米xxx旗舰店    100159  石嘴山市分公司 100007  华西大区
100055  苹果xxx旗舰店    100211  六安市分公司  100006  华北大区
100056  OPxxx自营店    100050  阜新市分公司  100006  华北大区
100057  三只xxx鼠零食    100311  东莞市分公司  100008  华南大区
100058  良子xxx铺美食    100329  崇左市分公司  100008  华南大区
100059  乐居xxx日用品    100225  景德镇市分公司 100006  华北大区
100060  同仁xxx大健康    100211  六安市分公司  100006  华北大区

### --- 数据加载[root@hadoop02 ~]# vim /data/yanqidw/script/trade/dim_load_shop_org.sh

#！/bin/bashsource /etc/profileif [ -n "$1" ]
thendo_date=$1
elsedo_date=`date -d "-1 day" +%F`
fisql="
insert overwrite table dim.dim_trade_shops_org partition(dt='$do_date')select t1.shopid,t1.shopname,t2.id as cityid,t2.orgname as cityName,t3.id as region_id,
t3.orgname as region_name from
(select shopId, shopName, areaId from ods.ods_trade_shops where dt='$do_date')
t1 left join
(select id, parentId, orgname, orglevel from ods.ods_trade_shop_admin_org where orglevel=2 and dt='$do_date')
t2 on t1.areaid = t2.id left join
(select id, parentId, orgname, orglevel from ods.ods_trade_shop_admin_org where orglevel=1 and dt='$do_date')
t3 on t2.parentid = t3.id;
"hive -e "$sql"

~~~     # 加载数据到DIM商品低于组织表[root@hadoop02 ~]# sh /data/yanqidw/script/trade/dim_load_shop_org.sh  2020-07-01

~~~     # 查看DIM商品分类表是否加载到数据hive (default)> show partitions dim.dim_trade_shops_org;
partition
dt=2020-07-01hive (default)> select * from dim.dim_trade_shops_org where dt='2020-07-01' limit 5;
dim_trade_shops_org.shopid  dim_trade_shops_org.shopname    dim_trade_shops_org.cityid  dim_trade_shops_org.cityname    dim_trade_shops_org.regionid    dim_trade_shops_org.regionnamedim_trade_shops_org.dt
100050  WSxxx营超市    100225  景德镇市分公司 100006  华北大区    2020-07-01
100052  新鲜xxx旗舰店    100236  青岛市分公司  100006  华北大区    2020-07-01
100053  华为xxx旗舰店    100011  石家庄市分公司 100006  华北大区    2020-07-01
100054  小米xxx旗舰店    100159  石嘴山市分公司 100007  华西大区    2020-07-01
100055  苹果xxx旗舰店    100211  六安市分公司  100006  华北大区    2020-07-01hive (default)> select count(*) from dim.dim_trade_shops_org where dt='2020-07-01' limit 5;
5266

四、DIM层建表：支付方式表

~~~     DIM层建表：支付方式表
~~~     对ODS中表的信息做了裁剪，只保留了必要的信息。

### --- 支付方式表drop table if exists dim.dim_trade_payment;create table if not exists dim.dim_trade_payment(
paymentId string, -- 支付方式id
paymentName string -- 支付方式名称
)
partitioned by (dt string)
STORED AS PARQUET;

### --- 数据加载[root@hadoop02 ~]# vim /data/yanqidw/script/trade/dim_load_payment.sh

#！/bin/bashsource /etc/profileif [ -n "$1" ]
thendo_date=$1
elsedo_date=`date -d "-1 day" +%F`
fisql="
insert overwrite table dim.dim_trade_payment partition(dt='$do_date')
select id, payName from ods.ods_trade_payments
where dt='$do_date';
"hive -e "$sql"

~~~     # 加载数据到DIM支付方式表[root@hadoop02 ~]# sh /data/yanqidw/script/trade/dim_load_payment.sh  2020-07-01

~~~     # 查看DIM商品分类表是否加载到数据hive (default)> show partitions dim.dim_trade_payment;
partition
dt=2020-07-01hive (default)> select * from dim.dim_trade_payment where dt='2020-07-01' limit 10;
dim_trade_payment.paymentid dim_trade_payment.paymentname   dim_trade_payment.dt
1   支付宝 2020-07-01
2   微信支付    2020-07-01
3   余额支付    2020-07-01
4   货到付款    2020-07-01
5   云闪付 2020-07-01
6   美团支付    2020-07-01hive (default)> select count(*) from dim.dim_trade_payment where dt='2020-07-01' limit 5;
6

五、DIM层建表：商品信息表

### --- 使用拉链表对商品信息进行处理。~~~     历史数据 => 初始化拉链表(开始日期：当日；结束日期：9999-12-31)【只执行一次】
~~~     拉链表的每日处理【每次加载数据时处理】
~~~     新增数据。每日新增数据(ODS) => 开始日期：当日；结束日期：9999-12-31
~~~     历史数据。拉链表(DIM) 与 每日新增数据(ODS) 做左连接
~~~     连接上数据。数据有变化，结束日期：当日；
~~~     未连接上数据。数据无变化，结束日期保持不变；

### --- 创建维表
~~~     拉链表要增加两列，分别记录生效日期和失效日期drop table if exists dim.dim_trade_product_info;create table dim.dim_trade_product_info(
`productId` bigint,
`productName` string,
`shopId` string,
`price` decimal,
`isSale` tinyint,
`status` tinyint,
`categoryId` string,
`createTime` string,
`modifyTime` string,
`start_dt` string,
`end_dt` string
) COMMENT '产品表'
STORED AS PARQUET;

### --- 初始数据加载(历史数据加载，只做一次)hive (default)> insert overwrite table dim.dim_trade_product_info
select productId,
productName,
shopId,
price,
isSale,
status,
categoryId,
createTime,
modifyTime,
-- modifyTime非空取modifyTime，否则取createTime；substr取日期
case when modifyTime is not null
then substr(modifyTime, 0, 10)
else substr(createTime, 0, 10)
end as start_dt,
'9999-12-31' as end_dt
from ods.ods_trade_product_info
where dt = '2020-07-12';

~~~     # 查看导入的数据总量hive (default)> select count(*) from dim.dim_trade_product_info;
15807

### --- 增量数据导入(重复执行，每次加载数据执行)[root@hadoop02 ~]# vim /data/yanqidw/script/trade/dim_load_product_info.sh

#！/bin/bashsource /etc/profileif [ -n "$1" ]
thendo_date=$1
elsedo_date=`date -d "-1 day" +%F`
fisql="
insert overwrite table dim.dim_trade_product_info select productId,productName,shopId,price,isSale,status,categoryId,createTime,modifyTime,
case when modifyTime is not null then substr(modifyTime,0,10) else substr(createTime,0,10)
end as start_dt,
'9999-12-31' as end_dt from ods.ods_trade_product_info where dt='$do_date' union all select dim.productId,dim.productName,dim.shopId,dim.price,dim.isSale,dim.status,dim.categoryId,dim.createTime,dim.modifyTime,dim.start_dt,
case when dim.end_dt >= '9999-12-31' and ods.productId is not null
then '$do_date'
else dim.end_dt
end as end_dt
from dim.dim_trade_product_info dim left join
(select *
from ods.ods_trade_product_info
where dt='$do_date' ) ods
on dim.productId = ods.productId
"hive -e "$sql"

~~~     # 加载数据到DIM增量数据导入[root@hadoop02 ~]# sh /data/yanqidw/script/trade/dim_load_product_info.sh  2020-07-01

~~~     # 查看DIM商品分类表是否加载到数据
~~~     查看导入的历史数据总量hive (default)> select count(*) from dim.dim_trade_product_info;
15807hive (default)> select * from dim.dim_trade_product_info limit 5;
dim_trade_product_info.productid    dim_trade_product_info.productname  dim_trade_product_info.shopid   dim_trade_product_info.price    dim_trade_product_info.issale   dim_trade_product_info.status   dim_trade_product_info.categoryid   dim_trade_product_info.createtime   dim_trade_product_info.modifytime   dim_trade_product_info.start_dt dim_trade_product_info.end_dt
100101  四川xxx/个 100056  37  1   0   72  2020-07-12 13:22:22 2020-07-12 13:22:22 2020-07-12  9999-12-31
100102  单果xxx单果 100061  59  1   1   72  2020-07-12 13:22:22 2020-07-12 13:22:22 2020-07-12  9999-12-31
100103  红颜xxx水果 100055  49  1   0   251 2020-07-12 13:22:22 2020-07-12 13:22:22 2020-07-12  9999-12-31
100104  智利xxx水果 100050  78  1   0   72  2020-07-12 13:22:22 2020-07-12 13:22:22 2020-07-12  9999-12-31
100105  Zexxx水果 100062  59  1   1   251 2020-07-12 13:22:22 2020-07-12 13:22:22 2020-07-12  9999-12-31

CC00053.pbpositions——|HadoopPB级数仓.V11|——PB数仓.v11|核心交易分析|DIM层建表|加载数据|相关推荐

CC00068.pbpositions——|HadoopPB级数仓.V02|——|PB数仓.v02|元数据管理工具Atlas|部署配置|账户密码|
一.安装环境准备 ### --- 安装概述说明~~~ 重点讲解Atlas,不对Atlas的依赖组件做讲解,组件均采用单机模式安装. ~~~ 编译才能安装. ### --- 安装依赖~~~ Maven ...
实时数仓与离线数仓总结(一)
精选30+云产品,助力企业轻松上云!>>> 今天主要聊聊数仓的基础知识,分为两篇文章介绍,这是第一篇. 主要内容: 数仓基本概念数仓架构演变实时数仓和离线数仓的区别数仓基本概念 ...
看这篇就明白大数据实时数仓、离线数仓、数据湖之间的关系
数仓架构演变 20世纪70年代,MIT(麻省理工)的研究员致力于研究一种优化的技术架构,该架构试图将业务处理系统和分析系统分开,即将业务处理和分析处理分为不同层次,针对各自的特点采取不同的架构设计原则 ...
数据治理系列：数仓建模之数仓主题与主题域
背景: 数据仓库之父 Bill Inmon 将数据仓库描述为一个面向主题的.集成的.稳定的.反应历史变化的数据集合,用于支持管理者的决策过程. 从上面的引言里面,我们其实可以知道主题在数仓建设里面绝对 ...
数仓调度_数仓调度研究-总论
作为一个IT从业不满三年的菜鸟,可能写类似总结性的文章还是比较费力,但是我还是想尝试写一篇比较出色总结,那么我讲从数据调度N个方面依次详细讲述ETL调度问题. 回答内容如下: 1.什么是数据调度 2. ...
数仓(二)：数仓构建流程、数据中台建设
一.面临的挑战数据中台,目的总结下来一句话:通过服务化的方式增强数据的共享能力以实现数据的复用,解决数据研发.数据分析.数据运营时碰到的痛点问题: 指标口径定义不一致数据研发效率低问题数据质量问 ...
电商离线数仓项目实战（下）
电商离线数仓项目实战(下) 电商分析--核心交易文章目录电商离线数仓项目实战(下) 电商分析--核心交易一.业务需求二.业务数据库表结构 1. 数据库表之间的联系 img 2. 业务数据库-- ...
数仓4.0（三）------数据仓库系统（续）
四:数仓搭建-ODS层首先,先了解一下ODS层的任务即其功能: 1)保持数据原貌不做任何修改,起到备份数据的作用. 2)数据采用LZO压缩,减少磁盘存储空间.100G数据可以压缩到10G以内. 3) ...
数仓学习笔记(4)——数仓搭建（ODS层和DIM层）
目录一.数仓搭建-ODS层 1.ODS层(用户行为数据) 1.1 创建日志表ods_log 1.2 Shell中单引号和双引号区别 1.3 ODS层日志表加载数据脚本 2.ODS层(业务数据) 二. ...

CC00053.pbpositions——|HadoopPB级数仓.V11|——PB数仓.v11|核心交易分析|DIM层建表|加载数据|

CC00053.pbpositions——|HadoopPB级数仓.V11|——PB数仓.v11|核心交易分析|DIM层建表|加载数据|相关推荐

最新文章

热门文章