I. DWD layer analysis

Next we create the dwd layer:
Clean the data in the ods-layer tables, following the data-cleaning rules and adapting them to the actual data.

Note: if a cleaning rule can be expressed in SQL, implement it in SQL. If a rule is very cumbersome or outright impossible in SQL, consider cleaning the data with MapReduce or Spark code instead.

The data we collected here is fairly well-formed and can be handled in SQL, so we implement the cleaning directly in SQL.
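
Since the cleaning here boils down to two rules (deduplicate identical log lines and drop records whose xaid is empty), the idea can be sketched outside Hive with plain shell tools. The sample log lines and the /tmp path below are made up for illustration only:

```shell
# Sample raw log lines (made up): one duplicate and one record with an
# empty xaid.
printf '%s\n' \
  '{"uid":1001,"xaid":"a1b2","acttime":1646790000}' \
  '{"uid":1001,"xaid":"a1b2","acttime":1646790000}' \
  '{"uid":1002,"xaid":"","acttime":1646790100}' > /tmp/sample_log.txt

# sort -u deduplicates whole lines (the role GROUP BY log plays in the SQL
# below), and grep -v drops records whose xaid is empty (the WHERE filter).
sort -u /tmp/sample_log.txt | grep -v '"xaid":""'
```

Only the first record survives: three input lines become one clean output line.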

II. Create the dwd-layer database

Create the database dwd_mall in Hive:

create database dwd_mall;
show databases;

III. Create the dwd-layer tables

Notes:

1. In the raw JSON data the user-id field is named uid, but in the product-order data it is named user_id. Watch out for this: in real projects, individual field names can differ between client-side and server-side data, so it is best to unify them up front for easier use later. Here we parse the data via uid and alias the field as user_id.
2. Hive's timestamp type can only parse data in the yyyy-MM-dd HH:mm:ss format, so we use the bigint type for the acttime field.
3. To make the SQL safe to re-run, use insert overwrite table rather than insert into table; otherwise repeated executions write duplicate data.
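
To illustrate note 2: acttime arrives as a Unix epoch value, which does not match the yyyy-MM-dd HH:mm:ss layout, so we store it as bigint and convert only when a query needs a readable time, e.g. with Hive's from_unixtime(acttime). The same conversion with GNU date (the epoch value is a made-up sample; -u pins the output to UTC so it is reproducible):

```shell
# Convert an epoch value (seconds) into the yyyy-MM-dd HH:mm:ss layout
# that Hive's timestamp type expects.
date -u -d @1646790000 '+%Y-%m-%d %H:%M:%S'   # → 2022-03-09 01:40:00
```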

1. dwd_user_active

Stores the parsed and cleaned user-activity data, deduplicated, with records whose xaid is empty filtered out.

(1) Source table

ods_user_active

(2) CREATE TABLE statement

create external table if not exists dwd_mall.dwd_user_active(
    user_id      bigint,
    xaid         string,
    platform     tinyint,
    ver          string,
    vercode      string,
    net          bigint,
    brand        string,
    model        string,
    display      string,
    osver        string,
    acttime      bigint,
    ad_status    tinyint,
    loading_time bigint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/user_active/';

(3) Field mapping: insert the data

insert overwrite table dwd_mall.dwd_user_active partition(dt='20220309')  select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.ad_status') as ad_status,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_user_active where dt = '20220309' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

2. dwd_click_good

(1) Source table

ods_click_good

(2) CREATE TABLE statement

create external table if not exists dwd_mall.dwd_click_good(
    user_id  bigint,
    xaid     string,
    platform tinyint,
    ver      string,
    vercode  string,
    net      bigint,
    brand    string,
    model    string,
    display  string,
    osver    string,
    acttime  bigint,
    goods_id bigint,
    location tinyint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/click_good/';

(3) Field mapping

insert overwrite table dwd_mall.dwd_click_good partition(dt='20220309')  select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.location') as location
from
(
select log from ods_mall.ods_click_good where dt = '20220309' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

3. dwd_good_item

(1) Source table

ods_good_item

(2) CREATE TABLE statement

create external table if not exists dwd_mall.dwd_good_item(
    user_id      bigint,
    xaid         string,
    platform     tinyint,
    ver          string,
    vercode      string,
    net          bigint,
    brand        string,
    model        string,
    display      string,
    osver        string,
    acttime      bigint,
    goods_id     bigint,
    stay_time    bigint,
    loading_time bigint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_item/';

(3) Field mapping

insert overwrite table dwd_mall.dwd_good_item partition(dt='20220309') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.stay_time') as stay_time,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_good_item where dt = '20220309' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

4. dwd_good_list

(1) Source table

ods_good_list

(2) CREATE TABLE statement

create external table if not exists dwd_mall.dwd_good_list(
    user_id      bigint,
    xaid         string,
    platform     tinyint,
    ver          string,
    vercode      string,
    net          bigint,
    brand        string,
    model        string,
    display      string,
    osver        string,
    acttime      bigint,
    loading_time bigint,
    loading_type tinyint,
    goods_num    tinyint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_list/';

(3) Field mapping

insert overwrite table dwd_mall.dwd_good_list partition(dt='20220309') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.loading_time') as loading_time,
get_json_object(log,'$.loading_type') as loading_type,
get_json_object(log,'$.goods_num') as goods_num
from
(
select log from ods_mall.ods_good_list where dt = '20220309' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

5. dwd_app_close

(1) Source table

ods_app_close

(2) CREATE TABLE statement

create external table if not exists dwd_mall.dwd_app_close(
    user_id  bigint,
    xaid     string,
    platform tinyint,
    ver      string,
    vercode  string,
    net      bigint,
    brand    string,
    model    string,
    display  string,
    osver    string,
    acttime  bigint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/app_close/';

(3) Field mapping

insert overwrite table dwd_mall.dwd_app_close partition(dt='20220309') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime
from
(
select log from ods_mall.ods_app_close where dt = '20220309' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

IV. Extraction scripts for the dwd layer

1. Table-initialization script (run once)

dwd_mall_init_table.sh

Its content:

#!/bin/bash
# Initializes the dwd-layer database and tables; this script only needs
# to be run once.
hive -e "
create database if not exists dwd_mall;

create external table if not exists dwd_mall.dwd_user_active(
    user_id      bigint,
    xaid         string,
    platform     tinyint,
    ver          string,
    vercode      string,
    net          bigint,
    brand        string,
    model        string,
    display      string,
    osver        string,
    acttime      bigint,
    ad_status    tinyint,
    loading_time bigint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/user_active/';

create external table if not exists dwd_mall.dwd_click_good(
    user_id  bigint,
    xaid     string,
    platform tinyint,
    ver      string,
    vercode  string,
    net      bigint,
    brand    string,
    model    string,
    display  string,
    osver    string,
    acttime  bigint,
    goods_id bigint,
    location tinyint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/click_good/';

create external table if not exists dwd_mall.dwd_good_item(
    user_id      bigint,
    xaid         string,
    platform     tinyint,
    ver          string,
    vercode      string,
    net          bigint,
    brand        string,
    model        string,
    display      string,
    osver        string,
    acttime      bigint,
    goods_id     bigint,
    stay_time    bigint,
    loading_time bigint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_item/';

create external table if not exists dwd_mall.dwd_good_list(
    user_id      bigint,
    xaid         string,
    platform     tinyint,
    ver          string,
    vercode      string,
    net          bigint,
    brand        string,
    model        string,
    display      string,
    osver        string,
    acttime      bigint,
    loading_time bigint,
    loading_type tinyint,
    goods_num    tinyint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/good_list/';

create external table if not exists dwd_mall.dwd_app_close(
    user_id  bigint,
    xaid     string,
    platform tinyint,
    ver      string,
    vercode  string,
    net      bigint,
    brand    string,
    model    string,
    display  string,
    osver    string,
    acttime  bigint
) partitioned by (dt string)
row format delimited fields terminated by '\t'
location 'hdfs://bigdata01:9000/data/dwd/app_close/';
"

2. Add-partition script (run once per day)

dwd_mall_add_partition.sh

Its content:

#!/bin/bash
# Cleans data from the ods-layer tables and loads the result into the
# matching partition of the corresponding dwd-layer table.
# Runs once per day, in the early morning.
# Defaults to yesterday's date; a specific date can also be passed as the
# first argument.
if [ "z$1" = "z" ]
then
dt=`date +%Y%m%d --date="1 days ago"`
else
dt=$1
fi

hive -e "
insert overwrite table dwd_mall.dwd_user_active partition(dt='${dt}')  select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.ad_status') as ad_status,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_user_active where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

insert overwrite table dwd_mall.dwd_click_good partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.location') as location
from
(
select log from ods_mall.ods_click_good where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

insert overwrite table dwd_mall.dwd_good_item partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.goods_id') as goods_id,
get_json_object(log,'$.stay_time') as stay_time,
get_json_object(log,'$.loading_time') as loading_time
from
(
select log from ods_mall.ods_good_item where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

insert overwrite table dwd_mall.dwd_good_list partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime,
get_json_object(log,'$.loading_time') as loading_time,
get_json_object(log,'$.loading_type') as loading_type,
get_json_object(log,'$.goods_num') as goods_num
from
(
select log from ods_mall.ods_good_list where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';

insert overwrite table dwd_mall.dwd_app_close partition(dt='${dt}') select
get_json_object(log,'$.uid') as user_id,
get_json_object(log,'$.xaid') as xaid,
get_json_object(log,'$.platform') as platform,
get_json_object(log,'$.ver') as ver,
get_json_object(log,'$.vercode') as vercode,
get_json_object(log,'$.net') as net,
get_json_object(log,'$.brand') as brand,
get_json_object(log,'$.model') as model,
get_json_object(log,'$.display') as display,
get_json_object(log,'$.osver') as osver,
get_json_object(log,'$.acttime') as acttime
from
(
select log from ods_mall.ods_app_close where dt = '${dt}' group by log
) as tmp
where get_json_object(log,'$.xaid') !='';
"

V. Run the scripts

With both scripts in their directory, execute:

sh dwd_mall_init_table.sh
sh dwd_mall_add_partition.sh 20220309
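
Because every load uses insert overwrite, re-running a day is idempotent, so backfilling a range of days is just a loop over dates. A sketch, with echo standing in for the real script call (replace it when running against the cluster):

```shell
# Backfill sketch: one script run per day in [start, end].
start=20220301
end=20220305
d="$start"
while [ "$d" -le "$end" ]; do
  echo "sh dwd_mall_add_partition.sh $d"   # replace echo with the real call
  d=$(date -d "$d + 1 day" +%Y%m%d)        # GNU date: next calendar day
done
```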


VI. Verification

Connect to Hive:

hive
show databases;
use dwd_mall;
show tables;

Check that the tables exist and contain data:

select * from dwd_app_close limit 1;
select * from dwd_click_good limit 1;
select * from dwd_good_item limit 1;
select * from dwd_good_list limit 1;
select * from dwd_user_active limit 1;
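
If you prefer row counts to sampling one record per table, the per-table check can be generated with a loop and piped into hive -e. A sketch; the generated statements must be run on the cluster:

```shell
# Emit a COUNT(*) check for each dwd table; pipe the output to `hive -e`
# (or paste it into a hive session) to run it against the cluster.
for t in dwd_user_active dwd_click_good dwd_good_item dwd_good_list dwd_app_close; do
  echo "select '$t' as tbl, count(*) as cnt from dwd_mall.$t where dt='20220309';"
done
```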
