E-commerce Data Warehouse (DWD Layer)
I. DWD Layer Overview
1. Parse the user-behavior log data.
2. Null-check and filter the core fields.
3. Remodel the business data with a dimensional model, i.e., degenerate (denormalize) snowflaked dimensions into wider tables.
II. DWD Layer: User Behavior Data
2.1 User-behavior start-log table dwd_start_log
1. Data source
ods_start_log -> dwd_start_log
2. Create the table
drop table if exists dwd_start_log;
CREATE EXTERNAL TABLE dwd_start_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`entry` string, `open_ad_type` string, `action` string,
`loading_time` string, `detail` string, `extend1` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_start_log/'
TBLPROPERTIES('parquet.compression'='lzo');
The data is stored as Parquet and compressed with LZO (via parquet.compression). Parquet files are splittable on their own, so there is no need to build a separate index on the data. Parquet also compresses efficiently and, being columnar, queries fast.
3. Load the data
insert overwrite table dwd_start_log
PARTITION (dt='2020-03-10')
select
    get_json_object(line,'$.mid') mid_id, get_json_object(line,'$.uid') user_id,
    get_json_object(line,'$.vc') version_code, get_json_object(line,'$.vn') version_name,
    get_json_object(line,'$.l') lang, get_json_object(line,'$.sr') source,
    get_json_object(line,'$.os') os, get_json_object(line,'$.ar') area,
    get_json_object(line,'$.md') model, get_json_object(line,'$.ba') brand,
    get_json_object(line,'$.sv') sdk_version, get_json_object(line,'$.g') gmail,
    get_json_object(line,'$.hw') height_width, get_json_object(line,'$.t') app_time,
    get_json_object(line,'$.nw') network, get_json_object(line,'$.ln') lng,
    get_json_object(line,'$.la') lat, get_json_object(line,'$.entry') entry,
    get_json_object(line,'$.open_ad_type') open_ad_type, get_json_object(line,'$.action') action,
    get_json_object(line,'$.loading_time') loading_time, get_json_object(line,'$.detail') detail,
    get_json_object(line,'$.extend1') extend1
from ods_start_log where dt='2020-03-10';
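As a quick illustration of what get_json_object does to each line, here is a rough Python stand-in (the sample line, its field subset, and the helper are made up for illustration; real start-log lines carry many more fields):

```python
import json

# One simplified, made-up start-log line as it would sit in ods_start_log.line
line = '{"mid": "m_1001", "uid": "u_42", "vc": "14", "entry": "1"}'

def get_json_object(line: str, path: str):
    """Rough stand-in for Hive's get_json_object with simple '$.key' paths."""
    key = path.lstrip("$.")
    value = json.loads(line).get(key)
    return None if value is None else str(value)

# Each '$.x' path pulls one column out of the JSON line
print(get_json_object(line, "$.mid"), get_json_object(line, "$.uid"))
```

Missing keys come back as null (None here), which is why the DWD layer follows up with null-check filtering on core fields.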
4. DWD-layer start-log load script ods_to_dwd_log.sh
#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to the day before today
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table "$APP".dwd_start_log PARTITION (dt='$do_date')
select
    get_json_object(line,'$.mid') mid_id, get_json_object(line,'$.uid') user_id,
    get_json_object(line,'$.vc') version_code, get_json_object(line,'$.vn') version_name,
    get_json_object(line,'$.l') lang, get_json_object(line,'$.sr') source,
    get_json_object(line,'$.os') os, get_json_object(line,'$.ar') area,
    get_json_object(line,'$.md') model, get_json_object(line,'$.ba') brand,
    get_json_object(line,'$.sv') sdk_version, get_json_object(line,'$.g') gmail,
    get_json_object(line,'$.hw') height_width, get_json_object(line,'$.t') app_time,
    get_json_object(line,'$.nw') network, get_json_object(line,'$.ln') lng,
    get_json_object(line,'$.la') lat, get_json_object(line,'$.entry') entry,
    get_json_object(line,'$.open_ad_type') open_ad_type, get_json_object(line,'$.action') action,
    get_json_object(line,'$.loading_time') loading_time, get_json_object(line,'$.detail') detail,
    get_json_object(line,'$.extend1') extend1
from "$APP".ods_start_log where dt='$do_date';"
$hive -e "$sql"
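The script's date-defaulting logic (use the first argument if present, otherwise yesterday, matching `date -d "-1 day" +%F`) can be mirrored in Python; `resolve_do_date` is a hypothetical helper name used only for this sketch:

```python
from datetime import date, timedelta

def resolve_do_date(argv: list) -> str:
    """Use the first script argument if it is present and non-empty,
    otherwise default to yesterday in %Y-%m-%d form."""
    if len(argv) > 1 and argv[1]:
        return argv[1]
    return (date.today() - timedelta(days=1)).isoformat()

# With an argument: the argument wins; without one: yesterday's date
print(resolve_do_date(["ods_to_dwd_log.sh", "2020-03-10"]))
print(resolve_do_date(["ods_to_dwd_log.sh"]))
```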
2.2 User-behavior event table data
1. Data source and event splitting
2. Create the base event detail table dwd_base_event_log
The detail table stores the per-event detail records converted from the raw ODS-layer table.
(1) Create the table
drop table if exists dwd_base_event_log;
CREATE EXTERNAL TABLE dwd_base_event_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`event_name` string, `event_json` string, `server_time` string
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_base_event_log/'
TBLPROPERTIES('parquet.compression'='lzo');
Note: event_name and event_json hold the event name and the complete event JSON, respectively. This is where the one-to-many raw log line is split into one row per event. To flatten the raw log we need a custom UDF and a custom UDTF.
(2) Custom UDF (base_analizer)
A UDF takes one row in and emits one row out.
A. Approach
B. Code
package com.atguigu.udf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONException;
import org.json.JSONObject;

/**
 * @description: custom UDF for parsing the common fields
 * @author: hyr
 * @time: 2020/4/21 14:00
 */
public class BaseFieldUDF extends UDF {
    public String evaluate(String line, String key) throws JSONException {
        String[] log = line.split("\\|");
        if (log.length != 2 || StringUtils.isBlank(log[1])) {
            return "";
        }
        JSONObject basejson = new JSONObject(log[1].trim());
        String result = "";
        // the server timestamp
        if ("st".equals(key)) {
            result = log[0].trim();
        } else if ("et".equals(key)) {
            // the event array
            if (basejson.has("et")) {
                result = basejson.getString("et");
            }
        } else {
            JSONObject cm = basejson.getJSONObject("cm");
            // the value of the requested common field in 'cm'
            if (cm.has(key)) {
                result = cm.getString(key);
            }
        }
        return result;
    }

    /** for local testing */
    public static void main(String[] args) throws JSONException {
        String line = "1583769785914|{\"cm\":{\"ln\":\"-79.5\",\"sv\":\"V2.3.1\",\"os\":\"8.2.5\",\"g\":\"1MF7Y4W4@gmail.com\",\"mid\":\"0\",\"nw\":\"WIFI\",\"l\":\"en\",\"vc\":\"14\",\"hw\":\"640*1136\",\"ar\":\"MX\",\"uid\":\"0\",\"t\":\"1583695967769\",\"la\":\"-45.9\",\"md\":\"HTC-13\",\"vn\":\"1.1.5\",\"ba\":\"HTC\",\"sr\":\"K\"},\"ap\":\"app\",\"et\":[{\"ett\":\"1583700386001\",\"en\":\"newsdetail\",\"kv\":{\"entry\":\"1\",\"goodsid\":\"0\",\"news_staytime\":\"12\",\"loading_time\":\"15\",\"action\":\"2\",\"showtype\":\"1\",\"category\":\"16\",\"type1\":\"102\"}},{\"ett\":\"1583706290595\",\"en\":\"notification\",\"kv\":{\"ap_time\":\"1583702836945\",\"action\":\"3\",\"type\":\"4\",\"content\":\"\"}},{\"ett\":\"1583681747595\",\"en\":\"active_foreground\",\"kv\":{\"access\":\"1\",\"push_id\":\"3\"}},{\"ett\":\"1583725227310\",\"en\":\"active_background\",\"kv\":{\"active_source\":\"1\"}},{\"ett\":\"1583743888737\",\"en\":\"comment\",\"kv\":{\"p_comment_id\":0,\"addtime\":\"1583697745229\",\"praise_count\":901,\"other_id\":5,\"comment_id\":2,\"reply_count\":163,\"userid\":9,\"content\":\"诸帛咕死添共项饶伞锯产荔讯胆遇卖吱载舟沮稀蓟\"}}]}";
        String mid = new BaseFieldUDF().evaluate(line, "et");
        System.out.println(mid);
    }
}
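For experimenting with the parsing rules outside Hive, the evaluate logic can be sketched in Python (the log line below is trimmed and made up; the helper is only an illustrative stand-in for the Java UDF):

```python
import json

def evaluate(line: str, key: str):
    """Stand-in for BaseFieldUDF.evaluate: 'st' -> server time,
    'et' -> the raw event array, anything else -> a field of 'cm'."""
    parts = line.split("|", 1)
    if len(parts) != 2 or not parts[1].strip():
        return ""
    body = json.loads(parts[1].strip())
    if key == "st":
        return parts[0].strip()
    if key == "et":
        return json.dumps(body["et"]) if "et" in body else ""
    return body.get("cm", {}).get(key, "")

line = '1583769785914|{"cm":{"mid":"0","ln":"-79.5"},"ap":"app","et":[{"en":"comment"}]}'
print(evaluate(line, "st"))   # server timestamp before the '|'
print(evaluate(line, "mid"))  # a common field out of 'cm'
```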
(3) Custom UDTF (flat_analizer)
A UDTF takes one row in and emits multiple rows out.
A. Approach
B. Code
package com.atguigu.udtf;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.json.JSONArray;
import org.json.JSONException;

import java.util.ArrayList;

/**
 * @description: custom UDTF for exploding the event field
 * @author: hyr
 * @time: 2020/4/21 14:44
 */
public class EventJsonUDTF extends GenericUDTF {
    // Declare the names and types of the output columns
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> fieldNames = new ArrayList<>();
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<>();
        fieldNames.add("event_name");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldNames.add("event_json");
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
    }

    // One record in, several records out
    @Override
    public void process(Object[] objects) throws HiveException {
        // The incoming value is the 'et' event array
        String input = objects[0].toString();
        // Blank input is filtered out by returning early
        if (StringUtils.isBlank(input)) {
            return;
        } else {
            try {
                // How many events are there?
                JSONArray ja = new JSONArray(input);
                if (ja == null) {
                    return;
                }
                // Walk through each event
                for (int i = 0; i < ja.length(); i++) {
                    String[] result = new String[2];
                    try {
                        // the event name
                        result[0] = ja.getJSONObject(i).getString("en");
                        // the whole event object
                        result[1] = ja.getString(i);
                    } catch (JSONException e) {
                        continue;
                    }
                    // emit one output row
                    forward(result);
                }
            } catch (JSONException e) {
                e.printStackTrace();
            }
        }
    }

    // Called when there are no more records; used for cleanup or extra output
    @Override
    public void close() throws HiveException {
    }
}
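The UDTF's one-row-in, many-rows-out behavior can likewise be sketched in Python as a generator (again only an illustrative stand-in, with a made-up event array):

```python
import json

def flat_analizer(et_json: str):
    """Stand-in for EventJsonUDTF: one 'et' array string in,
    one (event_name, event_json) pair out per event."""
    if not et_json or not et_json.strip():
        return
    for event in json.loads(et_json):
        if "en" in event:
            yield event["en"], json.dumps(event)

et = '[{"en":"ad","kv":{"action":"1"}},{"en":"comment","kv":{"content":"hi"}}]'
for name, body in flat_analizer(et):
    print(name, body)  # one output row per event in the array
```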
(4) Parse into the base event detail table
insert overwrite table dwd_base_event_log partition(dt='2020-03-10')
select
    base_analizer(line,'mid') as mid_id, base_analizer(line,'uid') as user_id,
    base_analizer(line,'vc') as version_code, base_analizer(line,'vn') as version_name,
    base_analizer(line,'l') as lang, base_analizer(line,'sr') as source,
    base_analizer(line,'os') as os, base_analizer(line,'ar') as area,
    base_analizer(line,'md') as model, base_analizer(line,'ba') as brand,
    base_analizer(line,'sv') as sdk_version, base_analizer(line,'g') as gmail,
    base_analizer(line,'hw') as height_width, base_analizer(line,'t') as app_time,
    base_analizer(line,'nw') as network, base_analizer(line,'ln') as lng,
    base_analizer(line,'la') as lat,
    event_name, event_json,
    base_analizer(line,'st') as server_time
from ods_event_log lateral view flat_analizer(base_analizer(line,'et')) tmp_flat as event_name,event_json
where dt='2020-03-10' and base_analizer(line,'et')<>'';
(5) Event-log parsing script ods_to_dwd_base_log.sh
#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to the day before today
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
INSERT overwrite table "$APP".dwd_base_event_log partition(dt='$do_date')
select
    base_analizer(line,'mid') as mid_id, base_analizer(line,'uid') as user_id,
    base_analizer(line,'vc') as version_code, base_analizer(line,'vn') as version_name,
    base_analizer(line,'l') as lang, base_analizer(line,'sr') as source,
    base_analizer(line,'os') as os, base_analizer(line,'ar') as area,
    base_analizer(line,'md') as model, base_analizer(line,'ba') as brand,
    base_analizer(line,'sv') as sdk_version, base_analizer(line,'g') as gmail,
    base_analizer(line,'hw') as height_width, base_analizer(line,'t') as app_time,
    base_analizer(line,'nw') as network, base_analizer(line,'ln') as lng,
    base_analizer(line,'la') as lat,
    event_name, event_json,
    base_analizer(line,'st') as server_time
from "$APP".ods_event_log lateral view flat_analizer(base_analizer(line, 'et')) tmp_flat as event_name,event_json
where dt='$do_date' and base_analizer(line,'et')<>'';
"
$hive -e "$sql"
3. Product click table dwd_display_log
(1) Create the table
drop table if exists dwd_display_log;
CREATE EXTERNAL TABLE dwd_display_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`action` string, `goodsid` string, `place` string, `extend1` string,
`category` string, `server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_display_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_display_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.goodsid') goodsid, get_json_object(event_json,'$.kv.place') place, get_json_object(event_json,'$.kv.extend1') extend1, get_json_object(event_json,'$.kv.category') category, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='display';
4. Product detail page table dwd_newsdetail_log
(1) Create the table
drop table if exists dwd_newsdetail_log;
CREATE EXTERNAL TABLE dwd_newsdetail_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`entry` string, `action` string, `goodsid` string, `showtype` string,
`news_staytime` string, `loading_time` string, `type1` string, `category` string,
`server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_newsdetail_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_newsdetail_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.entry') entry, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.goodsid') goodsid, get_json_object(event_json,'$.kv.showtype') showtype, get_json_object(event_json,'$.kv.news_staytime') news_staytime, get_json_object(event_json,'$.kv.loading_time') loading_time, get_json_object(event_json,'$.kv.type1') type1, get_json_object(event_json,'$.kv.category') category, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='newsdetail';
5. Product list page table dwd_loading_log
(1) Create the table
drop table if exists dwd_loading_log;
CREATE EXTERNAL TABLE dwd_loading_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`action` string, `loading_time` string, `loading_way` string, `extend1` string,
`extend2` string, `type` string, `type1` string, `server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_loading_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_loading_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.loading_time') loading_time, get_json_object(event_json,'$.kv.loading_way') loading_way, get_json_object(event_json,'$.kv.extend1') extend1, get_json_object(event_json,'$.kv.extend2') extend2, get_json_object(event_json,'$.kv.type') type, get_json_object(event_json,'$.kv.type1') type1, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='loading';
6. Ad table dwd_ad_log
(1) Create the table
drop table if exists dwd_ad_log;
CREATE EXTERNAL TABLE dwd_ad_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`entry` string, `action` string, `contentType` string, `displayMills` string,
`itemId` string, `activityId` string, `server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_ad_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_ad_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.entry') entry, get_json_object(event_json,'$.kv.action') action,get_json_object(event_json,'$.kv.contentType') contentType, get_json_object(event_json,'$.kv.displayMills') displayMills, get_json_object(event_json,'$.kv.itemId') itemId, get_json_object(event_json,'$.kv.activityId') activityId, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='ad';
7. Notification table dwd_notification_log
(1) Create the table
drop table if exists dwd_notification_log;
CREATE EXTERNAL TABLE dwd_notification_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`action` string, `noti_type` string, `ap_time` string, `content` string,
`server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_notification_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_notification_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network,lng, lat, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.noti_type') noti_type, get_json_object(event_json,'$.kv.ap_time') ap_time, get_json_object(event_json,'$.kv.content') content, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='notification';
8. Background-active user table dwd_active_background_log
(1) Create the table
drop table if exists dwd_active_background_log;
CREATE EXTERNAL TABLE dwd_active_background_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`active_source` string, `server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_active_background_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_active_background_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model,brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.active_source') active_source, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='active_background';
9. Comment table dwd_comment_log
(1) Create the table
drop table if exists dwd_comment_log;
CREATE EXTERNAL TABLE dwd_comment_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`comment_id` int, `userid` int, `p_comment_id` int, `content` string,
`addtime` string, `other_id` int, `praise_count` int, `reply_count` int,
`server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_comment_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_comment_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.comment_id') comment_id, get_json_object(event_json,'$.kv.userid') userid, get_json_object(event_json,'$.kv.p_comment_id') p_comment_id, get_json_object(event_json,'$.kv.content') content, get_json_object(event_json,'$.kv.addtime') addtime, get_json_object(event_json,'$.kv.other_id') other_id, get_json_object(event_json,'$.kv.praise_count') praise_count, get_json_object(event_json,'$.kv.reply_count') reply_count, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='comment';
10. Favorites table dwd_favorites_log
(1) Create the table
drop table if exists dwd_favorites_log;
CREATE EXTERNAL TABLE dwd_favorites_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`id` int, `course_id` int, `userid` int, `add_time` string, `server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_favorites_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_favorites_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.id') id, get_json_object(event_json,'$.kv.course_id') course_id, get_json_object(event_json,'$.kv.userid') userid, get_json_object(event_json,'$.kv.add_time') add_time, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='favorites';
11. Like (praise) table dwd_praise_log
(1) Create the table
drop table if exists dwd_praise_log;
CREATE EXTERNAL TABLE dwd_praise_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`id` string, `userid` string, `target_id` string, `type` string,
`add_time` string, `server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_praise_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_praise_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.id') id, get_json_object(event_json,'$.kv.userid') userid, get_json_object(event_json,'$.kv.target_id') target_id, get_json_object(event_json,'$.kv.type') type, get_json_object(event_json,'$.kv.add_time') add_time, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='praise';
12. Error log table dwd_error_log
(1) Create the table
drop table if exists dwd_error_log;
CREATE EXTERNAL TABLE dwd_error_log(
`mid_id` string, `user_id` string, `version_code` string, `version_name` string,
`lang` string, `source` string, `os` string, `area` string, `model` string,
`brand` string, `sdk_version` string, `gmail` string, `height_width` string,
`app_time` string, `network` string, `lng` string, `lat` string,
`errorBrief` string, `errorDetail` string, `server_time` string
)
PARTITIONED BY (dt string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_error_log/'
TBLPROPERTIES('parquet.compression'='lzo');
(2) Load the data
insert overwrite table dwd_error_log PARTITION (dt='2020-03-10')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.errorBrief') errorBrief, get_json_object(event_json,'$.kv.errorDetail') errorDetail, server_time
from dwd_base_event_log
where dt='2020-03-10' and event_name='error';
13. DWD-layer event table load script ods_to_dwd_event_log.sh
#!/bin/bash
# Define variables for easy modification
APP=gmall
hive=/opt/module/hive/bin/hive
# If a date argument is given, use it; otherwise default to the day before today
if [ -n "$1" ] ;then
    do_date=$1
else
    do_date=`date -d "-1 day" +%F`
fi
sql="
insert overwrite table "$APP".dwd_display_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.goodsid') goodsid, get_json_object(event_json,'$.kv.place') place, get_json_object(event_json,'$.kv.extend1') extend1, get_json_object(event_json,'$.kv.category') category, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='display';

insert overwrite table "$APP".dwd_newsdetail_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width,app_time, network, lng, lat, get_json_object(event_json,'$.kv.entry') entry, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.goodsid') goodsid, get_json_object(event_json,'$.kv.showtype') showtype, get_json_object(event_json,'$.kv.news_staytime') news_staytime, get_json_object(event_json,'$.kv.loading_time') loading_time, get_json_object(event_json,'$.kv.type1') type1, get_json_object(event_json,'$.kv.category') category, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='newsdetail';

insert overwrite table "$APP".dwd_loading_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.loading_time') loading_time, get_json_object(event_json,'$.kv.loading_way') loading_way, get_json_object(event_json,'$.kv.extend1') extend1, get_json_object(event_json,'$.kv.extend2') extend2, get_json_object(event_json,'$.kv.type') type, get_json_object(event_json,'$.kv.type1') type1, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='loading';

insert overwrite table "$APP".dwd_ad_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code,version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.entry') entry, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.contentType') contentType, get_json_object(event_json,'$.kv.displayMills') displayMills, get_json_object(event_json,'$.kv.itemId') itemId, get_json_object(event_json,'$.kv.activityId') activityId, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='ad';

insert overwrite table "$APP".dwd_notification_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.action') action, get_json_object(event_json,'$.kv.noti_type') noti_type, get_json_object(event_json,'$.kv.ap_time') ap_time, get_json_object(event_json,'$.kv.content') content, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='notification';

insert overwrite table "$APP".dwd_active_background_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.active_source') active_source, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='active_background';

insert overwrite table "$APP".dwd_comment_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.comment_id') comment_id, get_json_object(event_json,'$.kv.userid') userid, get_json_object(event_json,'$.kv.p_comment_id') p_comment_id, get_json_object(event_json,'$.kv.content') content, get_json_object(event_json,'$.kv.addtime') addtime, get_json_object(event_json,'$.kv.other_id') other_id, get_json_object(event_json,'$.kv.praise_count') praise_count, get_json_object(event_json,'$.kv.reply_count') reply_count, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='comment';

insert overwrite table "$APP".dwd_favorites_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.id') id, get_json_object(event_json,'$.kv.course_id') course_id, get_json_object(event_json,'$.kv.userid') userid, get_json_object(event_json,'$.kv.add_time') add_time, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='favorites';

insert overwrite table "$APP".dwd_praise_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.id') id, get_json_object(event_json,'$.kv.userid') userid, get_json_object(event_json,'$.kv.target_id') target_id, get_json_object(event_json,'$.kv.type') type, get_json_object(event_json,'$.kv.add_time') add_time, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='praise';

insert overwrite table "$APP".dwd_error_log PARTITION (dt='$do_date')
select mid_id, user_id, version_code, version_name, lang, source, os, area, model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat, get_json_object(event_json,'$.kv.errorBrief') errorBrief, get_json_object(event_json,'$.kv.errorDetail') errorDetail, server_time
from "$APP".dwd_base_event_log
where dt='$do_date' and event_name='error';"

$hive -e "$sql"
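All of the per-event INSERT statements in this script follow one pattern: the same common columns, a target table, a list of $.kv fields, and an event_name filter. As an aside, such repetitive SQL can be generated from a template; the sketch below lists only two of the events and is illustrative, not part of the original scripts:

```python
# Hypothetical event registry: event_name -> (target table, $.kv fields)
EVENTS = {
    "display": ("dwd_display_log", ["action", "goodsid", "place", "extend1", "category"]),
    "error": ("dwd_error_log", ["errorBrief", "errorDetail"]),
}

COMMON = ("mid_id, user_id, version_code, version_name, lang, source, os, area, "
          "model, brand, sdk_version, gmail, height_width, app_time, network, lng, lat")

def build_insert(event_name: str, do_date: str, app: str = "gmall") -> str:
    """Render one per-event INSERT following the pattern used in the script."""
    table, kv_fields = EVENTS[event_name]
    kv = ", ".join(f"get_json_object(event_json,'$.kv.{f}') {f}" for f in kv_fields)
    return (
        f"insert overwrite table {app}.{table} PARTITION (dt='{do_date}')\n"
        f"select {COMMON}, {kv}, server_time\n"
        f"from {app}.dwd_base_event_log\n"
        f"where dt='{do_date}' and event_name='{event_name}';"
    )

print(build_insert("error", "2020-03-10"))
```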
III. DWD Layer: Business Data
3.1 Warehouse Modeling
3.2 Dimension Tables
1. Product dimension table dwd_dim_sku_info (full refresh)
(1) Data sources
ods_sku_info, ods_base_trademark, ods_spu_info, ods_base_category3, ods_base_category2, ods_base_category1.
(2) Create the table
DROP TABLE IF EXISTS `dwd_dim_sku_info`;
CREATE EXTERNAL TABLE `dwd_dim_sku_info` (
`id` string COMMENT 'sku id',
`spu_id` string COMMENT 'spu id',
`price` double COMMENT 'price',
`sku_name` string COMMENT 'sku name',
`sku_desc` string COMMENT 'sku description',
`weight` double COMMENT 'weight',
`tm_id` string COMMENT 'brand id',
`tm_name` string COMMENT 'brand name',
`category3_id` string COMMENT 'level-3 category id',
`category2_id` string COMMENT 'level-2 category id',
`category1_id` string COMMENT 'level-1 category id',
`category3_name` string COMMENT 'level-3 category name',
`category2_name` string COMMENT 'level-2 category name',
`category1_name` string COMMENT 'level-1 category name',
`spu_name` string COMMENT 'spu name',
`create_time` string COMMENT 'create time'
)
COMMENT 'product dimension table'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_dim_sku_info/'
tblproperties ("parquet.compression"="lzo");
(3) Load the data
insert overwrite table dwd_dim_sku_info partition(dt='2020-03-10')
select sku.id, sku.spu_id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.tm_id, ob.tm_name, sku.category3_id, c2.id category2_id, c1.id category1_id, c3.name category3_name, c2.name category2_name, c1.name category1_name, spu.spu_name, sku.create_time
from
(
    select * from ods_sku_info where dt='2020-03-10'
) sku
join
(
    select * from ods_base_trademark where dt='2020-03-10'
) ob on sku.tm_id = ob.tm_id
join
(
    select * from ods_spu_info where dt='2020-03-10'
) spu on spu.id = sku.spu_id
join
(
    select * from ods_base_category3 where dt='2020-03-10'
) c3 on sku.category3_id = c3.id
join
(
    select * from ods_base_category2 where dt='2020-03-10'
) c2 on c3.category2_id = c2.id
join
(
    select * from ods_base_category1 where dt='2020-03-10'
) c1 on c2.category1_id = c1.id;
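The six-way join above is dimension degeneration in action: brand, spu, and the three category levels are folded into one wide SKU row. In miniature, with made-up rows:

```python
# Made-up miniature versions of the ODS tables
sku = {"id": "1", "spu_id": "10", "tm_id": "100", "category3_id": "61", "sku_name": "phone-64G"}
trademark = {"100": "HTC"}
spu = {"10": "phone"}
cat3 = {"61": {"name": "smartphones", "category2_id": "40"}}
cat2 = {"40": {"name": "phones", "category1_id": "2"}}
cat1 = {"2": "electronics"}

# Degenerate the snowflaked dimensions into one wide row, as the join does
c3 = cat3[sku["category3_id"]]
c2 = cat2[c3["category2_id"]]
wide_row = {
    **sku,
    "tm_name": trademark[sku["tm_id"]],
    "spu_name": spu[sku["spu_id"]],
    "category3_name": c3["name"],
    "category2_id": c3["category2_id"],
    "category2_name": c2["name"],
    "category1_id": c2["category1_id"],
    "category1_name": cat1[c2["category1_id"]],
}
print(wide_row["tm_name"], wide_row["category1_name"])
```

Downstream queries then read one table instead of walking the category chain at query time.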
2. Coupon info table dwd_dim_coupon_info (full refresh)
(1) Data source
ods_coupon_info -> dwd_dim_coupon_info
(2) Create the table
drop table if exists dwd_dim_coupon_info;
create external table dwd_dim_coupon_info(
`id` string COMMENT 'coupon id',
`coupon_name` string COMMENT 'coupon name',
`coupon_type` string COMMENT 'coupon type: 1 cash, 2 discount, 3 amount threshold, 4 quantity-threshold discount',
`condition_amount` string COMMENT 'amount threshold',
`condition_num` string COMMENT 'quantity threshold',
`activity_id` string COMMENT 'activity id',
`benefit_amount` string COMMENT 'amount off',
`benefit_discount` string COMMENT 'discount rate',
`create_time` string COMMENT 'create time',
`range_type` string COMMENT 'scope type: 1 product, 2 category, 3 brand',
`spu_id` string COMMENT 'spu id',
`tm_id` string COMMENT 'brand id',
`category3_id` string COMMENT 'level-3 category id',
`limit_num` string COMMENT 'max number of claims',
`operate_time` string COMMENT 'update time',
`expire_time` string COMMENT 'expire time'
) COMMENT 'coupon info table'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_dim_coupon_info/'
tblproperties ("parquet.compression"="lzo");
(3) Load the data
insert overwrite table dwd_dim_coupon_info partition(dt='2020-03-10')
select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, spu_id, tm_id, category3_id, limit_num, operate_time, expire_time
from ods_coupon_info
where dt='2020-03-10';
3. Activity dimension table dwd_dim_activity_info (full refresh)
(1) Data sources
ods_activity_info, ods_activity_rule.
(2) Create the table
drop table if exists dwd_dim_activity_info;
create external table dwd_dim_activity_info(
`id` string COMMENT 'id',
`activity_name` string COMMENT 'activity name',
`activity_type` string COMMENT 'activity type',
`condition_amount` string COMMENT 'amount threshold',
`condition_num` string COMMENT 'quantity threshold',
`benefit_amount` string COMMENT 'amount off',
`benefit_discount` string COMMENT 'discount rate',
`benefit_level` string COMMENT 'benefit level',
`start_time` string COMMENT 'start time',
`end_time` string COMMENT 'end time',
`create_time` string COMMENT 'create time'
) COMMENT 'activity info table'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_dim_activity_info/'
tblproperties ("parquet.compression"="lzo");
(3) Load data
insert overwrite table dwd_dim_activity_info partition(dt='2020-03-10')
select info.id, info.activity_name, info.activity_type, rule.condition_amount, rule.condition_num, rule.benefit_amount, rule.benefit_discount, rule.benefit_level, info.start_time, info.end_time, info.create_time
from
( select * from ods_activity_info where dt='2020-03-10'
)info
left join
( select * from ods_activity_rule where dt='2020-03-10'
)rule on info.id = rule.activity_id;
4、Region dimension table dwd_dim_base_province (special)
(1) Data source
ods_base_province, ods_base_region
(2) Create the table
DROP TABLE IF EXISTS `dwd_dim_base_province`;
CREATE EXTERNAL TABLE `dwd_dim_base_province` ( `id` string COMMENT 'id', `province_name` string COMMENT '省市名称', `area_code` string COMMENT '地区编码', `iso_code` string COMMENT 'ISO编码', `region_id` string COMMENT '地区id', `region_name` string COMMENT '地区名称'
) COMMENT '地区省市表'
stored as parquet
location '/warehouse/gmall/dwd/dwd_dim_base_province/'
tblproperties ("parquet.compression"="lzo");
(3) Load data
insert overwrite table dwd_dim_base_province
select bp.id, bp.name, bp.area_code, bp.iso_code, bp.region_id, br.region_name
from ods_base_province bp
join ods_base_region br
on bp.region_id=br.id;
5、Date dimension table dwd_dim_date_info (special, reserved)
(1) Create the table
DROP TABLE IF EXISTS `dwd_dim_date_info`;
CREATE EXTERNAL TABLE `dwd_dim_date_info`( `date_id` string COMMENT '日', `week_id` int COMMENT '周', `week_day` int COMMENT '周的第几天', `day` int COMMENT '每月的第几天', `month` int COMMENT '第几月', `quarter` int COMMENT '第几季度', `year` int COMMENT '年', `is_workday` int COMMENT '是否是工作日', `holiday_id` int COMMENT '是否是节假日')
row format delimited fields terminated by '\t'
location '/warehouse/gmall/dwd/dwd_dim_date_info/';
(2) Upload the date_info.txt file to /opt/module/db_log/ on hadoop151
(3) Load data
load data local inpath '/opt/module/db_log/date_info.txt' into table dwd_dim_date_info;
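The exact contents of date_info.txt are not shown in the article. A hedged sketch of how such a file could be generated, assuming the column order from the DDL above (date_id, week_id, week_day, day, month, quarter, year, is_workday, holiday_id) and tab separators; the workday heuristic and the holiday_id placeholder are illustrative assumptions, not from the original:

```python
from datetime import date

def date_row(d):
    """Build one tab-separated date_info row for a given date.
    Column order follows the dwd_dim_date_info DDL; is_workday uses a
    simple weekday heuristic and holiday_id is a placeholder."""
    return "\t".join(str(x) for x in [
        d.isoformat(),                      # date_id
        d.isocalendar()[1],                 # week_id (ISO week number)
        d.isoweekday(),                     # week_day (1 = Monday)
        d.day,                              # day of month
        d.month,                            # month
        (d.month - 1) // 3 + 1,             # quarter
        d.year,                             # year
        1 if d.isoweekday() <= 5 else 0,    # is_workday (Mon-Fri heuristic)
        0,                                  # holiday_id (placeholder)
    ])

print(date_row(date(2020, 3, 10)))
```

Writing one such row per day of the target year produces a file that the `load data local inpath` statement above can ingest.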
3.3 Fact tables
1、Order detail fact table dwd_fact_order_detail (transactional fact table)
(1) Related dimensions
(2) Data source
ods_order_detail, ods_order_info
(3) Create the table
drop table if exists dwd_fact_order_detail;
create external table dwd_fact_order_detail ( `id` string COMMENT '订单编号', `order_id` string COMMENT '订单号', `user_id` string COMMENT '用户id', `sku_id` string COMMENT 'sku商品id', `sku_name` string COMMENT '商品名称', `order_price` decimal(10,2) COMMENT '商品价格', `sku_num` bigint COMMENT '商品数量', `create_time` string COMMENT '创建时间', `province_id` string COMMENT '省份ID', `total_amount` decimal(20,2) COMMENT '订单总金额'
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_fact_order_detail/'
tblproperties ("parquet.compression"="lzo");
(4) Load data
insert overwrite table dwd_fact_order_detail partition(dt='2020-03-10')
select od.id, od.order_id, od.user_id, od.sku_id, od.sku_name, od.order_price, od.sku_num, od.create_time, oi.province_id, od.order_price*od.sku_num
from
( select * from ods_order_detail where dt='2020-03-10'
) od
join
(select * from ods_order_info where dt='2020-03-10'
) oi
on od.order_id=oi.id;
2、Payment fact table dwd_fact_payment_info (transactional fact table)
(1) Related dimensions
(2) Data source
ods_payment_info, ods_order_info
(3) Create the table
drop table if exists dwd_fact_payment_info;
create external table dwd_fact_payment_info ( `id` string COMMENT '', `out_trade_no` string COMMENT '对外业务编号', `order_id` string COMMENT '订单编号', `user_id` string COMMENT '用户编号', `alipay_trade_no` string COMMENT '支付宝交易流水编号', `payment_amount` decimal(16,2) COMMENT '支付金额', `subject` string COMMENT '交易内容', `payment_type` string COMMENT '支付类型', `payment_time` string COMMENT '支付时间', `province_id` string COMMENT '省份ID'
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_fact_payment_info/'
tblproperties ("parquet.compression"="lzo");
(4) Load data
insert overwrite table dwd_fact_payment_info partition(dt='2020-03-10')
select pi.id, pi.out_trade_no, pi.order_id, pi.user_id,pi.alipay_trade_no, pi.total_amount, pi.subject, pi.payment_type, pi.payment_time, oi.province_id
from
( select * from ods_payment_info where dt='2020-03-10'
) pi
join
( select id, province_id from ods_order_info where dt='2020-03-10'
) oi
on pi.order_id = oi.id;
3、Refund fact table dwd_fact_order_refund_info (transactional fact table)
(1) Related dimensions
(2) Data source
ods_order_refund_info
(3) Create the table
drop table if exists dwd_fact_order_refund_info;
create external table dwd_fact_order_refund_info( `id` string COMMENT '编号', `user_id` string COMMENT '用户ID', `order_id` string COMMENT '订单ID', `sku_id` string COMMENT '商品ID', `refund_type` string COMMENT '退款类型', `refund_num` bigint COMMENT '退款件数', `refund_amount` decimal(16,2) COMMENT '退款金额', `refund_reason_type` string COMMENT '退款原因类型', `create_time` string COMMENT '退款时间'
) COMMENT '退款事实表' PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_fact_order_refund_info/'
tblproperties ("parquet.compression"="lzo");
(4) Load data
insert overwrite table dwd_fact_order_refund_info partition(dt='2020-03-10')
select id, user_id, order_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, create_time
from ods_order_refund_info
where dt='2020-03-10';
4、Review fact table dwd_fact_comment_info (transactional fact table)
(1) Related dimensions
(2) Data source
ods_comment_info
(3) Create the table
drop table if exists dwd_fact_comment_info;
create external table dwd_fact_comment_info( `id` string COMMENT '编号', `user_id` string COMMENT '用户ID', `sku_id` string COMMENT '商品sku', `spu_id` string COMMENT '商品spu', `order_id` string COMMENT '订单ID', `appraise` string COMMENT '评价', `create_time` string COMMENT '评价时间'
) COMMENT '评价事实表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_fact_comment_info/'
tblproperties ("parquet.compression"="lzo");
(4) Load data
insert overwrite table dwd_fact_comment_info partition(dt='2020-03-10')
select id, user_id, sku_id, spu_id, order_id, appraise, create_time
from ods_comment_info
where dt='2020-03-10';
5、Cart fact table dwd_fact_cart_info (periodic snapshot fact table, daily snapshot)
(1) Introduction
Because the contents of a shopping cart keep changing, an incremental load is not suitable.
Instead, take a full snapshot once a day; unlike a transactional fact table, which imports only each day's new records, every daily load is a complete copy.
Drawback of periodic snapshot fact tables: the stored data volume grows large.
Mitigation: periodic snapshots are mostly valuable for recent dates, so snapshots that are too old lose their usefulness and can be deleted.
(2) Related dimensions
(3) Data source
ods_cart_info
(4) Create the table
drop table if exists dwd_fact_cart_info;
create external table dwd_fact_cart_info( `id` string COMMENT '编号', `user_id` string COMMENT '用户id', `sku_id` string COMMENT 'skuid', `cart_price` string COMMENT '放入购物车时价格', `sku_num` string COMMENT '数量', `sku_name` string COMMENT 'sku名称 (冗余)', `create_time` string COMMENT '创建时间', `operate_time` string COMMENT '修改时间', `is_ordered` string COMMENT '是否已经下单。1为已下单;0为未下单', `order_time` string COMMENT '下单时间'
) COMMENT '加购事实表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_fact_cart_info/'
tblproperties ("parquet.compression"="lzo");
(5) Load data
insert overwrite table dwd_fact_cart_info partition(dt='2020-03-10')
select id, user_id, sku_id, cart_price, sku_num, sku_name, create_time, operate_time, is_ordered, order_time
from ods_cart_info
where dt='2020-03-10';
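The retention idea from the introduction above (each daily partition is a complete snapshot, so old partitions can simply be dropped) can be sketched as follows. This is a hedged illustration, not from the original: the retention window and the partition list are assumptions, and the generated statements follow the standard Hive `alter table ... drop partition` syntax.

```python
from datetime import date, timedelta

def expired_partition_ddl(table, partitions, today, keep_days=30):
    """Given the dt partition values of a periodic snapshot table,
    return DROP PARTITION DDL for every snapshot older than the
    retention window (keep_days days before today)."""
    cutoff = today - timedelta(days=keep_days)
    ddl = []
    for dt in sorted(partitions):
        if date.fromisoformat(dt) < cutoff:
            ddl.append(f"alter table {table} drop partition (dt='{dt}');")
    return ddl

stmts = expired_partition_ddl(
    "dwd_fact_cart_info",
    ["2020-01-01", "2020-02-20", "2020-03-09", "2020-03-10"],
    today=date(2020, 3, 10),
    keep_days=30,
)
print(stmts)
```

Running the emitted statements daily keeps the table bounded to roughly one month of snapshots.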
6、Favorites fact table dwd_fact_favor_info (periodic snapshot fact table, daily snapshot)
(1) Introduction
The favorite flag (whether a favorite has been cancelled) changes over time, so an incremental load is not suitable.
Instead, take a full snapshot once a day; unlike a transactional fact table, which imports only each day's new records, every daily load is a complete copy.
(2) Related dimensions
(3) Data source
ods_favor_info
(4) Create the table
drop table if exists dwd_fact_favor_info;
create external table dwd_fact_favor_info( `id` string COMMENT '编号', `user_id` string COMMENT '用户id', `sku_id` string COMMENT 'skuid', `spu_id` string COMMENT 'spuid', `is_cancel` string COMMENT '是否取消', `create_time` string COMMENT '收藏时间', `cancel_time` string COMMENT '取消时间'
) COMMENT '收藏事实表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_fact_favor_info/'
tblproperties ("parquet.compression"="lzo");
(5) Load data
insert overwrite table dwd_fact_favor_info partition(dt='2020-03-10')
select id, user_id, sku_id, spu_id, is_cancel, create_time, cancel_time
from ods_favor_info
where dt='2020-03-10';
7、Coupon usage fact table dwd_fact_coupon_use (accumulating snapshot fact table)
(1) Introduction
Coupon lifecycle: claim the coupon -> place an order with the coupon -> pay with the coupon.
An accumulating snapshot fact table supports counting coupon claims, orders placed with coupons, and payments made with coupons.
(2) Related dimensions
(3) Data source
ods_coupon_use, dwd_fact_coupon_use
(4) Create the table
drop table if exists dwd_fact_coupon_use;
create external table dwd_fact_coupon_use( `id` string COMMENT '编号', `coupon_id` string COMMENT '优惠券ID', `user_id` string COMMENT 'userid', `order_id` string COMMENT '订单id', `coupon_status` string COMMENT '优惠券状态', `get_time` string COMMENT '领取时间', `using_time` string COMMENT '使用时间(下单)', `used_time` string COMMENT '使用时间(支付)'
) COMMENT '优惠券领用事实表'
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_fact_coupon_use/'
tblproperties ("parquet.compression"="lzo");
Note: dt is partitioned by the coupon claim time get_time.
(5) Load data
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_fact_coupon_use partition(dt)
select if(new.id is null,old.id,new.id), if(new.coupon_id is null,old.coupon_id,new.coupon_id), if(new.user_id is null,old.user_id,new.user_id), if(new.order_id is null,old.order_id,new.order_id), if(new.coupon_status is null,old.coupon_status,new.coupon_status), if(new.get_time is null,old.get_time,new.get_time), if(new.using_time is null,old.using_time,new.using_time), if(new.used_time is null,old.used_time,new.used_time), date_format(if(new.get_time is null,old.get_time,new.get_time),'yyyy-MM-dd')
from ( select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time from dwd_fact_coupon_use where dt in ( select date_format(get_time,'yyyy-MM-dd') from ods_coupon_use where dt='2020-03-10' )
)old
full outer join
( select id, coupon_id, user_id, order_id, coupon_status,get_time, using_time, used_time from ods_coupon_use where dt='2020-03-10'
)new on old.id=new.id;
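The load above merges today's changed rows ("new") with the rows already accumulated in the fact table ("old") via a full outer join, taking each column from new when present and falling back to old otherwise (the repeated `if(new.x is null, old.x, new.x)` pattern). A minimal sketch of that merge in Python, not the author's Hive SQL; the coupon_status codes are illustrative values, not from the original:

```python
def merge_accumulating(old_rows, new_rows):
    """Simulate the accumulating-snapshot merge.
    old_rows / new_rows: dicts keyed by row id, values are field dicts.
    Union of ids = full outer join on id; per field, prefer the new
    value unless it is null (None)."""
    merged = {}
    for rid in set(old_rows) | set(new_rows):   # full outer join on id
        old = old_rows.get(rid, {})
        new = new_rows.get(rid, {})
        # if(new.col is null, old.col, new.col) for every column
        merged[rid] = {f: new.get(f) if new.get(f) is not None else old.get(f)
                       for f in set(old) | set(new)}
    return merged

# Coupon 1 was claimed yesterday and paid today; coupon 2 is claimed today.
old = {"1": {"coupon_status": "1401", "get_time": "2020-03-09", "used_time": None}}
new = {"1": {"coupon_status": "1403", "get_time": None, "used_time": "2020-03-10"},
       "2": {"coupon_status": "1401", "get_time": "2020-03-10", "used_time": None}}
result = merge_accumulating(old, new)
print(result["1"])
```

Note how coupon 1 keeps its original get_time (new is null there) while picking up today's status and used_time, which is exactly why the dt partition can still be derived from get_time after the merge.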
8、Order fact table dwd_fact_order_info (accumulating snapshot fact table)
(1) Introduction
Order lifecycle: creation time -> payment time -> cancellation time -> completion time -> refund time -> refund completion time.
Because the ODS-layer order table records only a creation time and an operation time, it cannot express every lifecycle state, so it must be joined with the order status log table. The order fact table also adds an activity id, so it must be joined with the activity order table as well.
(2) Related dimensions
(3) Data source
ods_order_info, ods_order_status_log, ods_activity_order, dwd_fact_order_info
(4) Create the table
drop table if exists dwd_fact_order_info;
create external table dwd_fact_order_info ( `id` string COMMENT '订单编号', `order_status` string COMMENT '订单状态',`user_id` string COMMENT '用户id', `out_trade_no` string COMMENT '支付流水号', `create_time` string COMMENT '创建时间(未支付状态)', `payment_time` string COMMENT '支付时间(已支付状态)', `cancel_time` string COMMENT '取消时间(已取消状态)', `finish_time` string COMMENT '完成时间(已完成状态)', `refund_time` string COMMENT '退款时间(退款中状态)', `refund_finish_time` string COMMENT '退款完成时间(退款完成状态)', `province_id` string COMMENT '省份ID', `activity_id` string COMMENT '活动ID', `original_total_amount` string COMMENT '原价金额', `benefit_reduce_amount` string COMMENT '优惠金额', `feight_fee` string COMMENT '运费', `final_total_amount` decimal(10,2) COMMENT '订单金额'
)
PARTITIONED BY (`dt` string)
stored as parquet
location '/warehouse/gmall/dwd/dwd_fact_order_info/'
tblproperties ("parquet.compression"="lzo");
(5) Load data
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table dwd_fact_order_info partition(dt)
select if(new.id is null,old.id,new.id), if(new.order_status is null,old.order_status,new.order_status), if(new.user_id is null,old.user_id,new.user_id), if(new.out_trade_no is null,old.out_trade_no,new.out_trade_no), if(new.tms['1001'] is null,old.create_time,new.tms['1001']),if(new.tms['1002'] is null,old.payment_time,new.tms['1002']), if(new.tms['1003'] is null,old.cancel_time,new.tms['1003']), if(new.tms['1004'] is null,old.finish_time,new.tms['1004']), if(new.tms['1005'] is null,old.refund_time,new.tms['1005']), if(new.tms['1006'] is null,old.refund_finish_time,new.tms['1006']), if(new.province_id is null,old.province_id,new.province_id), if(new.activity_id is null,old.activity_id,new.activity_id), if(new.original_total_amount is null,old.original_total_amount,new.original_total_amount), if(new.benefit_reduce_amount is null,old.benefit_reduce_amount,new.benefit_reduce_amount), if(new.feight_fee is null,old.feight_fee,new.feight_fee), if(new.final_total_amount is null,old.final_total_amount,new.final_total_amount), date_format(if(new.tms['1001'] is null,old.create_time,new.tms['1001']),'yyyy-MM-dd')
from ( select id,order_status, user_id, out_trade_no, create_time, payment_time, cancel_time, finish_time, refund_time, refund_finish_time, province_id, activity_id, original_total_amount, benefit_reduce_amount, feight_fee, final_total_amount from dwd_fact_order_info where dt in ( select date_format(create_time,'yyyy-MM-dd') from ods_order_info where dt='2020-03-10' )
)old
full outer join
( select info.id, info.order_status, info.user_id, info.out_trade_no, info.province_id, act.activity_id, log.tms, info.original_total_amount, info.benefit_reduce_amount, info.feight_fee, info.final_total_amount from ( select order_id, str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') tms from ods_order_status_log where dt='2020-03-10' group by order_id )log join ( select * from ods_order_info where dt='2020-03-10' )info on log.order_id=info.id left join ( select * from ods_activity_order where dt='2020-03-10' )act on log.order_id=act.order_id
)new
on old.id=new.id;
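The key trick in the load above is `str_to_map(concat_ws(',', collect_set(concat(order_status, '=', operate_time))), ',', '=')`: per order_id, the status log rows collapse into one map from status code to timestamp, so `tms['1001']` yields the creation time, `tms['1002']` the payment time, and so on. A minimal Python sketch of what that aggregation produces (the sample timestamps are illustrative):

```python
def build_tms(status_log_rows):
    """Simulate the str_to_map/collect_set aggregation: collapse
    (order_id, order_status, operate_time) log rows into one
    {order_status: operate_time} map per order_id."""
    tms = {}
    for order_id, status, operate_time in status_log_rows:
        tms.setdefault(order_id, {})[status] = operate_time
    return tms

rows = [
    ("3", "1001", "2020-03-10 10:00:00"),   # order created
    ("3", "1002", "2020-03-10 10:05:00"),   # order paid
]
tms = build_tms(rows)
print(tms["3"]["1002"])
```

Statuses that have not occurred yet are simply absent from the map, which is why the outer query wraps each lookup in `if(new.tms['100x'] is null, old..., ...)`.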
9、DWD-layer business data import script
#!/bin/bash

APP=gmall
hive=/opt/module/hive/bin/hive

# If a date argument is given, use it; otherwise default to yesterday
if [ -n "$2" ]; then
    do_date=$2
else
    do_date=`date -d "-1 day" +%F`
fi

sql1="
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table ${APP}.dwd_dim_sku_info partition(dt='$do_date')
select sku.id, sku.spu_id, sku.price, sku.sku_name, sku.sku_desc, sku.weight, sku.tm_id, ob.tm_name, sku.category3_id, c2.id category2_id, c1.id category1_id, c3.name category3_name, c2.name category2_name, c1.name category1_name, spu.spu_name, sku.create_time
from
( select * from ${APP}.ods_sku_info where dt='$do_date'
)sku
join
( select * from ${APP}.ods_base_trademark where dt='$do_date'
)ob on sku.tm_id=ob.tm_id
join
( select * from ${APP}.ods_spu_info where dt='$do_date'
)spu on spu.id = sku.spu_id
join
( select * from ${APP}.ods_base_category3 where dt='$do_date'
)c3 on sku.category3_id=c3.id
join
( select * from ${APP}.ods_base_category2 where dt='$do_date'
)c2 on c3.category2_id=c2.id
join
( select * from ${APP}.ods_base_category1 where dt='$do_date'
)c1 on c2.category1_id=c1.id;

insert overwrite table ${APP}.dwd_dim_coupon_info partition(dt='$do_date')
select id, coupon_name, coupon_type, condition_amount, condition_num, activity_id, benefit_amount, benefit_discount, create_time, range_type, spu_id, tm_id, category3_id, limit_num, operate_time, expire_time
from ${APP}.ods_coupon_info
where dt='$do_date';

insert overwrite table ${APP}.dwd_dim_activity_info partition(dt='$do_date')
select info.id, info.activity_name, info.activity_type, rule.condition_amount, rule.condition_num, rule.benefit_amount, rule.benefit_discount, rule.benefit_level, info.start_time, info.end_time, info.create_time
from
( select * from ${APP}.ods_activity_info where dt='$do_date'
)info
left join
( select * from ${APP}.ods_activity_rule where dt='$do_date'
)rule on info.id = rule.activity_id;

insert overwrite table ${APP}.dwd_fact_order_detail partition(dt='$do_date')
select od.id, od.order_id, od.user_id, od.sku_id, od.sku_name, od.order_price, od.sku_num, od.create_time, oi.province_id, od.order_price*od.sku_num
from
( select * from ${APP}.ods_order_detail where dt='$do_date'
) od
join
( select * from ${APP}.ods_order_info where dt='$do_date'
) oi on od.order_id=oi.id;

insert overwrite table ${APP}.dwd_fact_payment_info partition(dt='$do_date')
select pi.id, pi.out_trade_no, pi.order_id, pi.user_id, pi.alipay_trade_no, pi.total_amount, pi.subject, pi.payment_type, pi.payment_time, oi.province_id
from
( select * from ${APP}.ods_payment_info where dt='$do_date'
)pi
join
(select id, province_id from ${APP}.ods_order_info where dt='$do_date'
)oi
on pi.order_id = oi.id;

insert overwrite table ${APP}.dwd_fact_order_refund_info partition(dt='$do_date')
select id, user_id, order_id, sku_id, refund_type, refund_num, refund_amount, refund_reason_type, create_time
from ${APP}.ods_order_refund_info
where dt='$do_date';

insert overwrite table ${APP}.dwd_fact_comment_info partition(dt='$do_date')
select id, user_id, sku_id, spu_id, order_id, appraise, create_time
from ${APP}.ods_comment_info
where dt='$do_date';

insert overwrite table ${APP}.dwd_fact_cart_info partition(dt='$do_date')
select id, user_id, sku_id, cart_price, sku_num, sku_name, create_time, operate_time, is_ordered, order_time
from ${APP}.ods_cart_info
where dt='$do_date';

insert overwrite table ${APP}.dwd_fact_favor_info partition(dt='$do_date')
select id, user_id, sku_id, spu_id, is_cancel, create_time, cancel_time
from ${APP}.ods_favor_info
where dt='$do_date';

insert overwrite table ${APP}.dwd_fact_coupon_use partition(dt)
select if(new.id is null,old.id,new.id), if(new.coupon_id is null,old.coupon_id,new.coupon_id), if(new.user_id is null,old.user_id,new.user_id), if(new.order_id is null,old.order_id,new.order_id),if(new.coupon_status is null,old.coupon_status,new.coupon_status), if(new.get_time is null,old.get_time,new.get_time), if(new.using_time is null,old.using_time,new.using_time), if(new.used_time is null,old.used_time,new.used_time), date_format(if(new.get_time is null,old.get_time,new.get_time),'yyyy-MM-dd')
from
( select id, coupon_id, user_id, order_id, coupon_status, get_time, using_time, used_time from ${APP}.dwd_fact_coupon_use where dt in ( select date_format(get_time,'yyyy-MM-dd')from ${APP}.ods_coupon_use where dt='$do_date' )
)old
full outer join
( select id, coupon_id,user_id, order_id, coupon_status, get_time, using_time, used_time from ${APP}.ods_coupon_use where dt='$do_date'
)new on old.id=new.id;

insert overwrite table ${APP}.dwd_fact_order_info partition(dt)
select if(new.id is null,old.id,new.id), if(new.order_status is null,old.order_status,new.order_status), if(new.user_id is null,old.user_id,new.user_id), if(new.out_trade_no is null,old.out_trade_no,new.out_trade_no), if(new.tms['1001'] is null,old.create_time,new.tms['1001']),if(new.tms['1002'] is null,old.payment_time,new.tms['1002']), if(new.tms['1003'] is null,old.cancel_time,new.tms['1003']), if(new.tms['1004'] is null,old.finish_time,new.tms['1004']), if(new.tms['1005'] is null,old.refund_time,new.tms['1005']), if(new.tms['1006'] is null,old.refund_finish_time,new.tms['1006']), if(new.province_id is null,old.province_id,new.province_id), if(new.activity_id is null,old.activity_id,new.activity_id), if(new.original_total_amount is null,old.original_total_amount,new.original_total_amount), if(new.benefit_reduce_amount is null,old.benefit_reduce_amount,new.benefit_reduce_amount), if(new.feight_fee is null,old.feight_fee,new.feight_fee), if(new.final_total_amount is null,old.final_total_amount,new.final_total_amount),date_format(if(new.tms['1001'] is null,old.create_time,new.tms['1001']),'yyyy-MM-dd')
from
(
select id, order_status, user_id, out_trade_no, create_time, payment_time, cancel_time, finish_time, refund_time, refund_finish_time, province_id, activity_id, original_total_amount, benefit_reduce_amount, feight_fee, final_total_amount
from ${APP}.dwd_fact_order_info
where dt in (select date_format(create_time,'yyyy-MM-dd') from ${APP}.ods_order_info where dt='$do_date')
) old
full outer join
(
select info.id, info.order_status, info.user_id, info.out_trade_no, info.province_id, act.activity_id, log.tms, info.original_total_amount, info.benefit_reduce_amount, info.feight_fee, info.final_total_amount
from ( select order_id, str_to_map(concat_ws(',',collect_set(concat(order_status,'=',operate_time))),',','=') tms from ${APP}.ods_order_status_log where dt='$do_date' group by order_id ) log
join ( select * from ${APP}.ods_order_info where dt='$do_date' ) info on log.order_id=info.id
left join ( select * from ${APP}.ods_activity_order where dt='$do_date' ) act on log.order_id=act.order_id
) new on old.id=new.id;

insert overwrite table ${APP}.dwd_dim_user_info_his_tmp
select *
from
( select id,name, birthday, gender, email, user_level, create_time, operate_time, '$do_date' start_date, '9999-99-99' end_date from ${APP}.ods_user_info where dt='$do_date' union all select uh.id, uh.name, uh.birthday, uh.gender, uh.email, uh.user_level, uh.create_time, uh.operate_time, uh.start_date, if(ui.id is not null and uh.end_date='9999-99-99', date_add(ui.dt,-1), uh.end_date) end_date from ${APP}.dwd_dim_user_info_his uh left join ( select * from ${APP}.ods_user_info where dt='$do_date' ) ui on uh.id=ui.id
)his
order by his.id, start_date;

insert overwrite table ${APP}.dwd_dim_user_info_his
select *
from ${APP}.dwd_dim_user_info_his_tmp;
" sql2="
insert overwrite table ${APP}.dwd_dim_base_province
select bp.id, bp.name, bp.area_code, bp.iso_code, bp.region_id, br.region_name
from ${APP}.ods_base_province bp
join ${APP}.ods_base_region br
on bp.region_id=br.id;"case $1 in
"first")
{
    $hive -e "$sql1"
    $hive -e "$sql2"
};;
"all")
{
    $hive -e "$sql1"
};;
esac
四、Zipper tables
1、What is a zipper table?
2、Why build a zipper table?
3、How is a zipper table used?
4、How is a zipper table formed?
5、Zipper table build flowchart
6、User dimension zipper table
Step 1: initialize the zipper table (run standalone, first time only)
(1) Create the zipper table
drop table if exists dwd_dim_user_info_his;
create external table dwd_dim_user_info_his( `id` string COMMENT '用户id', `name` string COMMENT '姓名', `birthday` string COMMENT '生日', `gender` string COMMENT '性别', `email` string COMMENT '邮箱', `user_level` string COMMENT '用户等级', `create_time` string COMMENT '创建时间', `operate_time` string COMMENT '操作时间', `start_date` string COMMENT '有效开始日期', `end_date` string COMMENT '有效结束日期'
) COMMENT '用户拉链表'
stored as parquet
location '/warehouse/gmall/dwd/dwd_dim_user_info_his/'
tblproperties ("parquet.compression"="lzo");
(2) Initialize the zipper table
insert overwrite table dwd_dim_user_info_his
select id, name, birthday, gender, email, user_level, create_time, operate_time, '2020-03-10','9999-99-99'
from ods_user_info oi
where oi.dt='2020-03-10';
Step 2: build the day's changed data (inserts and updates), run daily
(1) How to obtain the daily change set
A、Ideally the table itself carries a creation time and a modification time.
B、If not, a third-party tool such as Canal can monitor MySQL changes in real time and record them.
C、Compare the two days' data row by row and check whether md5(concat(all fields that might change)) matches (a crude approach).
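Option C's fingerprint check can be sketched as follows. The field list here is illustrative, not from the original; the point is that hashing the concatenation of all mutable fields gives a cheap per-row fingerprint to compare across two days:

```python
import hashlib

def row_fingerprint(row, fields):
    """md5(concat(fields that might change)) for one row, mirroring
    the md5(concat(...)) check described above."""
    joined = "".join(str(row.get(f, "")) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# Hypothetical user rows: only user_level changed between the two days.
fields = ["name", "email", "user_level"]
yesterday = {"id": "7", "name": "a", "email": "a@x.com", "user_level": "1"}
today     = {"id": "7", "name": "a", "email": "a@x.com", "user_level": "2"}

changed = row_fingerprint(yesterday, fields) != row_fingerprint(today, fields)
print(changed)
```

Because every row of both days must be scanned and hashed, this really is the crudest of the three options; a change-data-capture tool like Canal avoids the full comparison.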
(2) Because ods_user_info is imported as a new-and-changed detail table to begin with, no extra processing is needed
A、Add one day of 2020-03-11 data in the source database.
B、Import all of the 2020-03-11 data via Sqoop.
C、Load the data into the ODS layer.
Step 3: merge the changed records first, then append the new ones, and insert the result into a temporary table
(1) Create the temporary table
drop table if exists dwd_dim_user_info_his_tmp;
create external table dwd_dim_user_info_his_tmp( `id` string COMMENT '用户id', `name` string COMMENT '姓名', `birthday` string COMMENT '生日', `gender` string COMMENT '性别', `email` string COMMENT '邮箱', `user_level` string COMMENT '用户等级', `create_time` string COMMENT '创建时间', `operate_time` string COMMENT '操作时间', `start_date` string COMMENT '有效开始日期', `end_date` string COMMENT '有效结束日期'
) COMMENT '用户拉链临时表'
stored as parquet
location '/warehouse/gmall/dwd/dwd_dim_user_info_his_tmp/'
tblproperties ("parquet.compression"="lzo");
(2) Load data
insert overwrite table dwd_dim_user_info_his_tmp
select * from
( select id, name, birthday, gender, email,user_level, create_time, operate_time, '2020-03-11' start_date, '9999-99-99' end_date from ods_user_info where dt='2020-03-11' union all select uh.id, uh.name, uh.birthday, uh.gender, uh.email, uh.user_level, uh.create_time, uh.operate_time, uh.start_date, if(ui.id is not null and uh.end_date='9999-99-99', date_add(ui.dt,-1), uh.end_date) end_date from dwd_dim_user_info_his uh left join ( select * from ods_user_info where dt='2020-03-11' ) ui on uh.id=ui.id
)his
order by his.id, start_date;
Step 4: overwrite the zipper table with the temporary table
insert overwrite table dwd_dim_user_info_his select * from dwd_dim_user_info_his_tmp;
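The union-all logic of step 3 can be sketched as follows: today's rows enter as open records (end_date = '9999-99-99'), and any previously open record whose id reappears today is closed with end_date = today - 1 day. This is a Python illustration of the merge, not the author's Hive SQL; the sample user rows are illustrative:

```python
from datetime import date, timedelta

OPEN = "9999-99-99"   # sentinel end_date for the currently valid record

def zipper_merge(history, todays_rows, today):
    """Simulate the zipper-table daily merge (SCD Type 2):
    close superseded open records, append today's rows as open."""
    merged = []
    todays_ids = {r["id"] for r in todays_rows}
    for rec in history:
        rec = dict(rec)
        # if(ui.id is not null and uh.end_date='9999-99-99', date_add(ui.dt,-1), uh.end_date)
        if rec["id"] in todays_ids and rec["end_date"] == OPEN:
            rec["end_date"] = (date.fromisoformat(today)
                               - timedelta(days=1)).isoformat()
        merged.append(rec)
    # today's rows: start_date = today, end_date = open sentinel
    merged += [dict(r, start_date=today, end_date=OPEN) for r in todays_rows]
    return merged

history = [{"id": "1", "user_level": "1",
            "start_date": "2020-03-10", "end_date": OPEN}]
todays = [{"id": "1", "user_level": "2"}]
result = zipper_merge(history, todays, "2020-03-11")
print(result)
```

Querying "the state as of day D" then reduces to `start_date <= D <= end_date`, which is exactly what the zipper table exists to answer.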
五、DWD layer summary
1、The DWD layer uses parquet storage with LZO compression.
2、The DWD layer is a key layer of the data warehouse; dimensional modeling is completed here.
3、The DWD layer contains 12 user behavior tables and 14 business data tables, 26 tables in total.