Using Hive complex data types in a data warehouse (array, map, struct, and their combinations)

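Context: when building wide tables, you may choose complex types so that a single row can store more information.

Complex data types: array, map, struct
1. array: all elements must have the same type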

more hive_array.txt
ruoze	beijing,shanghai,tianjin,hangzhou
jepson	changchun,chengdu,wuhan,beijing

# Create the table
create table hive_array(name string, work_locations array<string>)
row format delimited fields terminated by '\t'
collection items terminated by ',';

hive> desc formatted hive_array;

#col_name data_type comment
name string
work_locations array<string>

# Load the local file
load data local inpath '/home/hadoop/data/hive_array.txt' overwrite into table hive_array;

# Query the data
hive> select * from hive_array;
OK
ruoze ["beijing","shanghai","tianjin","hangzhou"]
jepson ["changchun","chengdu","wuhan","beijing"]

hive> select name, size(work_locations) from hive_array;
ruoze 4
jepson 4

hive> select name, work_locations[0] from hive_array;
ruoze beijing
jepson changchun

hive> select * from hive_array where array_contains(work_locations, 'tianjin');
ruoze ["beijing","shanghai","tianjin","hangzhou"]
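To make the delimiter handling concrete, here is a minimal Python sketch of how one text row maps onto (name string, work_locations array<string>), followed by rough equivalents of size(), index access, and array_contains(). The helper parse_array_row is a hypothetical name used only for illustration; it is not Hive's SerDe code.

```python
# Hypothetical helper: models "fields terminated by '\t'" and
# "collection items terminated by ','". Not Hive's actual implementation.
def parse_array_row(line, field_delim="\t", item_delim=","):
    name, raw_locations = line.split(field_delim, 1)
    return name, raw_locations.split(item_delim)

rows = [
    parse_array_row("ruoze\tbeijing,shanghai,tianjin,hangzhou"),
    parse_array_row("jepson\tchangchun,chengdu,wuhan,beijing"),
]

# select name, size(work_locations)
sizes = [(name, len(locs)) for name, locs in rows]

# select name, work_locations[0]
firsts = [(name, locs[0]) for name, locs in rows]

# where array_contains(work_locations, 'tianjin')
in_tianjin = [name for name, locs in rows if "tianjin" in locs]

print(sizes)
print(firsts)
print(in_tianjin)
```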

2. map, e.g. Map('a'#1,'b'#2)
more hive_map.txt
1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

# Create the table
create table hive_map(id int,name string, family map<string,string>,age int)
row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

# Describe the table
hive> desc formatted hive_map;
OK
#col_name data_type comment

id int
name string
family map<string,string>
age int

# Load the local data
load data local inpath '/home/hadoop/data/hive_map.txt'
overwrite into table hive_map;

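Querying
Using map: a map is a set of key-value pairs of the form (key, value). Keys must be unique within a map; duplicate keys collapse into a single entry, so only one of their values survives.
Map structure: fieldName(k1:v1,k2:v2,…)
Value access syntax: fieldName['key'], using square brackets ([])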

hive> select * from hive_map;

1 zhangsan {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} 28
2 lisi {"father":"mayun","mother":"huangyi","brother":"guanyu"} 22
3 wangwu {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} 29
4 mayun {"father":"mayongzhen","mother":"angelababy"} 26

hive> select id,name,family['father'] as father, family['sister'] from hive_map;
OK
1 zhangsan xiaoming NULL
2 lisi mayun NULL
3 wangwu wangjianlin jingtian
4 mayun mayongzhen NULL

hive> select id,name,map_keys(family) from hive_map;
OK
1 zhangsan ["father","mother","brother"]
2 lisi ["father","mother","brother"]
3 wangwu ["father","mother","sister"]
4 mayun ["father","mother"]

hive> select id,name,map_values(family) from hive_map;
OK
1 zhangsan ["xiaoming","xiaohuang","xiaoxu"]
2 lisi ["mayun","huangyi","guanyu"]
3 wangwu ["wangjianlin","ruhua","jingtian"]
4 mayun ["mayongzhen","angelababy"]

hive> select id,name,size(family) from hive_map;
OK
1 zhangsan 3
2 lisi 3
3 wangwu 3
4 mayun 2

hive> select id,name,family['brother'] from hive_map where array_contains(map_keys(family),'brother');
OK
1 zhangsan xiaoxu
2 lisi guanyu
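The three delimiters of hive_map can be pictured with a short Python sketch: ',' splits the fields, '#' splits the map entries, and ':' splits each key from its value. parse_map_row is a hypothetical helper written for this illustration, and a missing key is modelled as None, which plays the role of NULL in the query results above.

```python
# Hypothetical helper: models the hive_map row format
# (fields ',', collection items '#', map keys ':'). Not Hive's SerDe code.
def parse_map_row(line, field_delim=",", item_delim="#", key_delim=":"):
    id_, name, raw_family, age = line.split(field_delim)
    family = dict(entry.split(key_delim, 1) for entry in raw_family.split(item_delim))
    return {"id": int(id_), "name": name, "family": family, "age": int(age)}

lines = [
    "1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28",
    "2,lisi,father:mayun#mother:huangyi#brother:guanyu,22",
    "3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29",
    "4,mayun,father:mayongzhen#mother:angelababy,26",
]
rows = [parse_map_row(line) for line in lines]

# select id, family['father'], family['sister']  (missing key -> None)
fathers_sisters = [
    (r["id"], r["family"].get("father"), r["family"].get("sister")) for r in rows
]

# select id, size(family)
family_sizes = [(r["id"], len(r["family"])) for r in rows]

# where array_contains(map_keys(family), 'brother')
brothers = [
    (r["id"], r["name"], r["family"]["brother"])
    for r in rows
    if "brother" in r["family"]
]

print(fathers_sisters)
print(family_sizes)
print(brothers)
```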

3. struct
// Raw data
cat hive_struct.txt
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70

// Create the table and load the data
create table hive_struct(ip string,userinfo struct<name:string,age:int>)
row format delimited fields terminated by '#'
collection items terminated by ':';

# Load the data
load data local inpath '/home/hadoop/data/hive_struct.txt'
overwrite into table hive_struct;

# Query the data
hive> select * from hive_struct;

192.168.1.1 {"name":"zhangsan","age":40}
192.168.1.2 {"name":"lisi","age":50}
192.168.1.3 {"name":"wangwu","age":60}
192.168.1.4 {"name":"zhaoliu","age":70}

// Accessing values
Using struct: a struct is defined as fieldName struct<field1:type1,field2:type2,…>, meaning the column is composed of several named fields.
To read one field, use fieldName.field1, that is, dot (.) access.

hive> select ip,userinfo.name,userinfo.age from hive_struct;
OK
192.168.1.1 zhangsan 40
192.168.1.2 lisi 50
192.168.1.3 wangwu 60
192.168.1.4 zhaoliu 70
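As a rough analogy, a struct behaves like a fixed record with named, typed fields. The Python sketch below parses the same rows into a NamedTuple and reads the fields with dot access; UserInfo and parse_struct_row are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class UserInfo(NamedTuple):
    """Stand-in for struct<name:string,age:int>."""
    name: str
    age: int


# Hypothetical helper: fields are split on '#', struct members on ':'.
def parse_struct_row(line, field_delim="#", item_delim=":"):
    ip, raw_userinfo = line.split(field_delim, 1)
    name, age = raw_userinfo.split(item_delim)
    return ip, UserInfo(name=name, age=int(age))


rows = [
    parse_struct_row(line)
    for line in [
        "192.168.1.1#zhangsan:40",
        "192.168.1.2#lisi:50",
        "192.168.1.3#wangwu:60",
        "192.168.1.4#zhaoliu:70",
    ]
]

# select ip, userinfo.name, userinfo.age
selected = [(ip, userinfo.name, userinfo.age) for ip, userinfo in rows]
print(selected)
```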

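4. Combining map and struct
Table definition

Take borrow_repay_record as an example: its key is phaseNumber and its value is a struct.
Its type is map<string,struct<…>>. The value of a map can itself be a complex type, which gives nested complex types. Access still follows the syntax rules of each individual type. For example, to read the duedate field of the value stored under the map key 'load',
the syntax is:

select borrow_repay_record['load'].duedate from dw_kuanbiao where dt='2019-02-12';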

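When you do not know the key (or do not care about it), how do you get the values you need? This is where expanding the map (turning one row into many) comes in.
Take the borrow_repay_record of the record whose user_id is 100000 (note that a user_id may match more than one row).
The result looks like this:
{"19":{"duedate":"2015-04-23 14:51:42","repayoverduemgmtfee":null,},
"18":{"duedate":"2015-03-23 14:51:42","repayoverduemgmtfee":null,},
"15":{"duedate":"2014-12-23 14:51:42","repayoverduemgmtfee":null,******},
"14":{"duedate":"2014-11-23 14:51:42","repayoverduemgmtfee":null,*****}}

As you can see, this single row holds a great deal of information.
To reach any field of any entry without looking it up by key, we need to break the row apart into (key, value) pairs, one per row, and still be able to combine them with the table's other columns. Hive provides functions for this.
The explode() function turns one row into many, but on its own it cannot combine the generated rows with the table's other columns.
LATERAL VIEW makes up for this, so the two are usually used together.
For example: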

SELECT user_id, phaseNumber, value
FROM dw_loan
LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber, value
WHERE dt = '2019-01-29' AND user_id = '100000';

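LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber,value splits the map into one row per entry and joins each of those rows back to the row it came from, producing multiple rows.
You can treat everything after from as a table, no different from the tables you normally query.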
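The row-multiplying behaviour can be sketched in Python. The generator below is a hypothetical stand-in for LATERAL VIEW explode on a map column: each map entry becomes one output row that carries the source row's other columns. The second input row (user_id 100001, empty map) is invented for illustration; it produces no output rows, which mirrors how a plain LATERAL VIEW (without OUTER) drops rows whose explode yields nothing.

```python
# Hypothetical model of LATERAL VIEW explode(map_column) AS phaseNumber, value.
def lateral_view_explode(rows, map_column):
    for row in rows:
        other_columns = {k: v for k, v in row.items() if k != map_column}
        for key, value in row[map_column].items():
            # One output row per map entry, joined with the remaining columns.
            yield {**other_columns, "phaseNumber": key, "value": value}


table = [
    {
        "user_id": "100000",
        "borrow_repay_record": {
            "19": {"duedate": "2015-04-23 14:51:42", "repayoverduemgmtfee": None},
            "18": {"duedate": "2015-03-23 14:51:42", "repayoverduemgmtfee": None},
        },
    },
    # Invented row with an empty map: contributes no exploded rows.
    {"user_id": "100001", "borrow_repay_record": {}},
]

exploded = list(lateral_view_explode(table, "borrow_repay_record"))
for row in exploded:
    print(row["user_id"], row["phaseNumber"], row["value"]["duedate"])
```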
So when you need the sum of a field, you can use it directly.

For example, to compute the total principal, where principal is a field of the map's value:

select sum(value.principal)
from dw_kubiao
LATERAL VIEW explode(borrow_repay_record) adTable AS phaseNumber, value
where user_id = '100000';
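In Python terms, this aggregation amounts to flattening every map value for the matching user and adding up one struct field. The amounts and the second user below are invented sample data; only the field name principal comes from the query.

```python
# Invented sample rows; each map value is a struct with a principal field.
records = [
    {
        "user_id": "100000",
        "borrow_repay_record": {
            "14": {"principal": 1000.0},
            "15": {"principal": 1500.0},
            "18": {"principal": None},  # SUM skips NULL values
        },
    },
    {"user_id": "100002", "borrow_repay_record": {"1": {"principal": 999.0}}},
]

total_principal = sum(
    value["principal"]
    for row in records
    if row["user_id"] == "100000"
    for value in row["borrow_repay_record"].values()
    if value["principal"] is not None
)
print(total_principal)
```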

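Note: the wide table uses Hive complex data types such as map and struct, as well as nested complex types such as map<string, struct>.

Although complex types let a single row hold more information, they also make loading more complicated. To simplify loading tables that contain complex types, an intermediate table is used: the data is first loaded into the intermediate table following the final table's structure, and an MR job then cleans the intermediate table so that the data meets the requirements of the complex types.
(That is, the data is loaded into temp_kubiao first and then into dw_kubiao.)

The map<string,struct> column replaces what was originally a string column.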

temp_kubiao table definition
…
COMMENT '标的信息表'
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.format'=',')
…
dw_kubiao table definition
…
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'colelction.delim'='|',
'field.delim'=',',
'mapkey.delim'=':',
'serialization.format'=','
)

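In the original string column, entries are separated by | into distinct key:struct pairs. The delimiter inside each struct is then changed from the original @ to \004, which is done by a separate MR job.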

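Note: Hive's current syntax cannot set both the delimiter between map entries and the delimiter between struct fields; only one of them can be changed. The remaining delimiters are taken in order from ASCII codes 1 through 8. Specifying '|' as the collection delimiter actually sets the delimiter between map entries, so the delimiter between struct fields falls back to the default '\004'. The data therefore only needs its struct delimiter changed to '\004'.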
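The cleaning step and the resulting delimiter levels can be sketched in Python. This is only an illustration of the idea, not the MR job itself: the function names, the sample string, and the struct field list (duedate, principal) are assumptions, and the sketch assumes '@' never appears inside a real value.

```python
ENTRY_DELIM = "|"      # 'colelction.delim': separates map entries
KEY_DELIM = ":"        # 'mapkey.delim': separates a key from its value
STRUCT_DELIM = "\x04"  # default separator at the next nesting level


def clean_map_column(raw, old_struct_delim="@"):
    """Swap the struct delimiter used by the source data for \\004."""
    return raw.replace(old_struct_delim, STRUCT_DELIM)


def parse_map_of_struct(cleaned, struct_fields):
    """Split a cleaned column into {key: {field: value}} by nesting level."""
    result = {}
    for entry in cleaned.split(ENTRY_DELIM):
        key, raw_struct = entry.split(KEY_DELIM, 1)
        result[key] = dict(zip(struct_fields, raw_struct.split(STRUCT_DELIM)))
    return result


# Invented sample value of the original string column.
raw = "14:2014-11-23@100|15:2014-12-23@200"
cleaned = clean_map_column(raw)
parsed = parse_map_of_struct(cleaned, ["duedate", "principal"])
print(parsed)
```

Values are kept as strings here; in Hive the declared struct field types decide how each piece is interpreted.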

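Note: I have not yet found a statement for changing the delimiters of a map column. The workaround used here was to modify the metadata and then reload the Hive partitions.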
