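Hive Concepts and Operations

Data Warehouse

Concept: a data warehouse is an integrated platform for data analysis

Database: supports business operations and transactions (a database is under heavy read pressure)

Data warehouse: supports analysis and informs business decisions

Note: a database and a data warehouse are different things; a data warehouse exists mainly for analysis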

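Characteristics

It does not produce data: the data comes from many sources, databases included
It does not consume data: the analysis results are handed to data applications

Main features

Subject-oriented: organized around what needs to be analyzed
Integrated: once the analysis subject is fixed, data from every source related to that subject is extracted, transformed and loaded (ETL) into a uniform, clean, regular format and placed under that subject in the warehouse
Non-volatile (not updatable): a warehouse is a platform for analyzing data, not for creating it. It looks for patterns in data, and everything in it is historical data that already exists, used for offline analysis
Time-variant: the data in the warehouse changes periodically over time, at a pace tied to how often analysis runs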

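Differences between OLTP and OLAP

OLTP (T for transaction): Online Transaction Processing. Transaction-oriented and business-oriented; this is what we usually call a relational database (note: not a non-relational database)

OLAP (A for analytical): Online Analytical Processing. Analysis-oriented; this is what we call a data analysis warehouse, for example Apache Hive or Apache Impala

Note: the data warehouse was never meant to replace the database; it exists for analysis. A database is mandatory, while a data warehouse is set up only if the company needs one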

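Layered architecture of a data warehouse

Because a warehouse neither produces nor consumes data, it is layered according to how data flows in and out

ODS: temporary storage layer; usually not used directly for analysis; its data comes from the various data sources
DW: data warehouse layer, also called the detail layer; its data comes from the ODS layer through ETL, is organized by subject, and is generally uniform, clean and regular
DA: data application layer; the final consumer of the data, which comes from DW

Why use layers?

Decoupling and step-by-step execution, which lowers the risk of failures

Trade space for time: several steps are spent so that the data finally used can be served efficiently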

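Apache Hive

How should we understand the statement that Hive is a data warehouse built on Hadoop?

1. Data storage: Hadoop HDFS
2. Data analysis: Hadoop MapReduce

Hive maps a structured data file to a table and lets the user write SQL, which Hive translates into MapReduce programs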

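Installing and Deploying Hive

metadata: data that describes other data; Hive metadata describes the mapping between tables and files

Metastore: the metadata service. It manages the metadata mappings and exposes a single service interface, so every client reaches the metadata through the metastore

Three metastore modes

Embedded mode: stores metadata in the embedded Derby database; no separate metastore service needs to be started

Local mode: stores metadata in an external database
1. Characteristics: a third-party database such as MySQL stores the metadata; the metastore service does not need to be configured or started separately, because every Hive service that starts embeds its own metastore service
2. Configuration
  
  Whether local mode is in effect is decided by whether hive.metastore.uris is empty
- Prerequisite: MySQL must be installed and running, with access permissions in place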
  
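  a) Install MySQL with yum
  
  yum install mysql mysql-server mysql-devel
  
  b) Initialize MySQL and set the root password and privileges
  
  /etc/init.d/mysqld start
  
  mysql    // enter the mysql command line
  
  UPDATE mysql.user SET Password=PASSWORD('123456') WHERE user='root';    // set the password
  
  FLUSH PRIVILEGES;    // reload the grant tables
  
  Press Ctrl+C to leave the mysql command line
```
mysql -u root -p
# press Enter, type the password set above, then run the statement below inside mysql
# it allows root to manage MySQL remotely
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123456' WITH GRANT OPTION;
```
  c) Start the MySQL service
  
  service mysqld start|stop    start or stop MySQL once
  
  chkconfig mysqld on    start MySQL automatically at boot
Detailed configuration of Hive local mode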

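Unpack and install Hive

Go into the unpacked hive directory and edit the configuration files

Rename the file to hive-env.sh and add the export line. Create hive-site.xml yourself: choose the XML format, copy the content from the lecture notes, and tidy the formatting in Notepad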

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://node02:3306/hive211?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123456</value>
  </property>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://node02:9083</value>
  </property>
  <property>
    <name>hive.server2.thrift.bind.host</name>
    <value>node02</value>
  </property>
</configuration>

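Starting bin/hive fails at this point. A look at the lib directory shows why: the MySQL driver is missing. Put the driver JAR into the lib directory of the Hive installation

Drawbacks of local mode

Every client has to know the MySQL password

Every service that starts embeds its own metastore service, which wastes resources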

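Remote mode:

Characteristics: a third-party MySQL stores the metadata, and the metastore has to be configured and started separately
Configuration:

On top of the local-mode configuration, add the parameter hive.metastore.uris (without it, Hive defaults to local mode)

Edit hive-site.xml and put the property at the end of the configuration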
```
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://node02:9083</value>
</property>
```
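Startup: the metastore service must first be started by hand, on its own; only then can the hive client connect
```
bin/hive --service metastore
```
On the other machines every other setting can be removed, except the one that tells the client where to find the metastore
Usage: bin/hive can connect from any machine
First- and second-generation Hive clients

First generation: bin/hive, outdated

Second generation: bin/beeline; it works together with the hiveserver2 service and does not use the metastore directly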

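Configuration:

Add one setting to hive-site.xml (the host on which the hiveserver2 service runs)

  <property><name>hive.server2.thrift.bind.host</name><value>node02</value></property>

Both the metastore and hiveserver2 services have to be started; they can be started in the background

nohup /export/servers/hive/bin/hive --service metastore &
nohup /export/servers/hive/bin/hive --service hiveserver2 &

Start the client with bin/beeline

Connect with !connect jdbc:hive2://node01:10000

Start both services in the background

Startup order: hadoop -> mysql -> metastore -> hiveserver2, and finally connect with the beeline client. Services can be started in the foreground or in the background

After startup there are several RunJar processes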

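Basic Hive Operations

Mapping

How is a structured data file on HDFS mapped to a Hive table?

The file must sit in the directory of the table (under the database and table name), or an external table can be created that points to its path
The column order and types in the table definition must match the file
A delimiter has to be specified when the table is created; there is also a default delimiter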
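
The three rules above can be pictured with a small Python model. This is only a sketch of the idea, not Hive code: the schema, the sample lines and the function name `map_line` are made up for illustration, and `None` stands in for Hive's NULL.

```python
# Hypothetical schema: (column name, Python type standing in for the Hive type)
SCHEMA = [("id", int), ("name", str), ("age", int)]


def map_line(line, schema=SCHEMA, delimiter=","):
    """Split one text line on the delimiter and cast each part by position."""
    parts = line.rstrip("\n").split(delimiter)
    row = {}
    for i, (col, cast) in enumerate(schema):
        try:
            row[col] = cast(parts[i])
        except (IndexError, ValueError):
            row[col] = None  # a missing or uncastable value shows up as NULL
    return row


good = map_line("1,allen,18")
# The file is delimited with ^A (\x01) but the table expects ',':
wrong_delimiter = map_line("1\x01allen\x0118")
print(good)
print(wrong_delimiter)
```

With the wrong delimiter the whole line lands in the first column, the cast to int fails, and every column comes back empty, which is the typical symptom of a delimiter mismatch.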

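The data of the three tables is stored on HDFS; the mappings are stored in MySQL

To check whether a MapReduce job is running, open the web UI on port 8088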

hadoop fs -chmod -R 755 /

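DDL (Data Definition Language)

Data types

Requirements:

Column types and order must match the structured file

Besides SQL types, Hive also supports Java types; type names are case-insensitive

Besides primitive types, Hive also supports complex types (map, array); how complex types are parsed depends on the delimiters that are specified

How Hive reads data and maps it

First an InputFormat reads the data; the default is TextInputFormat (one line at a time)

Then it calls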

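Specifying the delimiter

Syntax:

ROW FORMAT DELIMITED | SERDE
ROW FORMAT: marks the start of the delimiter specification. If no delimiter is specified, the default delimiter is used. To type it into a file, press Ctrl+V and then Ctrl+A (a blue ^A appears); the table can then be created with just create table t_m(id int,name string,age int); and the mapping succeeds
DELIMITED: use the built-in default class to split the data
SERDE: use another class to split the data

Exercise: first create a table

Then upload a file you created into the directory of that table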

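Internal and external tables

They differ in where the data lives

Internal table: the structured data files of the table are under the default path configured for Hive

External table: the structured data files can be anywhere on HDFS; the path has to be given explicitly with location
They differ in what a drop removes

Dropping an internal table also deletes its files

Dropping an external table only deletes the table definition in Hive; the files stay (use an external table if the files will be needed again)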
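
The difference in drop behavior can be imitated on a local file system. The sketch below is a toy model, assuming a dict as the "metastore" and temporary directories as table locations; the table names and `drop_table` are hypothetical and nothing here talks to Hive.

```python
import os
import shutil
import tempfile


def drop_table(metastore, name):
    """Remove the table definition; delete the files only for an internal table."""
    table = metastore.pop(name)
    if not table["external"]:
        shutil.rmtree(table["location"])


root = tempfile.mkdtemp()
metastore = {}
for name, external in [("t_internal", False), ("t_external", True)]:
    location = os.path.join(root, name)
    os.makedirs(location)
    with open(os.path.join(location, "data.txt"), "w") as f:
        f.write("1,allen\n")
    metastore[name] = {"location": location, "external": external}

drop_table(metastore, "t_internal")
drop_table(metastore, "t_external")

internal_files_left = os.path.exists(os.path.join(root, "t_internal"))
external_files_left = os.path.exists(os.path.join(root, "t_external", "data.txt"))
print(internal_files_left, external_files_left)

shutil.rmtree(root)  # clean up the temporary directory
```

Both definitions are gone from the dict afterwards, but only the external table's file is still on disk.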

Partitions

When a table has many files underneath it, a query has to scan the whole table, which is slow. How can this be optimized?

Creating a partitioned table

create table t_user_p(id int,name string,country string) row format delimited fields terminated by "," partitioned by (country string); -- wrong
-- order matters: the partition clause comes first, the delimiter clause after it
create table t_user_p(id int,name string,country string) partitioned by (country string) row format delimited fields terminated by "," ; -- wrong
-- the partition column must not be a column that already exists in the table
create table t_user_p(id int,name string,country string) partitioned by (guojia string) row format delimited fields terminated by "," ; -- correct

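Loading a partitioned table

Load the data with the load command and state the partition while loading

load data local inpath '/root/hivedata/china.txt' into table t_user_p partition(guojia='zhongguo');
load data local inpath '/root/hivedata/usa.txt' into table t_user_p partition(guojia='usa');

Partitions optimize queries that would otherwise scan the whole table

The partition clause comes first, then the delimiter clause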
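
Why partitions help can be seen in a small Python model: each partition is a directory named `guojia=<value>`, and a query that filters on the partition column only reads the matching directory. The sample rows and the function `select_where` are invented for this sketch.

```python
# partition directory name -> rows stored in the files of that directory
partitions = {
    "guojia=zhongguo": [(1, "zhangsan", "china"), (2, "lisi", "china")],
    "guojia=usa": [(3, "allen", "usa")],
}


def select_where(partitions, guojia=None):
    """Return (rows, scanned directories); skip directories that cannot match."""
    scanned, rows = [], []
    for directory, data in partitions.items():
        value = directory.split("=", 1)[1]
        if guojia is not None and value != guojia:
            continue  # pruned: this directory is never read
        scanned.append(directory)
        # the partition value is appended as an extra column of every row
        rows.extend(row + (value,) for row in data)
    return rows, scanned


rows, scanned = select_where(partitions, guojia="usa")
all_rows, all_scanned = select_where(partitions)
print(scanned, rows)
```

The partition column is not stored in the data files; it comes from the directory name, which is also why it cannot repeat a column that already exists in the table.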

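Buckets

Syntax

clustered by xxx into N buckets
Splits the files into several parts according to one column

How are rows assigned?

If xxx is a string: xxx.hashcode % N; the remainder is the part the row goes to

If xxx is numeric: xxx % N; the remainder is the part the row goes to
Bucketing reduces the number of Cartesian-product combinations in join queries; it is a table-level optimization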
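
The assignment rule can be tried out in Python. This is a rough sketch under two assumptions that are not stated in these notes: the string hash behaves like Java's `String.hashCode` (shown here for ASCII text), and the hash is made non-negative before taking the remainder. The function names are made up.

```python
def java_string_hash(s):
    """Java-style String.hashCode for ASCII text, as a signed 32-bit int."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    return h - 0x100000000 if h >= 0x80000000 else h


def bucket_of(value, num_buckets):
    """Numbers use the value itself, strings use their hash; then take % N."""
    h = value if isinstance(value, int) else java_string_hash(value)
    return (h & 0x7FFFFFFF) % num_buckets  # assumption: clear the sign bit first


for sno in (95001, 95002, 95003, 95004):
    print(sno, "-> bucket", bucket_of(sno, 4))
print("abc -> bucket", bucket_of("abc", 4))
```

Rows with the same column value always land in the same bucket, which is what lets a join compare bucket against bucket instead of every row against every row.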

Creating a bucketed table

create table stu_buck(Sno int,Sname string,Sex string,Sage int,Sdept string) clustered by(Sno)
into 4 buckets
row format delimited
fields terminated by ',';

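Loading data into a bucketed table

-- the real way to load a bucketed table: load it indirectly
set hive.enforce.bucketing = true;  -- enable bucketing
set mapreduce.job.reduces=4;        -- number of buckets, same as in the table definition
-- 1. create a temporary ordinary table
create table student(Sno int,Sname string,Sex string,Sage int,Sdept string)
row format delimited
fields terminated by ',';
-- 2. load the structured data file into the temporary ordinary table
hadoop fs -put students.txt /user/hive/warehouse/itcast.db/student
-- 3. insert into the bucketed table through a bucketed query (insert + select)
insert overwrite table stu_buck select * from student cluster by(Sno);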

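Using a bucketed table

It is used exactly like an ordinary table; only the underlying files are split

Summary of bucketed tables

Main benefit: fewer Cartesian-product combinations in join queries
Bucketing is an optional optimization chosen when the table is created
The bucketing column must be a column that already exists in the table
Bucketing is not enabled by default and has to be switched on manually
In the end, bucketing means that what used to be one complete file is split into several parts according to a rule

The principle is very important, make sure you understand it!

Which one is the client, and which one is the server?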

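DML (Data Manipulation Language)

Altering a table

Adding a partition:

1. Create the partition directory on HDFS with mkdir:

/user/hive/warehouse/it.db/t_user_p/guojia=japan

2. Register the partition directory in the Hive table information with a SQL statement (run in the client)

alter table t_user_p add partition(guojia='japan') location '/user/hive/warehouse/it.db/t_user_p/guojia=japan' ;

3. Create a file japan.txt and put it into the partition directory

Show commands

show tables
show partitions table_name (tables in companies are usually partitioned). If the table is not partitioned, the statement fails with "error while processing statement" (a logical error). "error while compiling statement" is a compile-time error, usually a SQL syntax error
desc formatted table_name    shows the table information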

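DML operations

load:

Loads data into a table; Hive does not transform the data in any way

This is the officially recommended way, in preference to hadoop fs -put

In essence it moves a file from the local file system or from HDFS into the HDFS directory of the Hive table

Create the table: create table

Load: load data local inpath '/root/hivedata/stu.txt' into table stu

The statement is issued from the client, but it loads a file that is on the server side

Add -r when deleting a directory

Loading from the local file system is really a copy, equivalent to hadoop fs -put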
```
load data local inpath '/root/hivedata/students.txt' into table stu_locall;
```
Loading from HDFS is really a move: the file disappears from its original HDFS location
```
load data inpath '/students.txt' into table stu_no_locall;
```
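
The copy-versus-move behavior of the two load forms can be imitated with ordinary files. The sketch below uses temporary directories to stand in for the local disk, HDFS and the warehouse; the file names and `load_data` are hypothetical, and no Hive or HDFS call is involved.

```python
import os
import shutil
import tempfile


def load_data(src, table_dir, local):
    """local=True copies the file (like put); local=False moves it."""
    os.makedirs(table_dir, exist_ok=True)
    if local:
        shutil.copy(src, table_dir)
    else:
        shutil.move(src, table_dir)


root = tempfile.mkdtemp()
local_file = os.path.join(root, "local_students.txt")
hdfs_file = os.path.join(root, "hdfs_students.txt")
for path in (local_file, hdfs_file):
    with open(path, "w") as f:
        f.write("95001,allen\n")

load_data(local_file, os.path.join(root, "warehouse", "stu_local"), local=True)
load_data(hdfs_file, os.path.join(root, "warehouse", "stu_no_local"), local=False)

local_source_kept = os.path.exists(local_file)
hdfs_source_kept = os.path.exists(hdfs_file)
loaded = sorted(os.listdir(os.path.join(root, "warehouse", "stu_no_local")))
print(local_source_kept, hdfs_source_kept, loaded)

shutil.rmtree(root)  # clean up the temporary directory
```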
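insert

In Hive, insert is used together with select: the rows inserted into the table are the result of the query that follows

Note: the columns returned by the query must match the target table in order and type. If they do not, Hive tries to convert them, which may or may not work: on success the data is shown, on failure the value is null

Multi-insert: one scan, several inserts
```
-- format:
from source_table
insert overwrite table test_insert1
select id
insert overwrite table test_insert2
select name;
```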

Multi-level partitioned tables: a further partition is made inside an existing partition, so the levels are nested. Two levels are the most common today

Create a table with two partition columns

create table t_user_double_p(id int,name string, country string) partitioned by (guojia string, sheng string) row format delimited fields terminated by ',';

Load

load data local inpath '/root/hivedata/china.txt' into table t_user_double_p partition(guojia='zhongguo', sheng='beijing');
load data local inpath '/root/hivedata/china.txt' into table t_user_double_p partition(guojia='zhongguo', sheng='shanghai');
load data local inpath '/root/hivedata/usa.txt' into table t_user_double_p partition(guojia='meiguo', sheng='niuyue');

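Dynamic partitions

If the partition value is hard-coded by hand when the data is loaded, it is a static partition; otherwise it is a dynamic partition

Working with dynamic partitions

Enable dynamic partitioning (you can also check whether it is enabled)

set hive.exec.dynamic.partition=true;
Run set hive.exec.dynamic.partition; to check whether it is enabled

Set the dynamic partition mode (nonstrict here)

strict     strict mode: at least one partition must be hard-coded
nonstrict  non-strict mode: every partition may be dynamic
set hive.exec.dynamic.partition.mode=nonstrict;

Inserting data with dynamic partitions (the query returns 3 fields: the first is the real column of the table, and the last two determine the partition values)

insert overwrite table d_p_t partition (month,day)
select ip,substr(day,1,7) as month,day from dynamic_partition_table;
-- wrong: dynamic partition values are taken by position
select substr(day,1,7) as month,day,ip from dynamic_partition_table;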

Requirement:
Insert the data of dynamic_partition_table into the matching partitions of the target table d_p_t, according to the date (day).
Source table:
create table dynamic_partition_table(day string,ip string)row format delimited fields terminated by ",";
load data local inpath '/root/hivedata/dynamic_partition_table.txt' into table dynamic_partition_table;
2015-05-10,ip1
2015-05-10,ip2
2015-06-14,ip3
2015-06-14,ip4
2015-06-15,ip1
2015-06-15,ip2
Target table:
create table d_p_t(ip string) partitioned by (month string,day string);
Static partition:
load data local inpath ip1 into table d_p_t partition(month="2015-05",day ="10");
Dynamic insert:
insert overwrite table d_p_t partition (month,day)
select ip,substr(day,1,7) as month,day from dynamic_partition_table;
-- wrong: dynamic partition values are taken by position
select substr(day,1,7) as month,day,ip from dynamic_partition_table;
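
That the partition values are picked by position, not by column name, can be checked with a small Python model of the example above. The function `dynamic_insert` is made up for this sketch; `day[:7]` plays the role of `substr(day,1,7)`.

```python
from collections import defaultdict

# rows of dynamic_partition_table as (day, ip)
source = [
    ("2015-05-10", "ip1"), ("2015-05-10", "ip2"),
    ("2015-06-14", "ip3"), ("2015-06-14", "ip4"),
    ("2015-06-15", "ip1"), ("2015-06-15", "ip2"),
]


def dynamic_insert(select_rows, num_partition_cols=2):
    """The last N selected values pick the partition; the rest are the data."""
    partitions = defaultdict(list)
    for row in select_rows:
        data, part = row[:-num_partition_cols], row[-num_partition_cols:]
        partitions[part].append(data)
    return dict(partitions)


# select ip, substr(day,1,7) as month, day
correct = dynamic_insert([(ip, day[:7], day) for day, ip in source])
# select substr(day,1,7) as month, day, ip  (the wrong column order)
wrong = dynamic_insert([(day[:7], day, ip) for day, ip in source])

for part, data in correct.items():
    print(part, data)
print(len(correct), "partitions vs", len(wrong), "partitions")
```

With the wrong order the model creates one partition per (day, ip) pair and stores the month as the data, even though the aliases look right.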

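Exporting table data

Write the result of a select query to the file system (existing content is overwritten)
```
insert overwrite local directory '/'
select * from stu_tmp;

insert overwrite directory '/aaa/test'
select * from stu_tmp;
```
Note: overwrite replaces whatever is already in the target directory, so use it with care

"local" always refers to the machine the Hive server runs on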

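select (frequently used)

Bucketed query: split the data into several parts by a given column

Rows are divided by the given column and, by default, sorted in ascending order on that column

------ number of buckets not specified

select * from student cluster by(Sno);  -- by default the number is estimated automatically from the input size
Logs of server-side execution go to nohup.out under the ~ directory; watch them with tail -f nohup.out

------- number of buckets specified

-- set the number of buckets
set mapreduce.job.reduces=2;
Number of reduce tasks not specified. Defaulting to jobconf value of: 2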

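Requirement: divide by one column and sort by another (divide by student number, sort by age in descending order)

select * from student DISTRIBUTE by(Sno) sort by(sage desc);
distribute by does the dividing and sort by does the sorting, here on two different columns; when both use the same column, the pair is equivalent to cluster by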
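
What distribute by and sort by do together can be modelled in Python: rows are first routed to a reducer by one column, then each reducer sorts its own rows by another. The sample (Sno, Sage) pairs and `distribute_sort` are invented for this sketch, and routing by `Sno % reducers` is a simplification.

```python
# hypothetical (Sno, Sage) rows
students = [(95001, 20), (95002, 19), (95003, 22),
            (95004, 19), (95005, 18), (95006, 23)]


def distribute_sort(rows, num_reducers, dist_key, sort_key, reverse=False):
    """Route each row to a reducer by dist_key, then sort inside each reducer."""
    reducers = [[] for _ in range(num_reducers)]
    for row in rows:
        reducers[dist_key(row) % num_reducers].append(row)
    return [sorted(part, key=sort_key, reverse=reverse) for part in reducers]


parts = distribute_sort(
    students, 2,
    dist_key=lambda r: r[0],   # distribute by Sno
    sort_key=lambda r: r[1],   # sort by Sage
    reverse=True,              # desc
)
for i, part in enumerate(parts):
    print("reducer", i, part)
```

Each part is sorted on its own, but the output as a whole is not; a total order is what order by is for.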

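order by: global sort

No matter how many reduce tasks are configured, compilation settles on exactly one, because only a single reduce task can produce a global order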

select * from student order by sage;

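join in Hive

inner join: shows only the rows that match on both sides; rows without a match are not shown

left join: the left table is the reference and all of its rows are shown; the right table is matched against it, with the matched values where there is a match and null where there is none

right join: same principle as left join, with the sides swapped

outer join: shows all the results

left semi join: an inner join that shows only the data of the left table

cross join: a Cartesian product; use with care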
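
The join types can be compared side by side on two tiny lists. This is a plain-Python sketch with made-up rows that join on the first field; it illustrates the semantics only and says nothing about how Hive executes a join.

```python
left = [(1, "a"), (2, "b"), (3, "c")]    # hypothetical (id, name)
right = [(1, "x"), (1, "y"), (4, "z")]   # hypothetical (id, value)


def inner_join(l, r):
    return [(a, b) for a in l for b in r if a[0] == b[0]]


def left_join(l, r):
    out = []
    for a in l:
        matches = [b for b in r if b[0] == a[0]]
        if matches:
            out.extend((a, b) for b in matches)
        else:
            out.append((a, None))  # no match: right side is NULL
    return out


def left_semi_join(l, r):
    # only left rows, each at most once, if any right row matches
    return [a for a in l if any(b[0] == a[0] for b in r)]


def cross_join(l, r):
    return [(a, b) for a in l for b in r]


print(inner_join(left, right))
print(left_join(left, right))
print(left_semi_join(left, right))
print(len(cross_join(left, right)))
```

The left semi join returns the left row with id 1 once even though two right rows match it, which is where it differs from an inner join that projects only the left columns.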

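Hive Parameter Configuration

Shell command line

bin/hive connects to the metastore service to access Hive

Parameters can be added at startup to run Hive SQL

bin/hive -e "xxx"       starts, runs the SQL statement that follows, then exits
bin/hive -f xxxx.sql    starts, runs the SQL script that follows, then exits
These two ways of starting hive are the usual scheduling methods in production

Logs when starting Hive services

-- start without log output
/export/servers/hive/bin/hive --service metastore
-- start with logging to watch the startup process
/export/servers/hive/bin/hive --service metastore --hiveconf hive.root.logger=DEBUG,console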

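Parameter configuration in Hive

Ways to configure

conf/hive-site.xml under the installation directory: global, affects every way of starting this installation
--hiveconf k1=v1: valid for the started process; whoever starts it with the setting gets it
set k1=v1: valid for the connected session; whoever connects and sets it gets it, and it expires when the session ends

Priority

set > hiveconf > hive-site.xml > hive-default.xml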
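
The priority order reads naturally as a chain of lookups: the first layer that defines a key wins. Python's `collections.ChainMap` behaves this way, so it makes a compact model; the keys and values below are placeholders, not real Hive settings.

```python
from collections import ChainMap

# placeholder settings, one dict per configuration layer
hive_default = {"k1": "default", "k2": "default", "k3": "default", "k4": "default"}
hive_site = {"k1": "site", "k2": "site", "k3": "site"}
hiveconf = {"k1": "hiveconf", "k2": "hiveconf"}
session_set = {"k1": "set"}

# lookup order: set > hiveconf > hive-site.xml > hive-default.xml
effective = ChainMap(session_set, hiveconf, hive_site, hive_default)

for key in ("k1", "k2", "k3", "k4"):
    print(key, "=", effective[key])

# when the session ends, its set values disappear and the next layer shows through
after_session = ChainMap(hiveconf, hive_site, hive_default)
print("k1 after session:", after_session["k1"])
```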

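As a Hadoop-based data warehouse, Hive also loads the Hadoop configuration files into its own runtime

Dummy table test

Create a table named dual
```
create table dual(id string);
```
Create dual.txt containing a single space

Use load to map it to the table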
```
load data local inpath '/root/hivedata/dual.txt' into table dual;
```
It can now be queried with select

Exercises with the dummy table

select 1 from dual where 1=1;
select 4 & 8 from dual;
select 4 | 8 from dual;
• Rounding function: round (rounds half up)
select round(3.5) from dual;  -- result: 4
• Rounding to a given precision: round
select round(3.1415926,4) from dual;  -- result: 3.1416
• Round down: floor
• Round up: ceil
select ceil(3.1415926) from dual;  -- result: 4
• Round up: ceiling
• Random number: rand
select rand() from dual;
select rand(int n) from dual;  -- when a seed n is given, the generated value is stable
Date and time:
select unix_timestamp() from dual;  -- current system timestamp
select from_unixtime(1323308943,'yyyyMMdd') from dual;  -- convert a timestamp to a date string
select year('2011-12-08 10:03:01') from dual;  -- get the year

Conditional function (key point): case

Conditional function: CASE

Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END

Return type: T

Description: if a equals b, return c; if a equals d, return e; otherwise return f

Example:

hive> Select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from dual;

mary

Timestamp: the number of seconds from 1970-01-01 until now

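UDF Development

Hive functions

Built-in functions
User-defined functions

Categories of user-defined functions

UDF: ordinary user-defined function, one row in, one row out

UDAF: aggregate function, many rows in, one row out

UDTF: user-defined table-generating function, one row in, many rows out

Development steps for an ordinary user-defined function:

Extend the UDF class

Overload the evaluate method (the logic of the custom function is the logic of this method)

Package it as a JAR
Add the JAR to the classpath (run this in the client)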
```
add jar /root/hivedata/example_utf-1.0-SNAPSHOT.jar;
```
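Register the custom function (it is a temporary function and has to be registered again next time), binding the function name to the class in the JAR
```
create temporary function itcastfunc as 'cn.hive.utf.idcastUDF'; -- itcastfunc is the custom function name
```
The function can now be called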
```
select itcastfunc('ABC') from dual;
```

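Advanced features of Hive functions

UDTF function: explode

T: table-generating function; one row in, many rows out
explode accepts an array or a map

A string can first be split into an array with split

explode(array): every element of the array becomes a row

explode(map): every key-value pair becomes a row, with the key in one column and the value in another
Using explode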

001,allen,usa|china|japan,1|3|7
002,kobe,usa|england|japan,2|3|5

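create table test_message(id int,name string,location array<string>,city array<int>)
row format delimited fields terminated by ","
collection items terminated by '|';
load data local inpath "/root/hivedata/test_message.txt" into table test_message;

Using explode

select explode(location) from test_message;
To also get the name that goes with each location:
select name,explode(location) from test_message;  -- fails
When a UDTF is used, Hive only allows access to the exploded field.
name belongs to the real table while explode(location) belongs to a virtual table, so the two cannot be selected together

lateral view (joins the virtual table to the real table)

This syntax exists to be used together with UDTFs

A UDTF turns one row into several, which amounts to producing a new table; that table is virtual and has no link to the original one
```
-- the AS clause names the generated column; without it Hive assigns a default name
select name,xubiao.* from test_message lateral view explode(location) xubiao as xubiaoziduan;
```
To select data from both the original table and the virtual table in one select, use lateral view to join the real table with the virtual one
```
real_table lateral view UDTF(xxxx) view_alias as col1, col2
```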
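
In Python terms, lateral view explode is a nested loop: for every row of the real table, emit one output row per element of the array, carrying the other columns along. The sketch below uses the two sample rows of test_message; the function name is made up.

```python
# rows of test_message as (id, name, location, city)
test_message = [
    (1, "allen", ["usa", "china", "japan"], [1, 3, 7]),
    (2, "kobe", ["usa", "england", "japan"], [2, 3, 5]),
]


def lateral_view_explode(rows, keep_index, array_index):
    """Pair the kept column of each row with every element of its array column."""
    return [
        (row[keep_index], item)
        for row in rows
        for item in row[array_index]
    ]


# select name, xubiao.* from test_message lateral view explode(location) xubiao as ...
result = lateral_view_explode(test_message, keep_index=1, array_index=2)
for name, location in result:
    print(name, location)
```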

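Row and column conversion in Hive

Important

Both conversions are standard interview questions

Type conversion function in Hive

Type conversion

cast(column as data_type)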
select cast(12 as double) from dual;

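Using row and column conversion

Multiple rows to a single column

a) concat_ws(arg1, arg2): concatenates strings; arg1 is the separator, arg2 is the content to join
b) collect_set(col3): collects the values of a column with duplicates removed and produces an array; use collect_list() to keep duplicates

Create the table:
create table row2col_1(col1 string,col2 string,col3 int)
row format delimited
fields terminated by ',';
Load the data:
load data local inpath '/root/hivedata/row2col_1.txt' into table row2col_1;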
a,b,1
a,b,2
a,b,3
c,d,4
c,d,5
c,d,6


select col1, col2, concat_ws('|', collect_set(cast(col3 as string))) as col3
from row2col_1
group by col1, col2;
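
The same grouping can be traced in Python to see what each function contributes: the dict groups rows like group by, the de-duplicated values play the role of collect_set, and `join` plays the role of concat_ws. The function name is made up; element order is kept here for readability, which this sketch does not claim for collect_set itself.

```python
# rows of row2col_1 as (col1, col2, col3)
rows = [
    ("a", "b", 1), ("a", "b", 2), ("a", "b", 3),
    ("c", "d", 4), ("c", "d", 5), ("c", "d", 6),
]


def rows_to_column(rows, sep="|"):
    """Group by (col1, col2), collect distinct col3 values, join them with sep."""
    groups = {}
    for col1, col2, col3 in rows:
        # dict keys act as an insertion-ordered set; str() mirrors cast(col3 as string)
        groups.setdefault((col1, col2), {})[str(col3)] = None
    return [(c1, c2, sep.join(values)) for (c1, c2), values in groups.items()]


result = rows_to_column(rows)
for line in result:
    print(line)
```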

Single column to multiple rows

The explode function and lateral view

Note: explode only accepts an array or a map; if the column has another type, find a way to convert it to an array first

select explode(split(col3,",")) from col2row_2;

Create the table:
Option 1:
create table col2row_2(col1 string,col2 string,col3 Array<string>)
row format delimited
fields terminated by '\t'
collection items terminated by ',';
Option 2:
create table col2row_2(col1 string,col2 string,col3 string)
row format delimited
fields terminated by '\t';
Load the data (columns are tab-separated):
load data local inpath '/root/hivedata/col2row_2.txt' into table col2row_2;
a  b   1,2,3
c  d   4,5,6
Single column to multiple rows:
select explode(split(col3,",")) from col2row_2;  -- for a table created with option 2
select explode(col3) from col2row_2;  -- for a table created with option 1

reflect function

It can call any method that already exists in Java

-- create the Hive table
create table test_udf(col1 int,col2 int) row format delimited fields terminated by ',';
-- prepare the data: test_udf.txt
1,2
4,3
6,4
7,5
5,6
-- load the data
load data local inpath '/root/hivedata/test_udf.txt'  into table test_udf;
-- use max from java.lang.Math to get the larger of the two columns
select reflect("java.lang.Math","max",col1,col2) from test_udf;

Difference: an aggregate such as select max(col1) finds the maximum over one column

  reflect finds the maximum within each row

Utility packages can be used as well
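
The column-wise versus row-wise distinction is easy to see on the sample data in plain Python. This only mirrors the two results; it does not call Hive or Java.

```python
# rows of test_udf as (col1, col2)
test_udf = [(1, 2), (4, 3), (6, 4), (7, 5), (5, 6)]

# like reflect("java.lang.Math","max",col1,col2): one result per row
row_max = [max(col1, col2) for col1, col2 in test_udf]

# like select max(col1): one result for the whole column
col1_max = max(col1 for col1, _ in test_udf)

print(row_max)
print(col1_max)
```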

select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123");