Hive DML: aggregation, grouping, built-in functions, and static/dynamic partitioned tables

DML: query-related statements

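desc xxx;
desc formatted xxxx;
select * from xxxx;   -- specific columns can be listed instead of *; at work you normally list the columns
select * from xxxx where xx = xx;
select * from xxxx where sal between 800 and 16000;   -- add limit to cap the number of rows returned
select * from xxxx where ename in ("xxx", "zzz");   -- in matches any value in the list (here strings, e.g. names); not in matches values outside the list
select * from xxxx where comm is not null;   -- != means "not equal"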

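When you process logs later on, many of them will be malformed, so queries have to account for the different cases.
These basic queries do not launch a MapReduce job.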

Hive is a data warehouse built on top of Hadoop
sql ==> Hive ==> MapReduce

Aggregate functions: max min sum avg count

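Grouping: every column in the select list must appear either in the group by clause or inside an aggregate function
Average salary per department: select deptno, avg(sal) from ruoze_emp group by deptno;
Highest salary per department and job: select deptno, job, max(sal) from ruoze_emp group by deptno, job;
Departments whose average salary is greater than 2000:
select deptno, avg(sal) from ruoze_emp group by deptno having avg(sal) > 2000;
where filters individual rows before grouping; having filters the groups after aggregation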
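
The difference between where and having can be tried out without a cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption: this is SQLite, not Hive, and the six employee rows are made-up sample data), because both engines follow the same rule here.

```python
import sqlite3

# Hypothetical sample rows standing in for ruoze_emp: (ename, deptno, sal).
rows = [
    ("SMITH", 20, 800.0),
    ("ALLEN", 30, 1600.0),
    ("KING", 10, 5000.0),
    ("CLARK", 10, 2450.0),
    ("FORD", 20, 3000.0),
    ("JAMES", 30, 950.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table ruoze_emp (ename text, deptno int, sal real)")
conn.executemany("insert into ruoze_emp values (?, ?, ?)", rows)

# having only: groups are built from all rows, then filtered by their average.
high_avg_all = conn.execute(
    "select deptno, avg(sal) from ruoze_emp "
    "group by deptno having avg(sal) > 2000 order by deptno"
).fetchall()

# where + having: rows with sal <= 900 are dropped before the groups are built.
high_avg_filtered = conn.execute(
    "select deptno, avg(sal) from ruoze_emp where sal > 900 "
    "group by deptno having avg(sal) > 2000 order by deptno"
).fetchall()

print(high_avg_all)
print(high_avg_filtered)
```

Department 20 only passes the having test in the second query, because the where clause removed its lowest salary before the average was computed.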

case when ... then ... works like if-else
If a condition holds, the matching value is returned

select ename, sal,
case
when sal>1 and sal<=1000 then 'LOWER'
when sal>1000 and sal<=2000 then 'MIDDLE'
when sal>2000 and sal<=4000 then 'HIGH'
ELSE 'HIGHEST' end
from ruoze_emp;

This is useful when producing reports
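
To make the branch order of the case expression explicit, here is a small Python sketch of the same banding logic. The function name salary_band is hypothetical. It also shows a detail that is easy to miss: any salary that matches none of the when branches, including a missing value or one that is 1 or less, ends up in the else branch.

```python
def salary_band(sal):
    """Mirror the case expression: the first branch whose condition is true wins."""
    if sal is not None and 1 < sal <= 1000:
        return "LOWER"
    if sal is not None and 1000 < sal <= 2000:
        return "MIDDLE"
    if sal is not None and 2000 < sal <= 4000:
        return "HIGH"
    # else branch: reached by large salaries, by sal <= 1, and by missing values
    return "HIGHEST"


for sal in (800, 1500, 3000, 5000, None):
    print(sal, salary_band(sal))
```

If a missing or non-positive salary should not be reported as 'HIGHEST', an extra when branch in front of the others would be needed.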

Using union all

select count(1) from ruoze_emp where deptno=10
union all
select count(1) from ruoze_emp where deptno=20;

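This is used a lot in data-skew scenarios
a = a1 union all a2
Table a is skewed: split it into a1 (the skewed part) and a2 (the non-skewed part)
Compute the results for a1 and a2 separately, storing each in a temporary table, then combine them with union all to get the final result
In short: separate the normal data from the abnormal data, process each on its own, and merge the results at the end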
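
The split-then-union idea can be sketched in plain Python. Everything here is illustrative: the key names, the hot_keys set, and the helper count_by_key are assumptions, and in Hive the two halves would be separate queries writing to temporary tables.

```python
from collections import Counter


def count_by_key(rows):
    """Aggregate one part of the table: number of rows per key."""
    return sorted(Counter(key for key, _ in rows).items())


# Hypothetical skewed table a: key "k1" dominates.
a = [("k1", 1)] * 6 + [("k2", 1), ("k3", 1), ("k2", 1)]
hot_keys = {"k1"}  # assumed to be known in advance

a1 = [row for row in a if row[0] in hot_keys]      # skewed part
a2 = [row for row in a if row[0] not in hot_keys]  # non-skewed part

# "union all": simply concatenate the two partial results.
final = count_by_key(a1) + count_by_key(a2)
print(final)
```

Concatenating is enough because a1 and a2 share no keys; if a key could appear in both parts, the partial results would need one more aggregation after the union.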

Finding Hive functions:
On the official Hive wiki: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
Hive website > Hive wiki > Operators and UDFs lists the functions and the operators
In the Hive CLI (opened from Xshell):

>show functions;   -- lists all built-in functions that Hive supports
To see how one specific function is used:
>desc function extended max;   -- put the function name at the end

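Date and time functions
Call them with select plus the function; see the official docs for the details (covered at about the 40-minute mark of the hive3 video)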

Hive function cast: type conversion

cast(value as TYPE)
select comm, cast(comm as int) from ruoze;

If the conversion fails, the result is null
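
The "failure gives null instead of an error" behaviour is the part worth remembering, so here is a loose Python imitation of it. The helper cast_int is hypothetical and only models the failure path; it makes no claim about how Hive converts edge cases such as decimal strings.

```python
def cast_int(value):
    """Return value as an int, or None when it cannot be converted.

    Only the "bad input becomes NULL" behaviour is modelled here.
    """
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


print(cast_int("300"))  # a clean numeric string converts
print(cast_int("abc"))  # an unparseable string gives None, not an exception
print(cast_int(None))   # a missing value stays missing
```

Because failures are silent, it is worth counting the nulls after a cast when loading messy log data.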

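substr
select substr("adfghhs", 2, 3) from ruoze;
Starting at the second character, take three characters (positions are counted from 1)

substring (a synonym for substr)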

concat joins two or more strings (or values) into one
select concat("ruoze", "jeson") from ruoze;
concat_ws
select concat_ws(".", "192", "162", "135") from ruoze;
Joins the values with a dot as the separator; this function is used a lot at work

length: the length of a string or of a column value
select length("192.168.0.20.0") from ruoze;
select ename, length(ename) from ruoze;
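
For readers who think in Python, the three string functions above map onto slicing, join, and len. The helpers hive_substr and concat_ws are hypothetical names written for this sketch; hive_substr only handles a positive start position.

```python
def hive_substr(s, start, length):
    """1-based substring: start=2 means the second character."""
    return s[start - 1:start - 1 + length]


def concat_ws(sep, *parts):
    """Join the parts with the separator between them."""
    return sep.join(parts)


print(hive_substr("adfghhs", 2, 3))        # dfg
print(concat_ws(".", "192", "162", "135"))  # 192.162.135
print(len("192.168.0.20.0"))                # 14
```

The main trap is the index base: Python slices count from 0, while substr counts from 1.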

explode turns each element of an array into a row of its own

1,doudou,chemistry:physics:math:chinese
2,dasheng,chemistry:math:biology:physiology:health
3,rachel,chemistry:chinese:english:pe:biology

create table ruoze_student(
id int,
name string,
subjects array<string>   -- array type
)row format delimited fields terminated by ','
COLLECTION ITEMS TERMINATED BY ':';   -- the elements inside the collection are separated by ':' in our data

Load the data, which contains a collection column

load data local inpath '/home/hadoop/data/student.txt' into table ruoze_student;

Now remove the duplicates from the collection elements

select distinct s.sub from (select explode(subjects) as sub from ruoze_student) s;
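
What explode followed by distinct does to the rows can be traced by hand in Python. The three lines below are sample rows in the same shape as the student file (subject names written in English for this sketch), and the parsing mirrors the two delimiters declared in the table definition.

```python
lines = [
    "1,doudou,chemistry:physics:math:chinese",
    "2,dasheng,chemistry:math:biology:physiology:health",
    "3,rachel,chemistry:chinese:english:pe:biology",
]


def parse(line):
    """Fields are split on ',', collection items on ':'."""
    student_id, name, subjects = line.split(",")
    return int(student_id), name, subjects.split(":")


table = [parse(line) for line in lines]

# explode: one output row per array element.
exploded = [sub for _, _, subjects in table for sub in subjects]

# distinct: keep each subject once (sorted only to make the output stable).
distinct_subjects = sorted(set(exploded))

print(len(exploded), exploded)
print(len(distinct_subjects), distinct_subjects)
```

Three input rows become fourteen exploded rows, which then collapse to nine distinct subjects.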

Interview question: implement wordcount with Hive

First create a table

create table ruoze_wc(
sentence string
);

Load the data

load data local inpath '/home/hadoop/data/student.txt' into table ruoze_wc;

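Based on this table we do the wordcount.
Step 1: split each line into an array
select split(sentence, ",") from ruoze_wc;
Step 2: explode each array into one row per word
select explode(split(sentence, ",")) from ruoze_wc;
Step 3: group by word and count

select word, count(1) as c
from
(
select explode(split(sentence, ",")) as word from ruoze_wc
) t   -- the alias t is required: a subquery needs an alias even though it is never referenced
group by word
order by c desc;   -- sort by count, highest first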
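
The three steps of the Hive wordcount (split, explode, group and count) correspond one to one to the Python sketch below. The input sentences are made up, and the tie-break by word is an addition of this sketch so that the output order is deterministic.

```python
from collections import Counter

# Hypothetical contents of ruoze_wc, one sentence per row.
sentences = ["hello,world,hello", "hive,hello", "world,hive,hive"]

# Step 1: split each sentence into an array.
arrays = [sentence.split(",") for sentence in sentences]

# Step 2: explode the arrays into one row per word.
words = [word for array in arrays for word in array]

# Step 3: group by word, count, and order by count descending.
counts = sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))

print(counts)
```

The subquery in the Hive version plays the role of the words list: it has to exist as a named intermediate result before the grouping can run over it.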

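Partitioned table: a table that is split into partitions by some column

Why partition at all?

Requirement: fetch the data between 22:00 and 23:59 on 2018-10-21
starttime > 201810212200 and starttime < 201810212359

access.log: a very large table that holds the data of every day

Without partitions, the query reads the table and scans all of it, which performs very poorly
So tables like this are normally partitioned
Path where the table is stored:
/user/hive/warehouse/access/d=20181021   -- d is the daily partition
This saves a lot of IO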
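
Why a daily partition saves IO can be shown with a toy model in Python. The warehouse dictionary, its timestamps, and the query function are all hypothetical; the point is only that a filter on the partition column lets whole directories be skipped before any row is read.

```python
# Hypothetical layout: partition directory -> starttime values stored in it.
warehouse = {
    "access/d=20181020": [201810200915, 201810202230],
    "access/d=20181021": [201810210800, 201810212215, 201810212340],
    "access/d=20181022": [201810220101],
}


def query(day, lo, hi):
    """Return (matching rows, number of rows scanned).

    When day is given, every other partition directory is skipped.
    """
    hits, scanned = [], 0
    for path, rows in warehouse.items():
        if day is not None and path != f"access/d={day}":
            continue  # partition pruned: nothing in it is read
        for starttime in rows:
            scanned += 1
            if lo < starttime < hi:
                hits.append(starttime)
    return hits, scanned


full_scan = query(None, 201810212200, 201810212359)
pruned = query("20181021", 201810212200, 201810212359)
print(full_scan)
print(pruned)
```

Both calls return the same rows, but the pruned one touches only the rows of a single day.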

Creating a partitioned table

create table order_partition(
order_Number string,
event_time string
)PARTITIONED BY(event_month string)   -- partition column
row format delimited fields terminated by '\t';
Load the data
load data local inpath '/home/hadoop/data/order.txt' into table order_partition PARTITION (event_month='2014-05');   -- when loading into a partitioned table, the partition must be specified

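Note:

Check the log
cd   # as the root user
cd /tmp/hadoop
ls
tail -200f hive.log
The load failed here
Fix: changing the MySQL settings does not alter tables that already exist, so the character set of the existing tables has to be converted
Shut Hive down first
Switch to the MySQL database
mysql> paste the statements below into the database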

use ruoze_d5;
alter table PARTITIONS convert to character set latin1;
alter table PARTITION_KEYS convert to character set latin1;

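Then restart Hive

hive> load data local inpath '/home/hadoop/data/order.txt' into table order_partition PARTITION (event_month='2014-05');
Load again; this time the data goes in

Then take a look on HDFS
hadoop fs -ls /user/hive/warehouse   # the listing looks different now
A partition directory is named <partition column>=<partition value>; you need to know this directory layout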

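When querying a partitioned table in Hive, put the partition column in the where clause, otherwise the query still scans everything
select * from order_partition where event_month='2014-05';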

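Now an experiment: create a partition directory directly on HDFS
hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-06
Its partition value differs from the one created earlier; we put the data file from event_month='2014-05' into this new directory
Then go back to Hive and query: select * from order_partition where event_month='2014-06';
Nothing is found, because the metastore does not know about the new directory
hive> msck repair table order_partition;
Query again and the data is there
However, avoid this command: it refreshes all partitions of the table and is very slow, so never use it in production
Use the command below instead; in production always do it this way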

alter table order_partition add partition(event_month='2014-07');

List the partitions of the table
show partitions order_partition;
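
The experiment above comes down to one fact: Hive answers queries from the partitions recorded in its metastore, not from whatever directories exist on HDFS. The toy model below illustrates that under heavy simplification. The two sets and the three functions are hypothetical, and the cost of msck repair is reduced to "number of directories listed".

```python
# Toy model: directories that exist on HDFS vs. partitions the metastore knows.
hdfs_dirs = {"event_month=2014-05", "event_month=2014-06"}
metastore = {"event_month=2014-05"}


def visible(partition):
    """A query only sees partitions that are registered in the metastore."""
    return partition in metastore


def add_partition(partition):
    """Like alter table ... add partition: registers exactly one partition."""
    metastore.add(partition)


def msck_repair():
    """Like msck repair table: walks every directory and registers the missing ones."""
    listed = 0
    for directory in sorted(hdfs_dirs):
        listed += 1
        metastore.add(directory)
    return listed


before = visible("event_month=2014-06")  # directory exists, partition unknown
add_partition("event_month=2014-06")
after = visible("event_month=2014-06")

hdfs_dirs.add("event_month=2014-07")
listed = msck_repair()  # has to look at all three directories to find one new one
print(before, after, listed)
```

With thousands of partitions the repair walk grows with the size of the table, while add partition stays a single metadata update.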

Create another table; in production, multi-level partitions are what gets used

create table order_mulit_partition(
orderNumber string,
event_time string
)PARTITIONED BY(event_month string, step string)   -- multi-level partitioning means several partition columns; this table has two levels
row format delimited fields terminated by '\t';

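How is data loaded into this table?
load data local inpath '/home/hadoop/data/order.txt' into table order_mulit_partition PARTITION (event_month='2014-05', step='1');   -- the partition spec must match the partition columns declared above

Single-level and multi-level partitions ==> static partitioning: when loading data, every partition column must be given a value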
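
The directory that a static partition spec maps to follows mechanically from the declared partition columns. The helper partition_path below is a hypothetical illustration of that rule: one column=value directory level per partition column, in declaration order, and an error when a value is missing.

```python
def partition_path(table_dir, partition_columns, spec):
    """Build the partition directory for a fully specified static partition."""
    missing = [col for col in partition_columns if col not in spec]
    if missing:
        raise ValueError(f"static partition spec is missing: {missing}")
    levels = [f"{col}={spec[col]}" for col in partition_columns]
    return "/".join([table_dir] + levels)


table_dir = "/user/hive/warehouse/order_mulit_partition"
columns = ["event_month", "step"]

path = partition_path(table_dir, columns, {"step": "1", "event_month": "2014-05"})
print(path)
```

The order of the levels comes from the table definition, not from the order in which the values are written in the spec.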

show create table ruoze_emp;
This prints the statement that created the table, as shown below

CREATE TABLE `ruoze_emp`(`empno` int, `ename` string, `job` string, `mgr` int, `hiredate` string, `sal` double, `comm` double, `deptno` int)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

Then we create a partitioned table based on it

CREATE TABLE `ruoze_emp_partition`(`empno` int, `ename` string, `job` string, `mgr` int, `hiredate` string, `sal` double, `comm` double)
partitioned by(`deptno` int)   -- deptno is the partition column; a partition column must not also appear in the regular column list
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';

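What is the requirement?
Use the department number as the partition column and write the whole table into the partitions

The previous (static) approach:

insert into table ruoze_emp_partition PARTITION(deptno=10) select empno,ename,job,mgr,hiredate,sal,comm from ruoze_emp where deptno=10;   -- select * cannot be used here; the columns have to be listed one by one because the partition column is not part of the regular columns

Suppose there are 1000 distinct deptno values, that is, 1000 partitions: the approach above is no longer workable. This is the drawback of static partitioning, and dynamic partitioning solves it: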

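insert overwrite table ruoze_emp_partition PARTITION(deptno)   -- only the column name is written here, without a value
select empno,ename,job,mgr,hiredate,sal,comm,deptno from ruoze_emp;   -- the partition column must come last in the select list; with two partition columns, both go at the end in the matching order

This raises an error
>set hive.exec.dynamic.partition.mode=nonstrict;   -- the error message itself tells you to run this statement
To apply the setting globally, configure it in hive-site.xml
Run the insert again and it works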
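
What the dynamic insert does with each row can be sketched in a few lines of Python. The rows are made-up and shortened to (empno, ename, sal, deptno), and insert_dynamic is a hypothetical helper; it only shows the routing rule, namely that the last selected column decides the partition and is not stored among the regular columns.

```python
from collections import defaultdict

# Hypothetical result of the select: regular columns first, partition column last.
rows = [
    (7369, "SMITH", 800.0, 20),
    (7499, "ALLEN", 1600.0, 30),
    (7782, "CLARK", 2450.0, 10),
    (7839, "KING", 5000.0, 10),
]


def insert_dynamic(rows):
    """Route each row to a partition named after the value of its last column."""
    partitions = defaultdict(list)
    for *data, deptno in rows:
        partitions[f"deptno={deptno}"].append(tuple(data))
    return dict(partitions)


partitions = insert_dynamic(rows)
for name in sorted(partitions):
    print(name, partitions[name])
```

One statement covers every department value present in the data, which is what makes the 1000-partition case manageable.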