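Table of Contents

Apache Hive DML Statements and Functions
- I. Hive SQL DML Syntax: Loading Data
  - (1) Mastering the Hive SQL LOAD Statement
    - What does "local" mean?
  - (2) Mastering the Hive SQL INSERT Statement
- II. Hive SQL DML Syntax: Querying Data
  - (1) Introduction to Hive SQL SELECT Syntax
  - (2) select_expr and ALL/DISTINCT: Returning Results and Removing Duplicates
  - (3) WHERE Filtering
  - (4) Aggregation
  - (5) GROUP BY Grouping
    - GROUP BY syntax restrictions
  - (6) HAVING: Filtering After Grouping
  - (7) ORDER BY Sorting
  - (8) LIMIT
  - (9) SELECT Execution Order
  - (10) Difference Between HAVING and WHERE
- III. Hive SQL JOIN Queries
  - (1) Mastering Hive SQL JOIN Statements
    - inner join
    - left join
- IV. Using Functions in Hive SQL
  - (1) Mastering Common Hive SQL Functions
    - Common built-in functions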

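Apache Hive DML Statements and Functions

I. Hive SQL DML Syntax: Loading Data

(1) Mastering the Hive SQL LOAD Statement

"Load" means to load data into a table.

Loading places a data file at the location that corresponds to a Hive table. When the file comes from the local file system it is copied; when it comes from HDFS it is moved.

"Pure copy" and "pure move" mean that Hive does not transform or otherwise process the data while loading it into the table.

What does "local" mean?

When the command is run against a HiveServer2 service, the local file system is the file system of the machine where HiveServer2 runs, not that of the client. The source file must therefore be on that server. For example, a client that connects remotely to HiveServer2 on node1 loads "local" files from node1.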

Method 1:

--step1: create the table
--create student_local to demonstrate loading from the local file system
create table student_local(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';

--the file path must be a quoted string
0: jdbc:hive2://master:10000> load data local inpath '/root/hivedata/students.txt' into table tyh.student_local;
--the output shows that the data came from a local file
--Loading data to table tyh.student_local from file:/root/hivedata/students.txt
0: jdbc:hive2://master:10000> select * from tyh.student_local;

--create student_HDFS to demonstrate loading from HDFS
create table student_HDFS(num int,name string,sex string,age int,dept string) row format delimited fields terminated by ',';

0: jdbc:hive2://master:10000> load data inpath '/students.txt' into table tyh.student_hdfs;
--the output shows that the data came from an HDFS file
--Loading data to table tyh.student_hdfs from hdfs://master:8020/students.txt
0: jdbc:hive2://master:10000> select * from tyh.student_hdfs;
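
LOAD DATA with INTO appends to whatever the table already holds. As a minimal sketch of the general form, the statement also accepts an optional OVERWRITE keyword that replaces the existing contents. The table and path are the ones used in this section; running the statements twice is only an illustrative assumption.

```sql
-- General form (optional parts in brackets):
-- LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename;

-- Append: running this statement twice leaves two copies of the rows in the table.
load data local inpath '/root/hivedata/students.txt' into table tyh.student_local;

-- Replace: the existing contents of the table are removed before the file is loaded.
load data local inpath '/root/hivedata/students.txt' overwrite into table tyh.student_local;

-- Compare the row count after each variant.
select count(*) from tyh.student_local;
```

OVERWRITE is the safer choice when a load may be re-run, because it avoids duplicated rows.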

Method 2:

[root@master ~]# mkdir hivedata
[root@master ~]# cd hivedata
[root@master hivedata]# vim 1.txt
[root@s3 ~]# /export/server/apache-hive-3.1.2-bin/bin/beeline
0: jdbc:hive2://master:10000> create table t_1(id int,name string,year int) row format delimited fields terminated by ",";
[root@master hivedata]# hadoop fs -put 1.txt /user/hive/warehouse/t_1
0: jdbc:hive2://master:10000> select * from t_1;
+---------+-----------+-----------+
| t_1.id  | t_1.name  | t_1.year  |
+---------+-----------+-----------+
| 1       | ttt       | 11        |
| 2       | zhangsan  | 12        |
| 3       | lisi      | 13        |
+---------+-----------+-----------+

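(2) Mastering the Hive SQL INSERT Statement

insert + select: the rows returned by the query are inserted into the specified table.

The number of columns returned by the query must match the number of columns in the target table.
If a returned column's data type differs from that of the corresponding target column, Hive attempts a conversion. The conversion is not guaranteed to succeed, and values that fail to convert become NULL.

--step1: create a source table named student
create table student(num int,name string,sex string,age int,dept string)
row format delimited fields terminated by ',';

--step2: load the data
load data local inpath '/root/hivedata/students.txt' into table student;
select * from student;

--step3: create a target table with only two fields
create table student_from_insert(sno int,sname string);

--use insert + select to insert data into the new table
insert into table student_from_insert select num,name from student;
select * from student_from_insert;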
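
The conversion rule described above can be seen in isolation with an explicit cast, which shows the same succeed-or-NULL outcome. This is only a sketch: the literal values are illustrative assumptions, not data taken from students.txt.

```sql
-- A string that holds a valid integer converts cleanly.
select cast('95001' as int);

-- A string that is not a number cannot be converted, so the result is NULL.
select cast('abc' as int);
```

Checking the target table for unexpected NULL values after an insert + select is a quick way to spot columns that were mapped to the wrong type.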

II. Hive SQL DML Syntax: Querying Data

(1) Introduction to Hive SQL SELECT Syntax

--create the table t_usa_covid19
drop table if exists t_usa_covid19;
CREATE TABLE t_usa_covid19(count_date string,county string,state string,fips int,cases int,deaths int)
row format delimited fields terminated by ",";

--load the data into the path that corresponds to t_usa_covid19
load data local inpath '/root/hivedata/us-covid19-counties.dat' into table t_usa_covid19;

--query the table contents
select * from t_usa_covid19;

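(2) select_expr and ALL/DISTINCT: Returning Results and Removing Duplicates

--1. select_expr
--query all fields or specific fields
--tip: select the table name and press Ctrl+Q to display the table's field information
select * from t_usa_covid19;
select county, cases, deaths from t_usa_covid19;
--query a constant: the result is unrelated to the fields in the table
select 1 from t_usa_covid19;
--query the current database; the from keyword can be omitted
select current_database();

--2. ALL and DISTINCT
--return all matching rows
select state from t_usa_covid19;
--equivalent to
select all state from t_usa_covid19;
--return all matching rows with duplicates removed
select distinct state from t_usa_covid19;
--distinct on multiple fields compares the combination of the fields as a whole
select distinct county,state from t_usa_covid19;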

(3) WHERE Filtering

--3. WHERE clause
select * from t_usa_covid19 where 1 > 2;  -- 1 > 2 evaluates to false
select * from t_usa_covid19 where 1 = 1;  -- 1 = 1 evaluates to true

--find the epidemic data for the state of California
select * from t_usa_covid19 where state = 'California';
--use a function in the where condition: find the rows whose state name is longer than 10 characters
select * from t_usa_covid19 where length(state) >10 ;

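(4) Aggregation

SQL has many built-in functions for counting and calculation. The syntax is: SELECT function(column) FROM table
Aggregate functions include COUNT, SUM, MAX, MIN, and AVG.
The defining feature of an aggregate function is that, no matter how many rows the input has, the aggregation returns a single row, which is the aggregated result.

--4. aggregation
--use as to give a returned column an alias
select county as tyh from t_usa_covid19;
--count how many counties there are in the United States
select count(county) as county_cnts from t_usa_covid19;
--remove duplicates with distinct
select count(distinct county) as county_cnts from t_usa_covid19;
--count the counties in California
select count(county) from t_usa_covid19 where state = "California";
--total deaths in Texas
select sum(deaths) from t_usa_covid19 where state = "Texas";
--the highest confirmed case count reported by any county in the United States
select max(cases) from t_usa_covid19;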
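
As a small sketch to complement the queries above, the statements below contrast count(*) with count(column) and exercise the remaining aggregate functions named in this section. Whether t_usa_covid19 actually contains NULL county values is not stated in the source, so the two counts may well be equal on this data set.

```sql
-- count(*) counts every row, including rows where county is NULL.
select count(*) as all_rows from t_usa_covid19;

-- count(county) counts only the rows where county is not NULL.
select count(county) as non_null_counties from t_usa_covid19;

-- Several aggregates can be computed in one pass; the result is still a single row.
select min(cases) as min_cases, max(cases) as max_cases, avg(cases) as avg_cases
from t_usa_covid19;
```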

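(5) GROUP BY Grouping

The GROUP BY clause is used together with aggregation; it groups the result set by one or more columns.

GROUP BY syntax restrictions:

In a query that uses GROUP BY, every field in select_expr must be either a GROUP BY grouping field or a field passed to an aggregate function.
Reason: this avoids the ambiguity of one field holding multiple values within a group.
1. A grouping field in select_expr is never ambiguous, because the grouping is based on that field, so its value is the same for every row in a group.
2. A field passed to an aggregate function is not ambiguous either, because an aggregate function takes many rows in and returns one result.

--fields passed to aggregate functions
select *
from t_usa_covid19;

--group by state and count the counties in each state
select count(county) from t_usa_covid19 where count_date = "2021-01-28" group by state;

--show which state each count belongs to
select state,count(county) as county_nums from t_usa_covid19 where count_date = "2021-01-28" group by state;
--also show the total deaths in each state
select state,count(county),sum(deaths) from t_usa_covid19 where count_date = "2021-01-28" group by state;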
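
The GROUP BY restriction is easiest to see with a query that breaks it. The sketch below shows one rejected query (left commented out) and two possible ways to repair it; the aliases cnts and max_county are illustrative names, not part of the original example.

```sql
-- Rejected: county is neither a grouping field nor passed to an aggregate function,
-- so each state group could hold many different county values.
-- select state, county, count(county) from t_usa_covid19
-- where count_date = "2021-01-28" group by state;

-- Fix 1: add county to the grouping fields (one output row per state and county).
select state, county, count(county) as cnts from t_usa_covid19
where count_date = "2021-01-28" group by state, county;

-- Fix 2: pass county to an aggregate function (one output row per state).
-- max on a string column returns the value that sorts last.
select state, max(county) as max_county, count(county) as cnts from t_usa_covid19
where count_date = "2021-01-28" group by state;
```

Which fix is appropriate depends on the question being asked: Fix 1 changes the granularity of the result, while Fix 2 keeps one row per state.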

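(6) HAVING: Filtering After Grouping

The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.

HAVING lets us filter the groups produced by grouping, and aggregate functions may be used in it, because by that point WHERE and GROUP BY have already run and the result set of each group is fixed.

--6. having
--find the states with more than 10000 deaths on 2021-01-28
--wrong: aggregate functions cannot be used in the where clause, so this statement fails with a syntax error
select state,sum(deaths) from t_usa_covid19 where count_date = "2021-01-28" and sum(deaths) >10000 group by state;

--correct: where filters before grouping, group by then forms the groups, and once each group's result set is fixed having filters the groups
select state,sum(deaths) from t_usa_covid19 where count_date = "2021-01-28" group by state having sum(deaths) > 10000;
--better: the aggregate is already computed during group by, so having can filter on the result through its alias instead of computing it again
select state,sum(deaths) as cnts from t_usa_covid19 where count_date = "2021-01-28" group by state having cnts> 10000;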

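(7) ORDER BY Sorting

The ORDER BY clause sorts the result set by the specified columns.
ORDER BY sorts in ascending order (ASC) by default. To sort in descending order, use the DESC keyword.

--7. order by
--sort the returned rows by confirmed cases in ascending order
select * from t_usa_covid19;
select * from t_usa_covid19 order by cases;
--when no sort direction is given, the default is asc (ascending)
select * from t_usa_covid19 order by cases asc;
--return the California counties sorted by deaths in descending order
select * from t_usa_covid19 where state = "California" order by deaths desc;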

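(8) LIMIT

LIMIT restricts the number of rows returned by a SELECT statement.
LIMIT accepts one or two numeric arguments, both of which must be non-negative integer constants.
With two arguments, the first is the offset of the first row to return and the second is the maximum number of rows to return. With a single argument, it is the maximum number of rows and the offset is 0.

--8. limit
--no limit: return all records for California on 2021-01-28
select * from t_usa_covid19 where count_date = "2021-01-28" and state ="California";
--return the first 5 rows of the result set
select * from t_usa_covid19 where count_date = "2021-01-28" and state ="California" limit 5;
--return 3 rows starting from the third row of the result set
select * from t_usa_covid19 where count_date = "2021-01-28" and state ="California" limit 2,3;
--note: the offset (first argument) starts from 0, so an offset of 2 skips the first two rows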
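
One common use of the two-argument form is paging through a result set. The sketch below assumes a page size of 5 and adds an order by so that the pages come back in a stable order; it also assumes that county is unique within one date and state, which the source does not confirm.

```sql
-- Page 1: rows 1 to 5 (offset 0).
select * from t_usa_covid19
where count_date = "2021-01-28" and state = "California"
order by county
limit 0,5;

-- Page 2: rows 6 to 10 (offset = page size * (page number - 1) = 5).
select * from t_usa_covid19
where count_date = "2021-01-28" and state = "California"
order by county
limit 5,5;
```

Without an order by, the rows that land on each page are not guaranteed to be the same from one run to the next.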

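(9) SELECT Execution Order

During a query the clauses are evaluated in this order: from > where > group by (including aggregation) > having > select > order by > limit
Aggregate functions (sum, min, max, avg, count) are evaluated before the having clause.
The where clause is evaluated before the aggregate functions (sum, min, max, avg, count).

--execution order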
select state,sum(deaths) as cnts from t_usa_covid19
where count_date = "2021-01-28"
group by state
having cnts> 10000
limit 2;
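
To make the roles of the individual clauses concrete, here is the same query again as a sketch with an added order by, annotated with what each clause contributes. The order by clause is an addition for illustration and is not part of the original example.

```sql
-- from:     read the rows of t_usa_covid19
-- where:    keep only the rows for 2021-01-28 (runs before any aggregation)
-- group by: form one group per state and compute sum(deaths) for each group
-- having:   keep only the groups whose total exceeds 10000
-- order by and limit: sort the remaining groups and return at most 2 of them
select state, sum(deaths) as cnts from t_usa_covid19
where count_date = "2021-01-28"
group by state
having sum(deaths) > 10000
order by cnts desc
limit 2;
```

With the sort in place, the two rows returned are the two states with the highest totals rather than an arbitrary pair.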

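(10) Difference Between HAVING and WHERE

having filters data after grouping
where filters data before grouping
aggregate functions can be used after having
aggregate functions cannot be used after where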

III. Hive SQL JOIN Queries

(1) Mastering Hive SQL JOIN Statements

JOIN syntax is used to query and combine data from two or more tables based on the relationship between their columns.
The two most used and most important joins in Hive are:

inner join and left join

--prepare the environment
--table1: employee table
CREATE TABLE employee(id int,name string,deg string,salary int,dept string) row format delimited
fields terminated by ',';

--table2: employee home address table
CREATE TABLE employee_address (id int,hno string,street string,city string
) row format delimited
fields terminated by ',';

--table3: employee contact information table
CREATE TABLE employee_connection (id int,phno string,email string
) row format delimited
fields terminated by ',';

--load data into the tables
load data local inpath '/root/hivedata/employee.txt' into table employee;
load data local inpath '/root/hivedata/employee_address.txt' into table employee_address;
load data local inpath '/root/hivedata/employee_connection.txt' into table employee_connection;

select *
from employee;

select *
from employee_address;

select *
from employee_connection;

inner join

The inner join is the most common join, also called a plain join. The inner keyword can be omitted: inner join == join.
Only rows that match the join condition in both tables are kept.

--1. inner join
--e is an alias for employee
select e.id,e.name,e_a.city,e_a.street
from employee e inner join employee_address e_a
on e.id =e_a.id;

--equivalent: inner join = join
select e.id,e.name,e_a.city,e_a.street
from employee e join employee_address e_a
on e.id =e_a.id;

--equivalent: implicit join notation
select e.id,e.name,e_a.city,e_a.street
from employee e , employee_address e_a
where e.id =e_a.id;

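left join

left join is also called the left outer join. The outer keyword can be omitted; left outer join is the older spelling.
The key to left join is the word "left": it refers to the table on the left side of the join keyword, called the left table for short.
The join takes all rows of the left table as the baseline and matches the right table against them. Every left-table row is returned; right-table columns show their values where a match exists and null where it does not.

--2. left join
select e.id,e.name,e_conn.phno,e_conn.email
from employee e left join employee_connection e_conn
on e.id =e_conn.id;

--equivalent: left outer join
select e.id,e.name,e_conn.phno,e_conn.email
from employee e left outer join  employee_connection e_conn
on e.id =e_conn.id;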
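
Because unmatched left-table rows come back with null in the right-table columns, a left join can also be used to find rows that have no partner. The following is a minimal sketch using the tables above; whether employee.txt actually contains employees without contact information is not stated in the source.

```sql
-- Employees that have no row in employee_connection:
-- the left join keeps them, with NULL in every e_conn column,
-- and the where clause then keeps only those unmatched rows.
select e.id, e.name
from employee e left join employee_connection e_conn
on e.id = e_conn.id
where e_conn.id is null;
```

The same query written with inner join would always return an empty result, since an inner join drops unmatched rows before the where clause runs.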

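IV. Using Functions in Hive SQL

(1) Mastering Common Hive SQL Functions

Overview

Hive has many built-in functions that meet different user needs and make SQL faster to write:

1. Use show functions to list all functions currently available.

2. Use describe function extended funcname to see how a function is used.

show functions;
describe function extended count;

Classification

Hive functions fall into two broad categories: built-in functions and user-defined functions (UDFs).

1. Built-in functions can be divided into numeric functions, date functions, string functions, collection functions, conditional functions, and so on.
2. User-defined functions fall into 3 types according to the number of input and output rows: UDF, UDAF, and UDTF.

Classification of user-defined functions

By the number of input and output rows:

UDF (User-Defined Function): ordinary function, one row in, one row out

UDAF (User-Defined Aggregation Function): aggregate function, many rows in, one row out

UDTF (User-Defined Table-Generating Function): table-generating function, one row in, many rows out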
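
The three row-count patterns can be observed without writing any user-defined function, because Hive's built-in functions follow the same patterns. The sketch below uses the built-ins length, count, and explode as stand-ins for the three types, and assumes the student table created earlier in this article.

```sql
-- One row in, one row out (UDF pattern): one result per input row.
select name, length(name) from student limit 3;

-- Many rows in, one row out (UDAF pattern): all rows collapse into a single result.
select count(*) from student;

-- One row in, many rows out (UDTF pattern): one array value expands into one row per element.
select explode(split('apache hive', ' '));
```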

Common built-in functions

------------String Functions------------
select length("itcast");
select reverse("itcast");
select concat("angela","baby");
--string concatenation with a separator: concat_ws(separator, [string | array(string)]+)
select concat_ws('.', 'www', array('itcast', 'cn'));
--substring: substr(str, pos[, len]) or substring(str, pos[, len])
select substr("angelababy",-2); --pos is a 1-based index; a negative pos counts from the end
select substr("angelababy",2,2); --2 characters starting at position 2; the result is ng
--split a string: split(str, regex)
--split cuts a string and returns an array; elements are accessed by index, which starts at 0
select split('apache hive', ' ');
select split('apache hive', ' ')[0];
select split('apache hive', ' ')[1];

----------- Date Functions -----------------
--get the current date: current_date
select current_date();
--get the current UNIX timestamp: unix_timestamp
select unix_timestamp();
--convert a date to a UNIX timestamp: unix_timestamp
select unix_timestamp("2011-12-07 13:01:03");
--convert a date in a specified format to a UNIX timestamp: unix_timestamp
select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss');
--convert a UNIX timestamp to a date: from_unixtime
select from_unixtime(1618238391);
select from_unixtime(0, 'yyyy-MM-dd HH:mm:ss');
--date difference: datediff; dates must be in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd'
select datediff('2012-12-08','2012-05-09');
--add days to a date: date_add
select date_add('2012-02-28',10);
--subtract days from a date: date_sub
select date_sub('2012-01-1',10);

----Mathematical Functions-------------
--rounding: round returns the integer part as a double (rounds half up)
select round(3.1415926);
--rounding to a precision: round(double a, int d) returns a double rounded to d decimal places
select round(3.1415926,4);
--random number: rand returns a random number between 0 and 1, different on every run
select rand();
--seeded random number: rand(int seed) produces a stable sequence of random numbers
select rand(3);

-----Conditional Functions------------------
--uses the student table created earlier in this article
select * from student limit 3;
--conditional: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
select if(1=2,100,200);
select if(sex ='男','M','W') from student limit 3;
--null conversion: nvl(T value, T default_value)
select nvl("allen","itcast");
select nvl(null,"itcast");
--conditional conversion: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end;
select case sex when '男' then 'male' else 'female' end from student limit 3;