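Introduction to Hive Window Functions

Overview of window functions and over

Consider this requirement first: list every employee's details together with the average salary of their department. How would you do this in MySQL?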

-- Option 1: join against a per-department aggregate
SELECT emp.*, t.avg_sal
FROM emp
JOIN (SELECT deptno, round(AVG(ifnull(sal, 0))) AS avg_sal
      FROM emp
      GROUP BY deptno) t
ON emp.deptno = t.deptno
ORDER BY emp.deptno;

-- Option 2: correlated subquery
SELECT A.*,
       (SELECT avg(ifnull(sal, 0)) FROM emp B WHERE B.deptno = A.deptno) AS avg_sal
FROM emp A;

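As this requirement shows, getting detail rows and aggregated data in one result takes **two queries** (a join against an aggregate subquery, or a correlated subquery), which is cumbersome.

Window functions make this much easier. So what is a window function?

Characteristics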

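1) Window functions, also known as windowing functions, are a kind of analytic function.
2) They are designed for complex reporting and statistics requirements.
3) A window function computes a value over a group of rows. Unlike an aggregate function, which returns a single row per group, a window function returns a value for every row in the group. Put simply, a window function opens a window for each detail row and runs the aggregate over that window.
4) The over clause specifies the size of the data window the analytic function works on; that window may change from row to row.
5) A window function is generally not used on its own: the usual form is a function followed by over().
6) Rows can also be grouped and sorted inside the window.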
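The difference between aggregate and window semantics described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Hive executes the query; the `emp` rows and the helper names `group_avg` and `window_avg` are hypothetical and do not come from the article.

```python
from collections import defaultdict

# Hypothetical sample rows (deptno, ename, sal); not taken from the article.
emp = [(10, "alice", 3000), (10, "bob", 5000), (20, "carol", 4000)]


def group_avg(rows):
    """Aggregate semantics: one result per department (like GROUP BY deptno)."""
    salaries = defaultdict(list)
    for deptno, _, sal in rows:
        salaries[deptno].append(sal)
    return {deptno: sum(v) / len(v) for deptno, v in salaries.items()}


def window_avg(rows):
    """Window semantics: every input row is kept and gains its department's average."""
    avg = group_avg(rows)
    return [(deptno, ename, sal, avg[deptno]) for deptno, ename, sal in rows]


aggregated = group_avg(emp)   # 2 entries, one per department
windowed = window_avg(emp)    # 3 rows, one per employee
```

The aggregate collapses three employees into two department rows, while the window version keeps all three rows and attaches the department average to each.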

See the figure below:

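**Note:** Older MySQL versions do not support window functions; they are supported as of MySQL 8.0. Oracle and Hive both support window functions.

Basic examples

Data preparation (order.txt)

Columns (tab-separated): name, purchase date, cost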
saml    2018-01-01      10
saml    2018-01-08      55
tony    2018-01-07      50
saml    2018-01-05      46
tony    2018-01-04      29
tony    2018-01-02      15
saml    2018-02-03      23
mart    2018-04-13      94
saml    2018-04-06      42
mart    2018-04-11      75
mart    2018-04-09      68
mart    2018-04-08      62
neil    2018-05-10      12
neil    2018-06-12      80

drop table if exists t_order;
-- 1. Create the order table:
create table if not exists t_order
(
    name      string,
    orderdate string,
    cost      int
) row format delimited fields terminated by '\t';
-- 2. Load the data:
load data local inpath "/data/order.txt" into table t_order;
select * from t_order;

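Requirement: query the details of every order together with the total number of orders

1. Without a window function

-- Query all detail rows
select * from t_order;
-- Query the total count
select count(*) from t_order;

2. With a window function: the usual form is a supported function followed by over()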

select *, count(*) over() from t_order;
+----+----------+----+--+
|name|orderdate |cost|c1|
+----+----------+----+--+
|    |NULL      |NULL|15|
|neil|2018-06-12|80  |15|
|neil|2018-05-10|12  |15|
|mart|2018-04-08|62  |15|
|mart|2018-04-09|68  |15|
|mart|2018-04-11|75  |15|
|saml|2018-04-06|42  |15|
|mart|2018-04-13|94  |15|
|saml|2018-02-03|23  |15|
|tony|2018-01-02|15  |15|
|tony|2018-01-04|29  |15|
|saml|2018-01-05|46  |15|
|tony|2018-01-07|50  |15|
|saml|2018-01-08|55  |15|
|saml|2018-01-01|10  |15|
+----+----------+----+--+

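Notes:

A window function is evaluated for every row.

If over() is given no arguments, the window defaults to the entire result set.

The row with an empty name and NULL values most likely comes from a blank line in order.txt; it is counted as well, which is why the total is 15 rather than 14.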

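Requirement: query the details of purchases made in January 2018 together with the total number of those purchases. The where filter is applied before the window is evaluated, so the window contains only the January rows. Note that count(*) counts rows, so the result is the number of orders (6), not the number of distinct customers.

select *,count(*) over() from t_order
where substring(orderdate,1,7)='2018-01';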
+----+----------+----+--+
|name|orderdate |cost|c1|
+----+----------+----+--+
|tony|2018-01-02|15  |6 |
|tony|2018-01-04|29  |6 |
|saml|2018-01-05|46  |6 |
|tony|2018-01-07|50  |6 |
|saml|2018-01-08|55  |6 |
|saml|2018-01-01|10  |6 |
+----+----------+----+--+
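To make the "filter first, then open the window" behaviour concrete, here is a minimal Python sketch that emulates `count(*) over()` on the article's 14 data rows (the blank-line row is left out). The helper name `count_over_all` is an assumption made for this illustration.

```python
# The 14 data rows from order.txt as (name, orderdate, cost).
orders = [
    ("saml", "2018-01-01", 10), ("saml", "2018-01-08", 55),
    ("tony", "2018-01-07", 50), ("saml", "2018-01-05", 46),
    ("tony", "2018-01-04", 29), ("tony", "2018-01-02", 15),
    ("saml", "2018-02-03", 23), ("mart", "2018-04-13", 94),
    ("saml", "2018-04-06", 42), ("mart", "2018-04-11", 75),
    ("mart", "2018-04-09", 68), ("mart", "2018-04-08", 62),
    ("neil", "2018-05-10", 12), ("neil", "2018-06-12", 80),
]


def count_over_all(rows):
    """Emulate count(*) over(): append the size of the whole input to every row."""
    total = len(rows)
    return [row + (total,) for row in rows]


# The WHERE filter runs first, so the window only sees January rows.
january = [row for row in orders if row[1][:7] == "2018-01"]
result = count_over_all(january)

# count(*) counts rows, not people: 6 orders placed by 2 distinct customers.
distinct_customers = len({row[0] for row in january})
```

Every January row carries the value 6, matching the `c1` column above, while `distinct_customers` is 2; counting people would need a different query.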

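The distribute by clause

Groups the rows inside the over window by one or more columns; the window for a row is then all rows in the same group.

Syntax:
over(distribute by colname[, colname ...])

Requirement: show the purchase details together with the monthly purchase total

select *,sum(cost) over(distribute by month(orderdate)) sum
from t_order;
+----+----------+----+----+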
|name|orderdate |cost|sum |
+----+----------+----+----+
|    |NULL      |NULL|NULL|
|saml|2018-01-01|10  |205 |
|tony|2018-01-02|15  |205 |
|tony|2018-01-04|29  |205 |
|saml|2018-01-05|46  |205 |
|tony|2018-01-07|50  |205 |
|saml|2018-01-08|55  |205 |
|saml|2018-02-03|23  |23  |
|mart|2018-04-13|94  |341 |
|mart|2018-04-08|62  |341 |
|mart|2018-04-09|68  |341 |
|mart|2018-04-11|75  |341 |
|saml|2018-04-06|42  |341 |
|neil|2018-05-10|12  |12  |
|neil|2018-06-12|80  |80  |
+----+----------+----+----+

Requirement: show the purchase details together with each customer's monthly purchase total

select *,sum(cost) over(distribute by month(orderdate),name)
from t_order;
+----+----------+----+----+
|name|orderdate |cost|c1  |
+----+----------+----+----+
|    |NULL      |NULL|NULL|
|saml|2018-01-01|10  |111 |
|saml|2018-01-05|46  |111 |
|saml|2018-01-08|55  |111 |
|tony|2018-01-02|15  |94  |
|tony|2018-01-04|29  |94  |
|tony|2018-01-07|50  |94  |
|saml|2018-02-03|23  |23  |
|mart|2018-04-09|68  |299 |
|mart|2018-04-11|75  |299 |
|mart|2018-04-13|94  |299 |
|mart|2018-04-08|62  |299 |
|saml|2018-04-06|42  |42  |
|neil|2018-05-10|12  |12  |
|neil|2018-06-12|80  |80  |
+----+----------+----+----+
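The two distribute by queries above differ only in the grouping key. The following Python sketch emulates that behaviour on the article's data; `sum_over_partition` and the lambda keys are hypothetical names chosen for this illustration, and the blank-line row is omitted.

```python
from collections import defaultdict

# The 14 data rows from order.txt as (name, orderdate, cost).
orders = [
    ("saml", "2018-01-01", 10), ("saml", "2018-01-08", 55),
    ("tony", "2018-01-07", 50), ("saml", "2018-01-05", 46),
    ("tony", "2018-01-04", 29), ("tony", "2018-01-02", 15),
    ("saml", "2018-02-03", 23), ("mart", "2018-04-13", 94),
    ("saml", "2018-04-06", 42), ("mart", "2018-04-11", 75),
    ("mart", "2018-04-09", 68), ("mart", "2018-04-08", 62),
    ("neil", "2018-05-10", 12), ("neil", "2018-06-12", 80),
]


def sum_over_partition(rows, key):
    """Emulate sum(cost) over(distribute by <key>): each row gets its group's total."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[2]
    return [row + (totals[key(row)],) for row in rows]


# distribute by month(orderdate)
by_month = sum_over_partition(orders, key=lambda r: r[1][5:7])
# distribute by month(orderdate), name
by_month_and_name = sum_over_partition(orders, key=lambda r: (r[1][5:7], r[0]))
```

For saml's 2018-01-01 order the first call yields 205 (all January orders) and the second yields 111 (saml's January orders only), in line with the two result tables.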

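The sort by clause

The sort by clause forces the rows in the window to be sorted. (Important: once sorting is used, the window grows row by row within the group, so an aggregate such as sum becomes a running total.)

Syntax:  over([distribute by colname] [sort by colname [desc|asc]])

Requirement: show the purchase details together with each customer's running monthly purchase total, accumulated in descending date order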

select *, sum(cost) over (distribute by month(orderdate),name sort by orderdate desc)
from t_order;
+----+----------+----+----+
|name|orderdate |cost|c1  |
+----+----------+----+----+
|    |NULL      |NULL|NULL|
|saml|2018-01-08|55  |55  |
|saml|2018-01-05|46  |101 |
|saml|2018-01-01|10  |111 |
|tony|2018-01-07|50  |50  |
|tony|2018-01-04|29  |79  |
|tony|2018-01-02|15  |94  |
|saml|2018-02-03|23  |23  |
|mart|2018-04-13|94  |94  |
|mart|2018-04-11|75  |169 |
|mart|2018-04-09|68  |237 |
|mart|2018-04-08|62  |299 |
|saml|2018-04-06|42  |42  |
|neil|2018-05-10|12  |12  |
|neil|2018-06-12|80  |80  |
+----+----------+----+----+
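The "window grows row by row" behaviour can be sketched as a running sum per group. This is a simplified emulation that assumes the sort key is unique within each group (true for this data); with duplicate sort keys Hive's default frame treats tied rows together, which this sketch does not model. The function name `running_sum` is hypothetical.

```python
from collections import defaultdict

# The 14 data rows from order.txt as (name, orderdate, cost).
orders = [
    ("saml", "2018-01-01", 10), ("saml", "2018-01-08", 55),
    ("tony", "2018-01-07", 50), ("saml", "2018-01-05", 46),
    ("tony", "2018-01-04", 29), ("tony", "2018-01-02", 15),
    ("saml", "2018-02-03", 23), ("mart", "2018-04-13", 94),
    ("saml", "2018-04-06", 42), ("mart", "2018-04-11", 75),
    ("mart", "2018-04-09", 68), ("mart", "2018-04-08", 62),
    ("neil", "2018-05-10", 12), ("neil", "2018-06-12", 80),
]


def running_sum(rows, part_key, order_key, reverse=False):
    """Emulate sum(cost) over(distribute by ... sort by ...) as a running total."""
    groups = defaultdict(list)
    for row in rows:
        groups[part_key(row)].append(row)
    out = []
    for key in sorted(groups):
        total = 0
        for row in sorted(groups[key], key=order_key, reverse=reverse):
            total += row[2]          # the window now ends at this row
            out.append(row + (total,))
    return out


# distribute by month(orderdate), name sort by orderdate desc
result = running_sum(
    orders,
    part_key=lambda r: (r[1][5:7], r[0]),
    order_key=lambda r: r[1],
    reverse=True,
)
mart_totals = [row[3] for row in result if row[0] == "mart"]  # [94, 169, 237, 299]
```

Only the last row of each group carries the full group total; earlier rows show partial sums, exactly as in the table above.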

Note: the partition by + order by combination can be used in place of distribute by + sort by

select name, orderdate, cost,
sum(cost) over (partition by name, month(orderdate) order by orderdate desc)
from t_order;

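Note: a window can also specify only an ordering. The whole table is then treated as a single group, and the aggregate accumulates row by row in the given order. This is useful when you want a running value across the entire table while still keeping the detail rows.

select *, sum(cost) over (order by orderdate desc)
from t_order;
+----+----------+----+---+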
|name|orderdate |cost|c1 |
+----+----------+----+---+
|neil|2018-06-12|80  |80 |
|neil|2018-05-10|12  |92 |
|mart|2018-04-13|94  |186|
|mart|2018-04-11|75  |261|
|mart|2018-04-09|68  |329|
|mart|2018-04-08|62  |391|
|saml|2018-04-06|42  |433|
|saml|2018-02-03|23  |456|
|saml|2018-01-08|55  |511|
|tony|2018-01-07|50  |561|
|saml|2018-01-05|46  |607|
|tony|2018-01-04|29  |636|
|tony|2018-01-02|15  |651|
|saml|2018-01-01|10  |661|
|    |NULL      |NULL|661|
+----+----------+----+---+

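The window clause

To control the window at a finer granularity, use a window clause. The common keywords are:

PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no bound, the window extends to the edge of the group
UNBOUNDED PRECEDING: from the first row of the group
UNBOUNDED FOLLOWING: to the last row of the group

A window clause usually starts with rows

Example: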

select name,
       orderdate,
       cost,
       -- sum over all rows
       sum(cost) over () as sample1,
       -- group by name, sum within the group
       sum(cost) over (partition by name) as sample2,
       -- group by name, running sum within the group
       sum(cost) over (partition by name order by orderdate) as sample3,
       -- same result as sample3 here: from the start of the group to the current row
       sum(cost) over (partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row) as sample4,
       -- current row plus the previous row
       sum(cost) over (partition by name order by orderdate rows between 1 PRECEDING and current row) as sample5,
       -- current row plus one row before and one row after
       sum(cost) over (partition by name order by orderdate rows between 1 PRECEDING AND 1 FOLLOWING) as sample6,
       -- current row and all following rows
       sum(cost) over (partition by name order by orderdate rows between current row and UNBOUNDED FOLLOWING) as sample7
from t_order;
+----+----------+----+-------+-------+-------+-------+-------+-------+-------+
|name|orderdate |cost|sample1|sample2|sample3|sample4|sample5|sample6|sample7|
+----+----------+----+-------+-------+-------+-------+-------+-------+-------+
|    |NULL      |NULL|661    |NULL   |NULL   |NULL   |NULL   |NULL   |NULL   |
|mart|2018-04-08|62  |661    |299    |62     |62     |62     |130    |299    |
|mart|2018-04-09|68  |661    |299    |130    |130    |130    |205    |237    |
|mart|2018-04-11|75  |661    |299    |205    |205    |143    |237    |169    |
|mart|2018-04-13|94  |661    |299    |299    |299    |169    |169    |94     |
|neil|2018-05-10|12  |661    |92     |12     |12     |12     |92     |92     |
|neil|2018-06-12|80  |661    |92     |92     |92     |92     |92     |80     |
|saml|2018-01-01|10  |661    |176    |10     |10     |10     |56     |176    |
|saml|2018-01-05|46  |661    |176    |56     |56     |56     |111    |166    |
|saml|2018-01-08|55  |661    |176    |111    |111    |101    |124    |120    |
|saml|2018-02-03|23  |661    |176    |134    |134    |78     |120    |65     |
|saml|2018-04-06|42  |661    |176    |176    |176    |65     |65     |42     |
|tony|2018-01-02|15  |661    |94     |15     |15     |15     |44     |94     |
|tony|2018-01-04|29  |661    |94     |44     |44     |44     |94     |79     |
|tony|2018-01-07|50  |661    |94     |94     |94     |79     |79     |50     |
+----+----------+----+-------+-------+-------+-------+-------+-------+-------+
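The rows between ... frames used for sample4 to sample7 can be emulated with list slicing. The sketch below works on a single group that is already sorted by orderdate; `sum_rows_frame` is a hypothetical helper in which `None` stands for UNBOUNDED.

```python
def sum_rows_frame(values, preceding, following):
    """Emulate sum(...) over (... rows between <preceding> and <following>).

    preceding / following are row counts relative to the current row;
    None means UNBOUNDED and 0 means the frame stops at the current row.
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        out.append(sum(values[start:end]))
    return out


# saml's costs ordered by orderdate, taken from the table above.
saml = [10, 46, 55, 23, 42]

sample4 = sum_rows_frame(saml, None, 0)   # unbounded preceding .. current row
sample5 = sum_rows_frame(saml, 1, 0)      # 1 preceding .. current row
sample6 = sum_rows_frame(saml, 1, 1)      # 1 preceding .. 1 following
sample7 = sum_rows_frame(saml, 0, None)   # current row .. unbounded following
```

The four lists reproduce saml's sample4 to sample7 columns: [10, 56, 111, 134, 176], [10, 56, 101, 78, 65], [56, 111, 124, 120, 65] and [176, 166, 120, 65, 42]. At the edges of the group the frame is simply cut short, which is why the first and last rows sum fewer values.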

Requirement: show each customer's cumulative purchase total to date

select name,orderdate,cost,sum(cost)over (partition by name order by orderdate rows between unbounded preceding and current row ) as allCount
from t_order;
+----+----------+----+--------+
|name|orderdate |cost|allcount|
+----+----------+----+--------+
|    |NULL      |NULL|NULL    |
|mart|2018-04-08|62  |62      |
|mart|2018-04-09|68  |130     |
|mart|2018-04-11|75  |205     |
|mart|2018-04-13|94  |299     |
|neil|2018-05-10|12  |12      |
|neil|2018-06-12|80  |92      |
|saml|2018-01-01|10  |10      |
|saml|2018-01-05|46  |56      |
|saml|2018-01-08|55  |111     |
|saml|2018-02-03|23  |134     |
|saml|2018-04-06|42  |176     |
|tony|2018-01-02|15  |15      |
|tony|2018-01-04|29  |44      |
|tony|2018-01-07|50  |94      |
+----+----------+----+--------+

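Requirement: for each purchase, compute the total of the customer's three most recent purchases (the current one and the two before it)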

select name,orderdate,cost,sum(cost)over (partition by name order by orderdate rows between 2 preceding  and current row ) as allCount
from t_order;
+----+----------+----+--------+
|name|orderdate |cost|allcount|
+----+----------+----+--------+
|    |NULL      |NULL|NULL    |
|mart|2018-04-08|62  |62      |
|mart|2018-04-09|68  |130     |
|mart|2018-04-11|75  |205     |
|mart|2018-04-13|94  |237     |
|neil|2018-05-10|12  |12      |
|neil|2018-06-12|80  |92      |
|saml|2018-01-01|10  |10      |
|saml|2018-01-05|46  |56      |
|saml|2018-01-08|55  |111     |
|saml|2018-02-03|23  |124     |
|saml|2018-04-06|42  |120     |
|tony|2018-01-02|15  |15      |
|tony|2018-01-04|29  |44      |
|tony|2018-01-07|50  |94      |
+----+----------+----+--------+
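A frame of rows between 2 preceding and current row is a sliding window of at most three rows. One way to picture it, as a sketch only, is a fixed-size queue that drops the oldest value when a fourth arrives; `last_n_sum` is a hypothetical name, and the input is one customer's costs already sorted by orderdate.

```python
from collections import deque


def last_n_sum(values, n=3):
    """Sum of the current value and up to n-1 values before it."""
    window = deque(maxlen=n)   # the oldest value falls out automatically
    out = []
    for value in values:
        window.append(value)
        out.append(sum(window))
    return out


mart = last_n_sum([62, 68, 75, 94])        # [62, 130, 205, 237]
saml = last_n_sum([10, 46, 55, 23, 42])    # [10, 56, 111, 124, 120]
```

For the first two rows of a group fewer than three purchases exist, so the result equals the running total; from the third row on, each value covers exactly three purchases.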

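Summary

The point of window functions is detail plus aggregation: the result needs both
Understand the relationship between the size of the group and the size of the window
The window clause gives finer-grained control than partition by
Whenever you need both detail rows and aggregates in one query, use a window function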