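Hive window functions (OVER) explained

Hive window functions:

I. Function descriptions:

OVER(): specifies the window of rows that the analytic function operates on; the window can change from row to row.

CURRENT ROW: the current row

n PRECEDING: the n rows before the current row
n FOLLOWING: the n rows after the current row

UNBOUNDED: no limit in that direction. UNBOUNDED PRECEDING means the window starts at the first row of the partition; UNBOUNDED FOLLOWING means it ends at the last row of the partition.

LAG(col,n): the value of col from the n-th row before the current row
LEAD(col,n): the value of col from the n-th row after the current row

NTILE(n): distributes the rows of an ordered partition into n groups, numbered from 1, and returns for each row the number of the group it belongs to. Note: n must be of type int.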
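
To make the frame keywords concrete, here is a minimal Python sketch (not Hive code) that mimics how a ROWS frame picks the rows around the current row. The function name frame_sum and its parameters are made up for this illustration; None stands in for UNBOUNDED, and the input list is assumed to be already sorted.

```python
def frame_sum(values, preceding=None, following=0):
    """Sum over a ROWS frame for every position of an already-sorted list.

    preceding / following are row counts; None stands for UNBOUNDED.
    (Hypothetical helper for illustration, not a Hive API.)
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n - 1 if following is None else min(n - 1, i + following)
        out.append(sum(values[start:end + 1]))
    return out


costs = [10, 15, 29, 46]

# rows between UNBOUNDED PRECEDING and CURRENT ROW
running = frame_sum(costs)
# rows between 1 PRECEDING and 1 FOLLOWING
sliding = frame_sum(costs, preceding=1, following=1)
# rows between UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING
whole = frame_sum(costs, preceding=None, following=None)

print(running)
print(sliding)
print(whole)
```

At the edges of the list the frame simply gets shorter, which is why the first and last sliding sums cover only two rows.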

II. Sample data:

Columns: name,orderdate,cost

jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
mart,2017-05-10,12
mart,2017-04-11,75
mart,2017-06-12,80
mart,2017-04-13,94

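III. Requirements

(1) Find the customers who made a purchase in April 2017, and the total number of such customers
(2) Show each customer's purchase details together with the monthly purchase total
(3) In the same scenario, accumulate cost by date
(4) Find each customer's previous purchase date
(5) Find the orders in the earliest 20% by date

IV. Setup

1. Create a local file business.txt containing the data above
2. Create the Hive table and load the data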

create table business(
name string,
orderdate string,
cost int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Load the data:

 load data local inpath "/business.txt" into table business;

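V. Queries for each requirement

(1) Find the customers who made a purchase in April 2017, and the total number of such customers

1) Without a window function: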

select name,count(*)
from business
where substring(orderdate,1,7)="2017-04"
group by name;

Result:

name	_c1
jack	1
mart	4

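This counts each customer's orders in April, not the number of customers. The result we want lists every customer who bought in April together with the total number of such customers: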

name	_c1
jack	2
mart	2

2) With a window function (count(*) over() is evaluated after group by, so its window is the two grouped rows):

select name,count(*) over()
from business
where substring(orderdate,1,7)="2017-04"
group by name;
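
The order of evaluation is the key point here: the window is applied after WHERE and GROUP BY. A rough Python imitation of the two queries, assuming the rows are the sample data above (the variable names are illustrative only, and this is plain Python rather than Hive):

```python
from collections import Counter

ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("mart", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("mart", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

# where substring(orderdate,1,7)="2017-04"
april = [row for row in ROWS if row[1][:7] == "2017-04"]

# Query 1: group by name + count(*) -> orders per customer
per_customer = dict(Counter(name for name, _, _ in april))

# Query 2: group by name first, then count(*) over() counts the grouped rows
grouped_names = sorted({name for name, _, _ in april})
with_total = [(name, len(grouped_names)) for name in grouped_names]

print(per_customer)
print(with_total)
```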

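(2) Show each customer's purchase details together with the monthly purchase total
To get the monthly total, partition the rows by month and then sum the cost within each month. All of the sample data falls in 2017, so month(orderdate) is enough here; data spanning several years would need to be partitioned by year and month.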

select *,sum(cost) over(distribute by month(orderdate)) from business;

Here distribute by is equivalent to partition by (it defines the partitions).

Result:
name orderdate cost sum_window_0

jack 2017-01-01 10 205
tony 2017-01-02 15 205
tony 2017-01-07 50 205
jack 2017-01-08 55 205
tony 2017-01-04 29 205
jack 2017-01-05 46 205

jack 2017-02-03 23 23

mart 2017-04-11 75 341
jack 2017-04-06 42 341
mart 2017-04-08 62 341
mart 2017-04-09 68 341
mart 2017-04-13 94 341

mart 2017-05-10 12 12

mart 2017-06-12 80 80
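
As a cross-check of the totals above, the same partitioned sum can be sketched in plain Python: compute one total per month, then attach it to every row of that month. This is only an illustration of the semantics, not how Hive executes the query, and the names month_total and result are assumptions of this sketch.

```python
from collections import defaultdict

ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("mart", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("mart", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

# One pass to total each partition; characters 5..6 of the date are the month
month_total = defaultdict(int)
for _, orderdate, cost in ROWS:
    month_total[orderdate[5:7]] += cost

# Second pass: every row keeps its detail and gains its partition's total
result = [
    (name, orderdate, cost, month_total[orderdate[5:7]])
    for name, orderdate, cost in ROWS
]

for row in result:
    print(*row)
```

Unlike group by, no rows are collapsed: all 14 detail rows survive, each carrying its month's total.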

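Summary:
The OVER clause opens a window for every row, and the rules for that window can be specified. In requirement (1) the empty OVER() makes all of the grouped rows a single window; in requirement (2) each month forms its own window.

(3) In the same scenario, accumulate cost by date
To accumulate by date, first sort by date; each row's running total is then the sum of its own cost and the costs of all earlier rows.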

select *,sum(cost) over(sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)
from business;

Result:
name orderdate cost sum_window_0

jack 2017-01-01 10 10
tony 2017-01-02 15 25
tony 2017-01-04 29 54
jack 2017-01-05 46 100
tony 2017-01-07 50 150
jack 2017-01-08 55 205
jack 2017-02-03 23 228
jack 2017-04-06 42 270
mart 2017-04-08 62 332
mart 2017-04-09 68 400
mart 2017-04-11 75 475
mart 2017-04-13 94 569
mart 2017-05-10 12 581
mart 2017-06-12 80 661

Sum over a sliding window of three rows (the previous row, the current row, and the next row):

select *,sum(cost) over(sort by orderdate rows between 1 PRECEDING and 1 FOLLOWING)
from business;

Running total of each customer's spending by date:

select *,sum(cost) over(distribute by name sort by orderdate rows between UNBOUNDED PRECEDING and CURRENT ROW)
from business;
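
The per-customer running total combines all three parts of a window: partition, ordering, and frame. A small Python sketch of what the query is expected to produce on the sample data (an imitation of the semantics; the variable names are assumptions, and the output is ordered by name only because the sketch sorts that way):

```python
from itertools import accumulate, groupby

ROWS = [
    ("jack", "2017-01-01", 10), ("tony", "2017-01-02", 15),
    ("jack", "2017-02-03", 23), ("tony", "2017-01-04", 29),
    ("jack", "2017-01-05", 46), ("jack", "2017-04-06", 42),
    ("tony", "2017-01-07", 50), ("jack", "2017-01-08", 55),
    ("mart", "2017-04-08", 62), ("mart", "2017-04-09", 68),
    ("mart", "2017-05-10", 12), ("mart", "2017-04-11", 75),
    ("mart", "2017-06-12", 80), ("mart", "2017-04-13", 94),
]

# distribute by name, sort by orderdate
ordered = sorted(ROWS, key=lambda row: (row[0], row[1]))

result = []
for name, group in groupby(ordered, key=lambda row: row[0]):
    group = list(group)
    # rows between UNBOUNDED PRECEDING and CURRENT ROW, restarted per customer
    running = accumulate(cost for _, _, cost in group)
    result.extend(row + (total,) for row, total in zip(group, running))

for row in result:
    print(*row)
```

The running total restarts at every change of name, which is exactly what the partition adds compared with the earlier query over the whole table.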

Summary: an OVER clause can sort, partition, or do both. A single select statement can also contain several windows.

(4) Find each customer's previous purchase date

select *,lag(orderdate,1) over(distribute by name sort by orderdate)
from business;

Result:
name orderdate cost lag_window_0

jack 2017-01-01 10 NULL
jack 2017-01-05 46 2017-01-01
jack 2017-01-08 55 2017-01-05
jack 2017-02-03 23 2017-01-08
jack 2017-04-06 42 2017-02-03

tony 2017-01-02 15 NULL
tony 2017-01-04 29 2017-01-02
tony 2017-01-07 50 2017-01-04

mart 2017-04-08 62 NULL
mart 2017-04-09 68 2017-04-08
mart 2017-04-11 75 2017-04-09
mart 2017-04-13 94 2017-04-11
mart 2017-05-10 12 2017-04-13
mart 2017-06-12 80 2017-05-10
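
LAG and LEAD are just shifted lookups inside the ordered partition. A minimal Python sketch, assuming the values passed in are already one customer's dates in sorted order (lag and lead here are hypothetical helpers written for this example, not Hive functions):

```python
def lag(values, n=1, default=None):
    """Value n rows before each position, or default when there is none."""
    return [values[i - n] if i >= n else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows after each position, or default when there is none."""
    size = len(values)
    return [values[i + n] if i + n < size else default for i in range(size)]


# tony's purchases from the sample data, sorted by orderdate
tony_dates = ["2017-01-02", "2017-01-04", "2017-01-07"]

previous_purchase = lag(tony_dates, 1)
next_purchase = lead(tony_dates, 1)

print(previous_purchase)
print(next_purchase)
```

The None entries correspond to the NULL values in the result above: the first row of a partition has no previous row.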

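Summary: this scenario is common in e-commerce, where it is used to find a user's previous purchase time. It is also used to work out from purchase logs whether a user completed a full purchase flow, that is, whether the user clicked on a product and then went on to pay for it.
For example, if the page flow of an e-commerce site is (login page -> product list page -> product detail page -> order page -> payment page), the conversion rate of each step in the flow has to be computed in order to find the problems at each step.

(5) Find the orders in the earliest 20% by date
This uses ntile(n), which splits the data into the given number (n) of groups.
We want the earliest 20%, so sort by date, split the rows into 5 groups, and take the first group.

1. Split the data into 5 groups: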

select *,ntile(5) over(sort by orderdate)
from business;

Result:
name orderdate cost ntile_window_0

jack 2017-01-01 10 1
tony 2017-01-02 15 1
tony 2017-01-04 29 1
jack 2017-01-05 46 2
tony 2017-01-07 50 2
jack 2017-01-08 55 2
jack 2017-02-03 23 3
jack 2017-04-06 42 3
mart 2017-04-08 62 3
mart 2017-04-09 68 4
mart 2017-04-11 75 4
mart 2017-04-13 94 4
mart 2017-05-10 12 5
mart 2017-06-12 80 5
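
Fourteen rows do not divide evenly into five groups, which is why the groups above have sizes 3, 3, 3, 3, 2. The sketch below reproduces that numbering in Python under the assumption that the leftover rows go to the lowest-numbered groups, which matches the result shown here (ntile is a hypothetical helper, not Hive's implementation):

```python
def ntile(row_count, buckets):
    """Group number for each of row_count ordered rows (groups start at 1).

    Assumption: when rows do not divide evenly, the first
    (row_count % buckets) groups each receive one extra row.
    """
    base, extra = divmod(row_count, buckets)
    numbers = []
    for bucket in range(1, buckets + 1):
        size = base + (1 if bucket <= extra else 0)
        numbers.extend([bucket] * size)
    return numbers


groups = ntile(14, 5)
print(groups)
```

Because of this rounding, group 1 holds 3 of the 14 rows, so "the earliest 20%" is an approximation.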

2. Select the earliest 20% of orders:
select * from(
select name,orderdate,cost,ntile(5) over(order by orderdate) as sorted
from business
) t
where sorted = 1;

Result:
name orderdate cost sorted

jack 2017-01-01 10 1
tony 2017-01-02 15 1
tony 2017-01-04 29 1

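Summary:
A window function works on each row: every row gets its own window. For example, select *,sum(cost) from business; is invalid, because select * returns many rows while sum(cost) returns a single row, so the two cannot be lined up.
A window function opens a window for every row, so the results line up.

Think of the data as the main scale of a vernier caliper and the window as the sliding jaw: partitioning limits how far the window can move, and sorting limits the size of the window (when a window has an ordering but no explicit frame, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).

Rank

Function descriptions:
RANK(): tied rows share a rank and the following rank is skipped, so the highest rank still equals the row count. For example, if two students tie for first place in a class, the ranks are 1, 1, 3, 4.
DENSE_RANK(): tied rows share a rank and no rank is skipped, so the highest rank is smaller than the row count (ranks 1, 1, 2, 3).
ROW_NUMBER(): numbers the rows consecutively in sort order, ties included (1, 2, 3, 4).

RANK() must always be followed by an OVER clause.

For example, to rank the students within each subject (filtering on the rank in an outer query then gives the top three per subject):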

select name,subject,
rank() over(partition by subject order by score desc),
DENSE_RANK() over(partition by subject order by score desc),
ROW_NUMBER() over(partition by subject order by score desc)
from score;
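
The difference between the three functions is easiest to see side by side. A minimal Python sketch that numbers one subject's scores, assuming they are already sorted in descending order; the scores themselves are made-up values, since the score table's data is not shown, and rankings is a hypothetical helper:

```python
def rankings(scores):
    """Return (rank, dense_rank, row_number) lists for pre-sorted scores."""
    rank, dense, row_number = [], [], []
    for i, score in enumerate(scores):
        if i > 0 and score == scores[i - 1]:
            # A tie repeats the previous rank in both RANK and DENSE_RANK
            rank.append(rank[-1])
            dense.append(dense[-1])
        else:
            rank.append(i + 1)                           # skips after ties
            dense.append(dense[-1] + 1 if dense else 1)  # never skips
        row_number.append(i + 1)                         # ignores ties
    return rank, dense, row_number


# Hypothetical scores for one subject: two students tie for first place
rank, dense, row_number = rankings([95, 95, 90, 85])

print(rank)
print(dense)
print(row_number)
```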