Understanding PRECEDING and FOLLOWING in Hive

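Before looking at Hive window functions in general, let's see what PRECEDING and FOLLOWING mean in a Hive window specification.
A Hive window function can aggregate over a specified number of rows before or after the current row. The keywords are preceding and following, and the examples below show how to use them.
Plain window functions are fairly simple, so this post concentrates on partitioned windows, and in particular on how rows between behaves after partitioning and ordering.
The key is to understand what the keywords in rows between mean: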

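Keyword Meaning
preceding rows before the current row
following rows after the current row
current row the current row
unbounded no limit in that direction (the edge of the partition)
unbounded preceding from the first row of the partition
unbounded following through the last row of the partition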

Example

select country,time,charge,
max(charge) over (partition by country order by time) as normal,
max(charge) over (partition by country order by time rows between unbounded preceding and current row) as unb_pre_cur,
max(charge) over (partition by country order by time rows between 2 preceding and 1 following) as pre2_fol1,
max(charge) over (partition by country order by time rows between current row and unbounded following) as cur_unb_fol
from temp

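Notes: when ORDER BY is present and no frame is given, the aggregate is computed over the rows from the start of the partition up to the current row.
rows between unbounded preceding and current row gives the same result as the default as long as the ORDER BY values are unique within the partition; with ties, the default (a RANGE frame) also includes the rows that share the current row's value.
rows between 2 preceding and 1 following computes over the 2 rows before the current row, the current row, and the 1 row after it.
rows between current row and unbounded following computes over the current row through the last row of the partition.
rows between has the same meaning for the avg, min, max, and sum window functions; check the query results to confirm.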
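
The ROWS frames used in the query above can be modelled outside Hive. The sketch below is a pure-Python illustration, not Hive code: `rows_frame_agg` is a hypothetical helper, and `charges` is made-up data standing in for the `charge` values of one country, already sorted by `time`.

```python
def rows_frame_agg(values, start, end, agg):
    """Apply agg to a ROWS frame around every position in values.

    start and end are row offsets relative to the current row:
    negative = preceding, positive = following, None = unbounded.
    An empty frame yields None, mirroring SQL NULL.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if start is None else max(0, i + start)
        hi = n - 1 if end is None else min(n - 1, i + end)
        frame = values[lo:hi + 1] if hi >= lo else []
        out.append(agg(frame) if frame else None)
    return out


# Hypothetical charge values for a single partition, sorted by time.
charges = [10, 30, 20, 50, 40]

unb_pre_cur = rows_frame_agg(charges, None, 0, max)   # unbounded preceding .. current row
pre2_fol1 = rows_frame_agg(charges, -2, 1, max)       # 2 preceding .. 1 following
cur_unb_fol = rows_frame_agg(charges, 0, None, max)   # current row .. unbounded following

print(unb_pre_cur)
print(pre2_fol1)
print(cur_unb_fol)
```

Each output list has one entry per input row, which is the main difference from GROUP BY: the frame slides with the current row instead of collapsing the partition to a single value.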

Create a test table in Hive

create table tmp_student
(
  name            string,
  class           tinyint,
  cooperator_name string,
  score           tinyint
)
row format delimited fields terminated by '|';

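Load the test data
load data local inpath 'text.txt' into table tmp_student;
The contents of text.txt are as follows (the score on the fourth line is the non-numeric value n, so it loads as NULL):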

adf|3|测试公司1|45
xx|3|测试公司2|55
cfe|2|测试公司2|74
3dd|3|测试公司5|n
fda|1|测试公司7|80
gds|2|测试公司9|92
ffd|1|测试公司10|95
dss|1|测试公司4|95
ddd|3|测试公司3|99
gf|3|测试公司9|99

Check that the data loaded

hive> select * from tmp_student;
adf 3   测试公司1   45
xx 3    测试公司2   55
cfe 2   测试公司2   74
3dd 3   测试公司5   NULL
fda 1   测试公司7   80
gds 2   测试公司9   92
ffd 1   测试公司10  95
dss 1   测试公司4   95
ddd 3   测试公司3   99
gf  3   测试公司9   99
Time taken: 1.314 seconds, Fetched: 10 row(s)

Now run the following SQL to see how preceding and following behave:

select name, score,
  sum(score) over(order by score range between 2 preceding and 2 following) s1, -- all rows whose score is within 2 of the current row's score
  sum(score) over(order by score rows between 2 preceding and 2 following) s2, -- current row plus 2 rows before and 2 after, 5 rows in total
  sum(score) over(order by score range between unbounded preceding and unbounded following) s3, -- all rows, no limit
  sum(score) over(order by score rows between unbounded preceding and unbounded following) s4, -- all rows, no limit
  sum(score) over(order by score) s5, -- first row to current row (every row with the same score as the current row is included)
  sum(score) over(order by score rows between unbounded preceding and current row) s6, -- first row to current row (rows with the same score that sort after the current row are not included; this is the difference from s5)
  sum(score) over(order by score rows between 3 preceding and current row) s7, -- current row plus the 3 rows before it
  sum(score) over(order by score rows between 3 preceding and 1 following) s8, -- current row plus 3 rows before and 1 row after
  sum(score) over(order by score rows between current row and unbounded following) s9 -- current row plus all rows after it
from tmp_student
order by score;

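Notes:
When ORDER BY is present but the window clause is omitted, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and the window clause are omitted, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
rows is a physical window: it counts rows by their position after sorting, regardless of the current row's value (the value of the order by key). This is how most people expect a window to behave.
range is a logical window: it depends on the current row's value (the value of the order by key), and the offsets are applied to that key.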
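
The difference between the physical and the logical window is easiest to see on data with ties. The following is a rough pure-Python model, not Hive itself: the helper names are hypothetical, and `scores` holds only the non-NULL scores from the sample data, pre-sorted (how Hive places and treats the NULL row is left out of this sketch).

```python
def sum_rows(values, before, after):
    """ROWS frame: a fixed number of physical rows around position i."""
    n = len(values)
    return [sum(values[max(0, i - before):min(n, i + after + 1)]) for i in range(n)]


def sum_range(values, before, after):
    """RANGE frame: every row whose key lies in [v - before, v + after]."""
    return [sum(x for x in values if v - before <= x <= v + after) for v in values]


def sum_default(values):
    """Default frame with ORDER BY: RANGE unbounded preceding .. current row,
    so rows sharing the current key (peers) are always included."""
    return [sum(x for x in values if x <= v) for v in values]


# Non-NULL scores from the sample table, sorted ascending.
scores = [45, 55, 74, 80, 92, 95, 95, 99, 99]

s1 = sum_range(scores, 2, 2)            # like s1: score within +/- 2
s2 = sum_rows(scores, 2, 2)             # like s2: 2 rows before, 2 rows after
s5 = sum_default(scores)                # like s5: default frame
s6 = sum_rows(scores, len(scores), 0)   # like s6: ROWS unbounded preceding .. current row

print(s1)
print(s2)
print(s5)
print(s6)
```

For the two rows with score 95, the default frame reports the same running total on both, while the ROWS frame reports a different total on each, which is the s5 versus s6 distinction described in the comments.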

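The query returns one column per frame definition (s1 through s9), which can be checked against the comments in the query.
These exercises give a fairly complete picture of preceding and following. A window function is loosely comparable to a window in Flink: every aggregate is computed over the rows inside the window. Each calculation therefore starts by using range or rows to decide which rows fall inside the window; once that is settled, the correct result follows easily. range defines a logical range and rows a physical one, but in both cases the window can ultimately be viewed as an aggregate over rows N through M.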

Windowing functions

  • FIRST_VALUE(col, bool DEFAULT)

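    Returns the value of col from the first row of the window. The second argument defaults to false; if it is set to true, NULL values are skipped before the value is taken. For the first row of each partition, FIRST_VALUE(col, bool DEFAULT) is simply col itself; for the following rows, whether a leading NULL is kept or skipped depends on the true/false flag.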

WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id, 'b' AS col UNION ALL SELECT 1 AS group_id, 'c' AS col UNION ALL SELECT 2 AS group_id, NULL AS col UNION ALL SELECT 2 AS group_id, 'e' AS col
)
SELECT group_id, col, FIRST_VALUE(col) OVER (PARTITION BY group_id ORDER BY col) AS col_new
FROM tmp;
Result:
group_id col col_new
1 a a
1 b a
1 c a
2 NULL NULL
2 e NULL
With the flag set to true:
WITH tmp AS
(SELECT 1 AS group_id, NULL AS col UNION ALL SELECT 1 AS group_id, 'b' AS col UNION ALL SELECT 1 AS group_id, 'c' AS col UNION ALL SELECT 2 AS group_id, NULL AS col UNION ALL SELECT 2 AS group_id, 'e' AS col
)
SELECT group_id, col, FIRST_VALUE(col, true) OVER (PARTITION BY group_id ORDER BY col) AS col_new
FROM tmp;
Result:
group_id col col_new
1 NULL NULL
1 b b
1 c b
2 NULL NULL
2 e e
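
Both FIRST_VALUE result tables can be reproduced with a small model. This is a hedged pure-Python sketch rather than Hive's implementation: `first_value` and `running_first_value` are hypothetical helpers, `None` stands in for NULL, and each list is one partition in the order shown in the result tables (NULL first), with the frame running from the partition start to the current row.

```python
def first_value(frame, skip_nulls=False):
    """Return the first element of frame, or the first non-None one when skip_nulls is set."""
    for v in frame:
        if v is not None or not skip_nulls:
            return v
    return None  # frame was empty, or contained only NULLs while skipping


def running_first_value(partition, skip_nulls=False):
    """Evaluate first_value over the frame 'partition start .. current row' for each row."""
    return [first_value(partition[:i + 1], skip_nulls) for i in range(len(partition))]


flag_false = running_first_value([None, 'e'])                      # group 2, first query
flag_true = running_first_value([None, 'b', 'c'], skip_nulls=True)  # group 1, second query

print(flag_false)
print(flag_true)
```

With the flag left at false the leading NULL is returned for every row of the partition; with true, a row whose frame contains only NULL still yields NULL, and later rows pick up the first non-NULL value.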
  • LAST_VALUE(col, bool DEFAULT)
    Returns the value of col from the last row of the window. The second argument defaults to false; if it is set to true, NULL values are skipped before the value is taken.
WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id, NULL AS col UNION ALL SELECT 1 AS group_id, 'c' AS col UNION ALL SELECT 2 AS group_id, 'd' AS col UNION ALL SELECT 2 AS group_id, 'e' AS col
)
SELECT group_id, col, LAST_VALUE(col) OVER (PARTITION BY group_id ORDER BY col DESC) AS col_new FROM tmp;
Result:
group_id col col_new
1 c c
1 a a
1 NULL NULL
2 e e
2 d d
With the flag set to true:
WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id, NULL AS col UNION ALL SELECT 1 AS group_id, 'c' AS col UNION ALL SELECT 2 AS group_id, 'd' AS col UNION ALL SELECT 2 AS group_id, 'e' AS col
)
SELECT group_id, col, LAST_VALUE(col, true) OVER (ORDER BY group_id,col DESC ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS col_new FROM tmp;
Result:
group_id col col_new
1 c a
1 a a
1 NULL e
2 e d
2 d d
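Unlike GROUP BY, a window function keeps every input row in the output. The SELECT list usually includes the same columns that OVER uses for partitioning and ordering, which makes it easy to see how each row differs from, and contributes to, the aggregate. Of the two window functions above, the last example deserves a closer look.
With a window function, the first step is to determine the window. From the PRECEDING and FOLLOWING discussion above, the window here is [previous row, current row, next row]. For the first row (1 c) the window is [no row, c, a], and its last element is a. For the second row (1 a) the window is [c, a, NULL]; the last element is NULL, which is skipped, giving a. For the row 2 e the window is [NULL, e, d], so last_value = d.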
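
The walkthrough above can be checked mechanically. The sketch below is a pure-Python model under stated assumptions, not Hive code: `last_value` is a hypothetical helper, `None` stands in for NULL, and `rows` is the `col` column in the order shown in the result table (ordered by group_id, then col descending).

```python
def last_value(frame, skip_nulls=False):
    """Return the last element of frame, or the last non-None one when skip_nulls is set."""
    for v in reversed(frame):
        if v is not None or not skip_nulls:
            return v
    return None


# col values in result order: (1,c), (1,a), (1,NULL), (2,e), (2,d)
rows = ['c', 'a', None, 'e', 'd']

result = []
for i in range(len(rows)):
    # ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING, clipped at the edges
    frame = rows[max(0, i - 1):i + 2]
    result.append(last_value(frame, skip_nulls=True))

print(result)
```

Because the query has no PARTITION BY, the frame of the third row reaches into group 2, which is why that row picks up e.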
  • LEAD(col, n, DEFAULT)

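Returns the value of col from the row n rows after the current row within the partition. n defaults to 1; if there is no such row, DEFAULT is returned (DEFAULT defaults to NULL). When PARTITION BY is used, the lookup never crosses partition boundaries.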

WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id,  'b' AS col UNION ALL SELECT 1 AS group_id,  'c' AS col UNION ALL SELECT 2 AS group_id,  'd' AS col UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,col,LEAD(col) over(partition by group_id order by col) as col_new
FROM tmp;

Result

group_id col col_new
1 a b
1 b c
1 c NULL
2 d e
2 e NULL

This is equivalent to

WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id,  'b' AS col UNION ALL SELECT 1 AS group_id,  'c' AS col UNION ALL SELECT 2 AS group_id,  'd' AS col UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,col,LAST_VALUE(col) over(partition by group_id order by col rows between 1 FOLLOWING and 1 FOLLOWING) as col_new
FROM tmp;

Here rows between 1 FOLLOWING and 1 FOLLOWING means the window starts one row after the current row and ends one row after it, that is, exactly the next row.
Result

group_id col col_new
1 a b
1 b c
1 c NULL
2 d e
2 e NULL

Using LEAD with an offset and a default value

WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id,  'b' AS col UNION ALL SELECT 1 AS group_id,  'c' AS col UNION ALL SELECT 2 AS group_id,  'd' AS col UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,col,LEAD(col, 2, 'z') over(partition by group_id order by col) as col_new
FROM tmp;

Result

group_id col col_new
1 a c
1 b z
1 c z
2 d z
2 e z
  • LAG(col, n, DEFAULT)
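    Returns the value of col from the row n rows before the current row within the partition. n defaults to 1; if there is no such row, DEFAULT is returned (DEFAULT defaults to NULL).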
WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id,  'b' AS col UNION ALL SELECT 1 AS group_id,  'c' AS col UNION ALL SELECT 2 AS group_id,  'd' AS col UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,col,LAG(col) over(partition by group_id order by col) as col_new
FROM tmp;

This is equivalent to

WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id,  'b' AS col UNION ALL SELECT 1 AS group_id,  'c' AS col UNION ALL SELECT 2 AS group_id,  'd' AS col UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,col,FIRST_VALUE(col) over(partition by group_id order by col rows BETWEEN 1 PRECEDING and 1 PRECEDING) as col_new
FROM tmp;

Both queries return

group_id col col_new
1 a NULL
1 b a
1 c b
2 d NULL
2 e d

Using an offset and a default value

WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id,  'b' AS col UNION ALL SELECT 1 AS group_id,  'c' AS col UNION ALL SELECT 2 AS group_id,  'd' AS col UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,col,LAG(col, 2, 'zz') over(partition by group_id order by col) as col_new
FROM tmp;

Result

group_id col col_new
1 a zz
1 b zz
1 c a
2 d zz
2 e zz
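
LEAD and LAG are plain positional lookups inside one partition. The following pure-Python sketch models that behaviour; the `lead` and `lag` helpers and the `groups` dictionary are illustrative assumptions, with each list holding one partition's `col` values already sorted.

```python
def lead(partition, n=1, default=None):
    """Value n rows after each row, or default when that row does not exist."""
    size = len(partition)
    return [partition[i + n] if i + n < size else default for i in range(size)]


def lag(partition, n=1, default=None):
    """Value n rows before each row, or default when that row does not exist."""
    return [partition[i - n] if i - n >= 0 else default for i in range(len(partition))]


# One sorted list per group_id, matching the sample data above.
groups = {1: ['a', 'b', 'c'], 2: ['d', 'e']}

lead_1 = {g: lead(rows) for g, rows in groups.items()}
lead_2_z = {g: lead(rows, 2, 'z') for g, rows in groups.items()}
lag_1 = {g: lag(rows) for g, rows in groups.items()}
lag_2_zz = {g: lag(rows, 2, 'zz') for g, rows in groups.items()}

print(lead_1)
print(lead_2_z)
print(lag_1)
print(lag_2_zz)
```

Applying the helpers per partition is what keeps the lookup from crossing group boundaries: the last row of group 1 gets the default rather than the first value of group 2.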

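The OVER clause in detail

FUNCTION(expr) OVER([PARTITION BY statement] [ORDER BY statement] [window clause])
Square brackets mark optional parts.
FUNCTION: standard aggregate functions (COUNT/SUM/MIN/MAX/AVG) and analytics functions (RANK/ROW_NUMBER/DENSE_RANK, etc.)
PARTITION BY: one or more columns
ORDER BY: one or more columns
window clause: (ROWS | RANGE) BETWEEN (UNBOUNDED PRECEDING | num PRECEDING | CURRENT ROW | num FOLLOWING) AND (num PRECEDING | CURRENT ROW | num FOLLOWING | UNBOUNDED FOLLOWING)
When the window clause is omitted but ORDER BY is present, the default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: the window runs from the first row of the partition to the current row.
When both the window clause and ORDER BY are omitted, the default is ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING:
the window runs from the first row of the partition to the last row.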

Standard aggregate functions

COUNT(expr) OVER()
Returns the number of rows in the window (rows where expr is NULL are not counted)
WITH tmp AS
(SELECT 1 AS group_id, 'a' AS col UNION ALL SELECT 1 AS group_id,  'b' AS col UNION ALL SELECT 1 AS group_id,  'c' AS col UNION ALL SELECT 2 AS group_id,  'e' AS col UNION ALL SELECT 2 AS group_id,  'e' AS col
)
SELECT group_id,col,count(col) over(partition by group_id) as cnt1,count(col) over(partition by group_id order by col) as cnt2,count(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as cnt3,count(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as cnt4
FROM tmp;
Result:
group_id col cnt1 cnt2 cnt3 cnt4
1 a 3 1 3 3
1 b 3 2 2 2
1 c 3 3 1 1
2 e 2 2 2 1
2 e 2 2 1 1

SUM(expr) OVER()
Returns the sum over the window
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  2 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  4 AS col
)
SELECT group_id,col,SUM(col) over(partition by group_id) as sum1,SUM(col) over(partition by group_id order by col) as sum2,SUM(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as sum3,SUM(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as sum4
FROM tmp;
Result:
group_id col sum1 sum2 sum3 sum4
1 1 6 1 6 6
1 2 6 3 5 5
1 3 6 6 3 3
2 4 8 8 8 4
2 4 8 8 4 4

MIN(expr) OVER()
Returns the minimum value in the window
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  2 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,col,MIN(col) over(partition by group_id) as min1,MIN(col) over(partition by group_id order by col) as min2,MIN(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as min3
FROM tmp;
group_id col min1 min2 min3
1 1 1 1 1
1 2 1 1 2
1 3 1 1 3
2 4 4 4 4
2 5 4 4 5

MAX(expr) OVER()
Returns the maximum value in the window
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  2 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,col,MAX(col) over(partition by group_id) as max1,MAX(col) over(partition by group_id order by col) as max2,MAX(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as max3
FROM tmp;
Result:
group_id col max1 max2 max3
1 1 3 1 3
1 2 3 2 3
1 3 3 3 3
2 4 5 4 5
2 5 5 5 5

AVG(expr) OVER()
Returns the average over the window
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  2 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  4 AS col
)
SELECT group_id,col,AVG(col) over(partition by group_id) as avg1,AVG(col) over(partition by group_id order by col) as avg2,AVG(col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg3,AVG(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg4
FROM tmp;
Result:
|group_id|col|avg1|avg2|avg3|avg4|
|1|1|2.0|1.0|2.0|2.0|
|1|2|2.0|1.5|2.5|2.5|
|1|3|2.0|2.0|3.0|3.0|
|2|4|4.0|4.0|4.0|4.0|
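|2|4|4.0|4.0|4.0|4.0|

Analytics functions

RANK() OVER()
Returns the rank within the partition; tied rows share a rank and the following ranks are skipped (custom window frames are not supported)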
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,col,RANK() over(partition by group_id order by col desc) as r
FROM tmp;
Result:
|group_id|col|r|
|1|3|1|
|1|3|1|
|1|1|3|
|2|5|1|
|2|4|2|

ROW_NUMBER() OVER()
Returns the row number within the partition (custom window frames are not supported)
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,col,ROW_NUMBER() over(partition by group_id order by col desc) as r
FROM tmp;
Result:
|group_id|col|r|
|1|3|1|
|1|3|2|
|1|1|3|
|2|5|1|
|2|4|2|

DENSE_RANK() OVER()
Returns the rank within the partition; ties do not leave gaps in the ranking (custom window frames are not supported)
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,col,DENSE_RANK() over(partition by group_id order by col desc) as r
FROM tmp;
Result:
|group_id|col|r|
|1|3|1|
|1|3|1|
|1|1|2|
|2|5|1|
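|2|4|2|

CUME_DIST() OVER()
Returns the cumulative distribution within the partition: (number of rows in the partition whose value is less than or equal to the current value, or greater than or equal to it for a descending order) / (total rows in the partition)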
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,col,CUME_DIST() over(partition by group_id order by col asc) as d1,CUME_DIST() over(partition by group_id order by col desc) as d2
FROM tmp;
Result:
|group_id|col|d1|d2|
|1|3|1.0|0.6666666666666666|
|1|3|1.0|0.6666666666666666|
|1|1|0.3333333333333333|1.0|
|2|5|1.0|0.5|
|2|4|0.5|1.0|

PERCENT_RANK() OVER()
Returns the percentile rank: (RANK of the current row within the partition - 1) / (total rows in the partition - 1)
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,col,RANK() over(partition by group_id order by col asc) as r1,PERCENT_RANK() over(partition by group_id order by col asc) as p1,RANK() over(partition by group_id order by col desc) as r2,PERCENT_RANK() over(partition by group_id order by col desc) as p2
FROM tmp;
Result:
|group_id|col|r1|p1|r2|p2|
|1|3|2|0.5|1|0.0|
|1|3|2|0.5|1|0.0|
|1|1|1|0.0|3|1.0|
|2|5|2|1.0|1|0.0|
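|2|4|1|0.0|2|1.0|

NTILE(INTEGER x) OVER()
Returns a bucket number: the ordered partition is divided into x groups, called buckets, and every row in the partition is assigned the number of its bucket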
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,col,NTILE(2) over(partition by group_id order by col asc) as bucket_id
FROM tmp;
Result:
|group_id|col|bucket_id|
|1|1|1|
|1|3|1|
|1|3|2|
|1|3|2|
|2|4|1|
|2|5|2|
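
The analytics functions in this section all derive from a row's position among its sorted peers, so their formulas fit in a few lines. This is a pure-Python sketch with hypothetical helper names, not Hive's implementation; each function expects one partition's values already sorted in the ORDER BY direction. The NTILE model assumes the usual SQL convention that earlier buckets receive the extra rows when the partition does not divide evenly.

```python
def rank(values):
    """Position of the first row with the same value, 1-based (gaps after ties)."""
    return [values.index(v) + 1 for v in values]


def dense_rank(values):
    """Number of distinct values sorted before the current one, plus 1 (no gaps)."""
    return [len(set(values[:values.index(v)])) + 1 for v in values]


def row_number(values):
    """Plain 1-based position, ties broken by order of appearance."""
    return list(range(1, len(values) + 1))


def cume_dist(values):
    """Rows at or before the last peer of the current value, divided by partition size."""
    n = len(values)
    return [(n - values[::-1].index(v)) / n for v in values]


def percent_rank(values):
    """(rank - 1) / (partition size - 1); defined as 0.0 for a single-row partition."""
    n = len(values)
    return [(r - 1) / (n - 1) if n > 1 else 0.0 for r in rank(values)]


def ntile(n_rows, buckets):
    """Bucket number per row; the first n_rows % buckets buckets get one extra row."""
    base, extra = divmod(n_rows, buckets)
    out = []
    for b in range(1, buckets + 1):
        out.extend([b] * (base + (1 if b <= extra else 0)))
    return out


desc_group1 = [3, 3, 1]  # group_id 1 ordered by col desc
asc_group1 = [1, 3, 3]   # group_id 1 ordered by col asc

print(rank(desc_group1), dense_rank(desc_group1), row_number(desc_group1))
print(cume_dist(asc_group1), cume_dist(desc_group1))
print(percent_rank(asc_group1), percent_rank(desc_group1))
print(ntile(4, 2), ntile(2, 2))
```

Comparing the printed lists with the result tables for group_id 1 is a quick way to confirm which rows count as peers in each function.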
Aggregate functions in the OVER clause
In Hive 2.1.0 and later, aggregate functions can also be used inside the OVER clause, for example:
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  5 AS col
)
SELECT group_id,RANK() over(order by sum(col) desc) as r
FROM tmp
group by group_id;
Result:
|group_id|r|
|2|1|
|1|2|

Another way to write the window clause
Put the WINDOW clause after FROM and refer to it by its alias after OVER, as follows:
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  2 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  4 AS col
)
SELECT group_id,col,AVG(col) over w1 as avg1,AVG(distinct col) over(partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following) as avg2
FROM tmp
WINDOW w1 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following);
Result:
|group_id|col|avg1|avg2|
|1|1|2.0|2.0|
|1|2|2.5|2.5|
|1|3|3.0|3.0|
|2|4|4.0|4.0|
|2|4|4.0|4.0|

Several named windows can be declared in one WINDOW clause:
WITH tmp AS
(SELECT 1 AS group_id, 1 AS col UNION ALL SELECT 1 AS group_id,  2 AS col UNION ALL SELECT 1 AS group_id,  3 AS col UNION ALL SELECT 2 AS group_id,  4 AS col UNION ALL SELECT 2 AS group_id,  4 AS col
)
SELECT group_id,col,AVG(col) over w1 as avg1,AVG(distinct col) over w2 as avg2
FROM tmp
WINDOW w1 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following),
w2 AS (partition by group_id order by col rows between CURRENT ROW and UNBOUNDED following);
Result:
|group_id|col|avg1|avg2|
|1|1|2.0|2.0|
|1|2|2.5|2.5|
|1|3|3.0|3.0|
|2|4|4.0|4.0|
|2|4|4.0|4.0|

End of article.
Any suggestions and criticisms will be sincerely welcomed.
