Hive window functions: partition by / order by / row_number / date_sub used together — worked examples (9)
Case 1: find the shops that sold on 3 consecutive days
I. Data source: write the raw (structured) data into a new local file
Raw data:
name,ctime,cost
a,2020-02-10,600
a,2020-03-01,200
a,2020-03-02,300
a,2020-03-03,200
a,2020-03-04,400
a,2020-03-05,600
a,2020-02-05,200
a,2020-02-06,300
a,2020-02-07,200
a,2020-02-08,400
b,2020-02-05,200
b,2020-02-06,300
b,2020-02-08,200
b,2020-02-09,400
b,2020-02-10,600
c,2020-01-31,200
c,2020-02-01,300
c,2020-02-02,200
c,2020-02-03,400
c,2020-02-10,600
Save the data to a local file on the Linux host:
vi /root/hive/business/sell.log
II. Create the table, load the data file into it, and check the load
Drop the table if it already exists:
drop table tb_sell;
Create the table:
create table tb_sell(
name string,
ctime string,
cost double
)
row format delimited fields terminated by ",";
Load the data:
load data local inpath "/root/hive/business/sell.log" into table tb_sell;
Check the load:
select * from tb_sell;
+---------------+----------------+---------------+
| tb_sell.name | tb_sell.ctime | tb_sell.cost |
+---------------+----------------+---------------+
| a | 2020-02-10 | 600.0 |
| a | 2020-03-01 | 200.0 |
| a | 2020-03-02 | 300.0 |
| a | 2020-03-03 | 200.0 |
| a | 2020-03-04 | 400.0 |
| a | 2020-03-05 | 600.0 |
| a | 2020-02-05 | 200.0 |
| a | 2020-02-06 | 300.0 |
| a | 2020-02-07 | 200.0 |
| a | 2020-02-08 | 400.0 |
| b | 2020-02-05 | 200.0 |
| b | 2020-02-06 | 300.0 |
| b | 2020-02-08 | 200.0 |
| b | 2020-02-09 | 400.0 |
| b | 2020-02-10 | 600.0 |
| c | 2020-01-31 | 200.0 |
| c | 2020-02-01 | 300.0 |
| c | 2020-02-02 | 200.0 |
| c | 2020-02-03 | 400.0 |
| c | 2020-02-10 | 600.0 |
+---------------+----------------+---------------+
III. Implementation approach / steps
1. Put rows for the same shop into one window and number each row within the window (one partition per shop):
select
*,
row_number() over(partition by name) -- partition by name, number the rows in each partition
from
tb_sell;
+---------------+----------------+---------------+----------------------+
| tb_sell.name | tb_sell.ctime | tb_sell.cost | row_number_window_0 |
+---------------+----------------+---------------+----------------------+
| a | 2020-02-10 | 600.0 | 1 |
| a | 2020-02-08 | 400.0 | 2 |
| a | 2020-02-07 | 200.0 | 3 |
| a | 2020-02-06 | 300.0 | 4 |
| a | 2020-02-05 | 200.0 | 5 |
| a | 2020-03-05 | 600.0 | 6 |
| a | 2020-03-04 | 400.0 | 7 |
| a | 2020-03-03 | 200.0 | 8 |
| a | 2020-03-02 | 300.0 | 9 |
| a | 2020-03-01 | 200.0 | 10 |
| b | 2020-02-10 | 600.0 | 1 |
| b | 2020-02-09 | 400.0 | 2 |
| b | 2020-02-08 | 200.0 | 3 |
| b | 2020-02-06 | 300.0 | 4 |
| b | 2020-02-05 | 200.0 | 5 |
| c | 2020-02-10 | 600.0 | 1 |
| c | 2020-02-03 | 400.0 | 2 |
| c | 2020-02-02 | 200.0 | 3 |
| c | 2020-02-01 | 300.0 | 4 |
| c | 2020-01-31 | 200.0 | 5 |
+---------------+----------------+---------------+----------------------+
2. The dates within each partition are unordered, so sort by ctime inside the window (the later date arithmetic depends on it) and give the row-number column an alias:
select
*, row_number() over(partition by name order by ctime) rn
from
tb_sell; -- the column previously auto-named row_number_window_0 is now aliased rn
+---------------+----------------+---------------+-----+
| tb_sell.name  | tb_sell.ctime  | tb_sell.cost  | rn  |
+---------------+----------------+---------------+-----+
| a | 2020-02-05 | 200.0 | 1 |
| a | 2020-02-06 | 300.0 | 2 |
| a | 2020-02-07 | 200.0 | 3 |
| a | 2020-02-08 | 400.0 | 4 |
| a | 2020-02-10 | 600.0 | 5 |
| a | 2020-03-01 | 200.0 | 6 |
| a | 2020-03-02 | 300.0 | 7 |
| a | 2020-03-03 | 200.0 | 8 |
| a | 2020-03-04 | 400.0 | 9 |
| a | 2020-03-05 | 600.0 | 10 |
| b | 2020-02-05 | 200.0 | 1 |
| b | 2020-02-06 | 300.0 | 2 |
| b | 2020-02-08 | 200.0 | 3 |
| b | 2020-02-09 | 400.0 | 4 |
| b | 2020-02-10 | 600.0 | 5 |
| c | 2020-01-31 | 200.0 | 1 |
| c | 2020-02-01 | 300.0 | 2 |
| c | 2020-02-02 | 200.0 | 3 |
| c | 2020-02-03 | 400.0 | 4 |
| c | 2020-02-10 | 600.0 | 5 |
+---------------+----------------+---------------+-----+
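As a cross-check (not part of the original tutorial), the effect of `row_number() over(partition by name order by ctime)` can be sketched in Python; the rows below are shop b's rows copied from the data above, and `number_rows` is a made-up name:

```python
from itertools import groupby

# (name, ctime) rows for shop "b", copied from tb_sell above
rows = [
    ("b", "2020-02-05"), ("b", "2020-02-06"), ("b", "2020-02-08"),
    ("b", "2020-02-09"), ("b", "2020-02-10"),
]

def number_rows(rows):
    """Emulate row_number() over(partition by name order by ctime)."""
    numbered = []
    # sorting by (name, ctime) gives both the partitioning and the sort order
    for name, part in groupby(sorted(rows), key=lambda r: r[0]):
        for rn, (_, ctime) in enumerate(part, start=1):
            numbered.append((name, ctime, rn))
    return numbered

for row in number_rows(rows):
    print(row)
```

The numbering restarts at 1 for every shop, exactly as in the table above.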
3. Subtract rn days from ctime using date_sub; rows whose diff values match form one run of consecutive sales days:
select
*,
date_sub(ctime, rn) as diff
from
(
select
*, row_number() over(partition by name order by ctime) rn
from
tb_sell) t1;
+----------+-------------+----------+--------+-------------+
| t1.name  | t1.ctime    | t1.cost  | t1.rn  | diff        |
+----------+-------------+----------+--------+-------------+
| a | 2020-02-05 | 200.0 | 1 | 2020-02-04 |
| a | 2020-02-06 | 300.0 | 2 | 2020-02-04 |
| a | 2020-02-07 | 200.0 | 3 | 2020-02-04 |
| a | 2020-02-08 | 400.0 | 4 | 2020-02-04 |
| a | 2020-02-10 | 600.0 | 5 | 2020-02-05 |
| a | 2020-03-01 | 200.0 | 6 | 2020-02-24 |
| a | 2020-03-02 | 300.0 | 7 | 2020-02-24 |
| a | 2020-03-03 | 200.0 | 8 | 2020-02-24 |
| a | 2020-03-04 | 400.0 | 9 | 2020-02-24 |
| a | 2020-03-05 | 600.0 | 10 | 2020-02-24 |
| b | 2020-02-05 | 200.0 | 1 | 2020-02-04 |
| b | 2020-02-06 | 300.0 | 2 | 2020-02-04 |
| b | 2020-02-08 | 200.0 | 3 | 2020-02-05 |
| b | 2020-02-09 | 400.0 | 4 | 2020-02-05 |
| b | 2020-02-10 | 600.0 | 5 | 2020-02-05 |
| c | 2020-01-31 | 200.0 | 1 | 2020-01-30 |
| c | 2020-02-01 | 300.0 | 2 | 2020-01-30 |
| c | 2020-02-02 | 200.0 | 3 | 2020-01-30 |
| c | 2020-02-03 | 400.0 | 4 | 2020-01-30 |
| c | 2020-02-10 | 600.0 | 5 | 2020-02-05 |
+----------+-------------+----------+--------+-------------+
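To see why this trick works, here is a small Python illustration (not part of the tutorial) of `date_sub(ctime, rn)` applied to shop b's (ctime, rn) pairs from step 2:

```python
from datetime import date, timedelta

# (ctime, rn) pairs for shop "b" from the step-2 output
pairs = [("2020-02-05", 1), ("2020-02-06", 2), ("2020-02-08", 3),
         ("2020-02-09", 4), ("2020-02-10", 5)]

def date_sub(ctime, n):
    """Like Hive's date_sub: subtract n days from a yyyy-MM-dd date string."""
    return (date.fromisoformat(ctime) - timedelta(days=n)).isoformat()

diffs = [date_sub(ctime, rn) for ctime, rn in pairs]
print(diffs)  # ['2020-02-04', '2020-02-04', '2020-02-05', '2020-02-05', '2020-02-05']
```

Within a run of consecutive dates, ctime and rn advance in lockstep, so their difference is a constant anchor date; the missing 2020-02-07 shifts every later row to a new anchor.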
4. Group by name and diff and count the rows in each group: the count is the number of consecutive sales days in that run.
select
name, diff,
count(1) days -- rows sharing the same diff are counted together
from
(
select
*,
date_sub(ctime, rn) as diff
from
(
select
*,
row_number() over(partition by name order by ctime) rn
from
tb_sell) t1) t2
group by name, diff -- group by name and diff
having days > 3; -- keep only runs longer than 3 days
Note: having days > 3 keeps runs of more than 3 days, which is what the output below shows; for "at least 3 consecutive days", as the requirement is literally worded, use having days >= 3 (that would also include shop b, whose longest run is exactly 3 days).
+-------+-------------+-------+
| name | diff | days |
+-------+-------------+-------+
| a | 2020-02-04 | 4 |
| a | 2020-02-24 | 5 |
| c | 2020-01-30 | 4 |
+-------+-------------+-------+
5. A shop can have more than one qualifying run (shop a appears twice above), so deduplicate the names:
select
distinct name -- deduplicate shop names
from
( select
name,diff,
count(1) days
from
(
select
*,
date_sub(ctime,rn)as diff
from
(
select
*,
row_number() over(partition by name order by ctime)rn
from
tb_sell)t1
)t2
group by name,diff
having days>3) t3;
+-------+
| name  |
+-------+
| a     |
| c     |
+-------+
Requirement answered: shops a and c sold on consecutive days.
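The whole case-1 pipeline can be cross-checked outside Hive. The Python sketch below is an independent re-implementation of steps 1–5 on the same 20 rows (`consecutive_shops` and `min_days` are hypothetical names, not anything from the tutorial):

```python
from collections import Counter
from datetime import date, timedelta
from itertools import groupby

rows = [  # (name, ctime) from tb_sell; cost is irrelevant to the question
    ("a", "2020-02-10"), ("a", "2020-03-01"), ("a", "2020-03-02"),
    ("a", "2020-03-03"), ("a", "2020-03-04"), ("a", "2020-03-05"),
    ("a", "2020-02-05"), ("a", "2020-02-06"), ("a", "2020-02-07"),
    ("a", "2020-02-08"),
    ("b", "2020-02-05"), ("b", "2020-02-06"), ("b", "2020-02-08"),
    ("b", "2020-02-09"), ("b", "2020-02-10"),
    ("c", "2020-01-31"), ("c", "2020-02-01"), ("c", "2020-02-02"),
    ("c", "2020-02-03"), ("c", "2020-02-10"),
]

def consecutive_shops(rows, min_days=4):  # min_days=4 mirrors "having days > 3"
    anchors = []
    for name, part in groupby(sorted(rows), key=lambda r: r[0]):
        for rn, (_, ctime) in enumerate(part, start=1):
            # date_sub(ctime, rn): one constant anchor date per consecutive run
            anchor = (date.fromisoformat(ctime) - timedelta(days=rn)).isoformat()
            anchors.append((name, anchor))
    counts = Counter(anchors)  # group by (name, diff), count(1)
    return sorted({name for (name, _), days in counts.items() if days >= min_days})

print(consecutive_shops(rows))  # ['a', 'c'] — matches the Hive result
```

Shop b's longest run is exactly 3 days, so it is excluded under the `> 3` rule, just as in the Hive output.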
Case 2: find the users who hit the whack-a-mole 3 times in a row
I. Data source: save the data to a local file on the Linux host
Raw data:
uid,seq,hit (seq = attempt number; hit: 1 = hit, 0 = miss)
u01,1,1
u01,2,0
u01,3,1
u01,4,1
u01,5,0
u01,6,1
u02,1,1
u02,2,1
u02,3,0
u02,4,1
u02,5,1
u02,6,0
u02,7,0
u02,8,1
u02,9,1
u03,1,1
u03,2,1
u03,3,1
u03,4,1
u03,5,1
u03,6,0
Save the data to a local file:
vi /root/hive/hitmouse.txt
II. Create the table, load the data into it, and check the load
Drop the table if it already exists:
drop table tb_hitmouse;
Create a plain managed (internal) table (dropping a managed table also deletes the files under its table directory):
create table tb_hitmouse(
uid string,
seq int,
hit int
)
row format delimited fields terminated by ",";
Load the data:
load data local inpath "/root/hive/hitmouse.txt" into table tb_hitmouse;
Check the load:
select * from tb_hitmouse;
+------------------+--------------------+------------------+
| tb_hitmouse.uid | tb_hitmouse.seq | tb_hitmouse.hit |
+------------------+--------------------+------------------+
| u01 | 1 | 1 |
| u01 | 2 | 0 |
| u01 | 3 | 1 |
| u01 | 4 | 1 |
| u01 | 5 | 0 |
| u01 | 6 | 1 |
| u02 | 1 | 1 |
| u02 | 2 | 1 |
| u02 | 3 | 0 |
| u02 | 4 | 1 |
| u02 | 5 | 1 |
| u02 | 6 | 0 |
| u02 | 7 | 0 |
| u02 | 8 | 1 |
| u02 | 9 | 1 |
| u03 | 1 | 1 |
| u03 | 2 | 1 |
| u03 | 3 | 1 |
| u03 | 4 | 1 |
| u03 | 5 | 1 |
| u03 | 6 | 0 |
+------------------+--------------------+------------------+
III. Implementation approach and steps
1. Filter out the rows the user missed, keeping only the hits (where hit = 1), then partition by uid and number each row within the partition with row_number():
select
* ,
row_number() over(partition by uid) rownumb
from
tb_hitmouse
where
hit=1;
+------------------+--------------------+------------------+----------------------+
| tb_hitmouse.uid  | tb_hitmouse.seq    | tb_hitmouse.hit  | row_number_window_0  |
+------------------+--------------------+------------------+----------------------+
| u01 | 1 | 1 | 1 |
| u01 | 3 | 1 | 2 |
| u01 | 4 | 1 | 3 |
| u01 | 6 | 1 | 4 |
| u02 | 5 | 1 | 1 |
| u02 | 9 | 1 | 2 |
| u02 | 8 | 1 | 3 |
| u02 | 4 | 1 | 4 |
| u02 | 2 | 1 | 5 |
| u02 | 1 | 1 | 6 |
| u03 | 5 | 1 | 1 |
| u03 | 2 | 1 | 2 |
| u03 | 1 | 1 | 3 |
| u03 | 3 | 1 | 4 |
| u03 | 4 | 1 | 5 |
+------------------+--------------------+------------------+----------------------+
2. Building on step 1, also order the rows within each partition by seq:
select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1;
+------------------+------------------+------------------+----------+
| tb_hitmouse.uid | tb_hitmouse.seq | tb_hitmouse.hit | rownumb |
+------------------+------------------+------------------+----------+
| u01 | 1 | 1 | 1 |
| u01 | 3 | 1 | 2 |
| u01 | 4 | 1 | 3 |
| u01 | 6 | 1 | 4 |
| u02 | 1 | 1 | 1 |
| u02 | 2 | 1 | 2 |
| u02 | 4 | 1 | 3 |
| u02 | 5 | 1 | 4 |
| u02 | 8 | 1 | 5 |
| u02 | 9 | 1 | 6 |
| u03 | 1 | 1 | 1 |
| u03 | 2 | 1 | 2 |
| u03 | 3 | 1 | 3 |
| u03 | 4 | 1 | 4 |
| u03 | 5 | 1 | 5 |
+------------------+------------------+------------------+----------+
3. seq and rownumb are both integers, so they can be subtracted directly; rows sharing the resulting diff value belong to one run of consecutive hits:
select
*,
(seq - rownumb) diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1)t;
+--------+--------+--------+------------+-------+
| t.uid | t.seq | t.hit | t.rownumb | diff |
+--------+--------+--------+------------+-------+
| u01 | 1 | 1 | 1 | 0 |
| u01 | 3 | 1 | 2 | 1 |
| u01 | 4 | 1 | 3 | 1 |
| u01 | 6 | 1 | 4 | 2 |
| u02 | 1 | 1 | 1 | 0 |
| u02 | 2 | 1 | 2 | 0 |
| u02 | 4 | 1 | 3 | 1 |
| u02 | 5 | 1 | 4 | 1 |
| u02 | 8 | 1 | 5 | 3 |
| u02 | 9 | 1 | 6 | 3 |
| u03 | 1 | 1 | 1 | 0 |
| u03 | 2 | 1 | 2 | 0 |
| u03 | 3 | 1 | 3 | 0 |
| u03 | 4 | 1 | 4 | 0 |
| u03 | 5 | 1 | 5 | 0 |
+--------+--------+--------+------------+-------+
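The integer variant of the trick can be checked in a couple of lines of Python (an illustration, not part of the tutorial), using u02's hit attempts from the table above:

```python
# seq values where u02 scored a hit, already in ascending order
seqs = [1, 2, 4, 5, 8, 9]

# seq - rownumb, where rownumb is the 1-based position within the partition
diffs = [seq - rn for rn, seq in enumerate(seqs, start=1)]
print(diffs)  # [0, 0, 1, 1, 3, 3] — each distinct value marks one streak
```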
4. Building on step 3, group by uid and diff and count the rows in each group; the count cc is the length of each hit streak:
select
uid,
diff,
count(1) cc
from
(select
*,
(seq-rownumb)diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1)t)t1
group by uid,diff
;
+------+-------+-----+
| uid | diff | cc |
+------+-------+-----+
| u01 | 0 | 1 |
| u01 | 1 | 2 |
| u01 | 2 | 1 |
| u02 | 0 | 2 |
| u02 | 1 | 2 |
| u02 | 3 | 2 |
| u03 | 0 | 5 |
+------+-------+-----+
5. Building on step 4, filter on cc to keep the users with cc > 3 (for "at least 3 consecutive hits" use cc >= 3; with this data both give the same result):
select
uid,
diff,
count(1) cc
from
(select
*,
(seq-rownumb)diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1)t)t1
group by uid,diff
having cc>3;
+------+-------+-----+
| uid | diff | cc |
+------+-------+-----+
| u03 | 0 | 5 |
+------+-------+-----+
6. Deduplicate the users:
select
distinct uid
from
(select
uid,
diff,
count(1) cc
from
(select
*,
(seq-rownumb)diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1)t)t1
group by uid,diff
having cc>3)t2;
+------+
| uid  |
+------+
| u03  |
+------+
Requirement answered: u03 is the user who hit the mole 3 (or more) times in a row.
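As with case 1, the full case-2 pipeline can be replayed in Python as a sanity check (an independent sketch; `streak_users` and `min_hits` are made-up names):

```python
from collections import Counter
from itertools import groupby

rows = [  # (uid, seq, hit) from tb_hitmouse
    ("u01", 1, 1), ("u01", 2, 0), ("u01", 3, 1), ("u01", 4, 1),
    ("u01", 5, 0), ("u01", 6, 1),
    ("u02", 1, 1), ("u02", 2, 1), ("u02", 3, 0), ("u02", 4, 1),
    ("u02", 5, 1), ("u02", 6, 0), ("u02", 7, 0), ("u02", 8, 1),
    ("u02", 9, 1),
    ("u03", 1, 1), ("u03", 2, 1), ("u03", 3, 1), ("u03", 4, 1),
    ("u03", 5, 1), ("u03", 6, 0),
]

def streak_users(rows, min_hits=4):  # min_hits=4 mirrors "having cc > 3"
    hits = sorted((uid, seq) for uid, seq, hit in rows if hit == 1)  # where hit = 1
    keys = []
    for uid, part in groupby(hits, key=lambda r: r[0]):
        for rn, (_, seq) in enumerate(part, start=1):
            keys.append((uid, seq - rn))  # diff = seq - rownumb
    counts = Counter(keys)  # group by (uid, diff), count(1) cc
    return sorted({uid for (uid, _), cc in counts.items() if cc >= min_hits})

print(streak_users(rows))  # ['u03'] — matches the Hive result
```

Only u03 has a streak longer than 3 hits (5 in a row), so the sketch agrees with the Hive output above.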