Case 1: Find the shops that had sales on 3 consecutive days

I   Data source: write the source data (structured data) into a new local file

Data source:
name,ctime,cost
a,2020-02-10,600
a,2020-03-01,200
a,2020-03-02,300
a,2020-03-03,200
a,2020-03-04,400
a,2020-03-05,600
a,2020-02-05,200
a,2020-02-06,300
a,2020-02-07,200
a,2020-02-08,400
b,2020-02-05,200
b,2020-02-06,300
b,2020-02-08,200
b,2020-02-09,400
b,2020-02-10,600
c,2020-01-31,200
c,2020-02-01,300
c,2020-02-02,200
c,2020-02-03,400
c,2020-02-10,600

Write the data to a local file on Linux:
vi /root/hive/business/sell.log

II   Create the table, load the data file into it, and check that the data loaded

Drop the table if it already exists:
drop table tb_sell;

Create the table:
create table tb_sell(
name string,
ctime string,
cost double
)
row format delimited fields terminated by ",";

Load the data:
load data local inpath "/root/hive/business/sell.log" into table tb_sell;

Check that the data loaded:
select * from tb_sell;
+---------------+----------------+---------------+
| tb_sell.name  | tb_sell.ctime  | tb_sell.cost  |
+---------------+----------------+---------------+
| a             | 2020-02-10     | 600.0         |
| a             | 2020-03-01     | 200.0         |
| a             | 2020-03-02     | 300.0         |
| a             | 2020-03-03     | 200.0         |
| a             | 2020-03-04     | 400.0         |
| a             | 2020-03-05     | 600.0         |
| a             | 2020-02-05     | 200.0         |
| a             | 2020-02-06     | 300.0         |
| a             | 2020-02-07     | 200.0         |
| a             | 2020-02-08     | 400.0         |
| b             | 2020-02-05     | 200.0         |
| b             | 2020-02-06     | 300.0         |
| b             | 2020-02-08     | 200.0         |
| b             | 2020-02-09     | 400.0         |
| b             | 2020-02-10     | 600.0         |
| c             | 2020-01-31     | 200.0         |
| c             | 2020-02-01     | 300.0         |
| c             | 2020-02-02     | 200.0         |
| c             | 2020-02-03     | 400.0         |
| c             | 2020-02-10     | 600.0         |
+---------------+----------------+---------------+
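
As a quick extra check (a sketch, not part of the original walkthrough; it assumes the 20-row sample above loaded cleanly), the row count should match the file:

select count(*) from tb_sell;   -- expected: 20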

III   Implementation approach / steps

1   Put each shop's rows into one window (partition by name) and number the rows within each window

select
*,
row_number() over(partition by name)   -- partition by name and number the rows within each partition
from
tb_sell;
+---------------+----------------+---------------+----------------------+
| tb_sell.name  | tb_sell.ctime  | tb_sell.cost  | row_number_window_0  |
+---------------+----------------+---------------+----------------------+
| a             | 2020-02-10     | 600.0         | 1                    |
| a             | 2020-02-08     | 400.0         | 2                    |
| a             | 2020-02-07     | 200.0         | 3                    |
| a             | 2020-02-06     | 300.0         | 4                    |
| a             | 2020-02-05     | 200.0         | 5                    |
| a             | 2020-03-05     | 600.0         | 6                    |
| a             | 2020-03-04     | 400.0         | 7                    |
| a             | 2020-03-03     | 200.0         | 8                    |
| a             | 2020-03-02     | 300.0         | 9                    |
| a             | 2020-03-01     | 200.0         | 10                   |
| b             | 2020-02-10     | 600.0         | 1                    |
| b             | 2020-02-09     | 400.0         | 2                    |
| b             | 2020-02-08     | 200.0         | 3                    |
| b             | 2020-02-06     | 300.0         | 4                    |
| b             | 2020-02-05     | 200.0         | 5                    |
| c             | 2020-02-10     | 600.0         | 1                    |
| c             | 2020-02-03     | 400.0         | 2                    |
| c             | 2020-02-02     | 200.0         | 3                    |
| c             | 2020-02-01     | 300.0         | 4                    |
| c             | 2020-01-31     | 200.0         | 5                    |
+---------------+----------------+---------------+----------------------+

2   Because row_number without ORDER BY numbers the rows in an arbitrary order, sort by ctime inside each partition so the later arithmetic works, and give the row-number column an alias

select
*,
row_number() over(partition by name order by ctime) rn   -- the column previously shown as row_number_window_0 is now aliased rn
from
tb_sell;
+---------------+----------------+---------------+-----+
| tb_sell.name  | tb_sell.ctime  | tb_sell.cost  | rn  |
+---------------+----------------+---------------+-----+
| a             | 2020-02-05     | 200.0         | 1   |
| a             | 2020-02-06     | 300.0         | 2   |
| a             | 2020-02-07     | 200.0         | 3   |
| a             | 2020-02-08     | 400.0         | 4   |
| a             | 2020-02-10     | 600.0         | 5   |
| a             | 2020-03-01     | 200.0         | 6   |
| a             | 2020-03-02     | 300.0         | 7   |
| a             | 2020-03-03     | 200.0         | 8   |
| a             | 2020-03-04     | 400.0         | 9   |
| a             | 2020-03-05     | 600.0         | 10  |
| b             | 2020-02-05     | 200.0         | 1   |
| b             | 2020-02-06     | 300.0         | 2   |
| b             | 2020-02-08     | 200.0         | 3   |
| b             | 2020-02-09     | 400.0         | 4   |
| b             | 2020-02-10     | 600.0         | 5   |
| c             | 2020-01-31     | 200.0         | 1   |
| c             | 2020-02-01     | 300.0         | 2   |
| c             | 2020-02-02     | 200.0         | 3   |
| c             | 2020-02-03     | 400.0         | 4   |
| c             | 2020-02-10     | 600.0         | 5   |
+---------------+----------------+---------------+-----+

3   Subtract rn days from ctime with date_sub; rows whose difference is the same belong to one unbroken run of consecutive sales days

select
*,
date_sub(ctime, rn) as diff
from
(
select
*,
row_number() over(partition by name order by ctime) rn
from
tb_sell) t1;
+----------+-------------+----------+--------+-------------+
| t1.name  |  t1.ctime   | t1.cost  | t1.rn  |    diff     |
+----------+-------------+----------+--------+-------------+
| a        | 2020-02-05  | 200.0    | 1      | 2020-02-04  |
| a        | 2020-02-06  | 300.0    | 2      | 2020-02-04  |
| a        | 2020-02-07  | 200.0    | 3      | 2020-02-04  |
| a        | 2020-02-08  | 400.0    | 4      | 2020-02-04  |
| a        | 2020-02-10  | 600.0    | 5      | 2020-02-05  |
| a        | 2020-03-01  | 200.0    | 6      | 2020-02-24  |
| a        | 2020-03-02  | 300.0    | 7      | 2020-02-24  |
| a        | 2020-03-03  | 200.0    | 8      | 2020-02-24  |
| a        | 2020-03-04  | 400.0    | 9      | 2020-02-24  |
| a        | 2020-03-05  | 600.0    | 10     | 2020-02-24  |
| b        | 2020-02-05  | 200.0    | 1      | 2020-02-04  |
| b        | 2020-02-06  | 300.0    | 2      | 2020-02-04  |
| b        | 2020-02-08  | 200.0    | 3      | 2020-02-05  |
| b        | 2020-02-09  | 400.0    | 4      | 2020-02-05  |
| b        | 2020-02-10  | 600.0    | 5      | 2020-02-05  |
| c        | 2020-01-31  | 200.0    | 1      | 2020-01-30  |
| c        | 2020-02-01  | 300.0    | 2      | 2020-01-30  |
| c        | 2020-02-02  | 200.0    | 3      | 2020-01-30  |
| c        | 2020-02-03  | 400.0    | 4      | 2020-01-30  |
| c        | 2020-02-10  | 600.0    | 5      | 2020-02-05  |
+----------+-------------+----------+--------+-------------+
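
To see why the subtraction works: date_sub shifts a date back by a given number of days, so every row of one consecutive run collapses onto the same anchor date. A minimal illustration (not part of the original walkthrough):

select date_sub('2020-02-05', 1);   -- returns 2020-02-04
select date_sub('2020-02-06', 2);   -- also returns 2020-02-04: two consecutive days, same anchor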

4   Count the rows that share the same diff: group by name and diff; count(1) per group gives the number of consecutive sales days in that run, and the having clause then keeps only the runs that are long enough

select
name, diff,
count(1) days          -- count the rows that share the same diff
from
(
select
*,
date_sub(ctime, rn) as diff
from
(
select
*,
row_number() over(partition by name order by ctime) rn
from
tb_sell) t1) t2
group by name, diff    -- group by name and diff
having days > 3;       -- keep only the groups whose count exceeds 3
+-------+-------------+-------+
| name  |    diff     | days  |
+-------+-------------+-------+
| a     | 2020-02-04  | 4     |
| a     | 2020-02-24  | 5     |
| c     | 2020-01-30  | 4     |
+-------+-------------+-------+

5   A shop can have more than one qualifying run, so deduplicate the shop names

select
distinct name    -- deduplicate the shop names
from
( select
name,diff,
count(1) days
from
(
select
*,
date_sub(ctime,rn)as diff
from
(
select
*,
row_number() over(partition by name order by ctime)rn
from
tb_sell)t1
)t2
group by name,diff
having days>3) t3;
+-------+
| name  |
+-------+
| a     |
+-------+
| c     |
+-------+
Requirement answered: these are the shops that had sales on 3 consecutive days.
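
Note that having days > 3 keeps only runs strictly longer than 3 days. If the requirement is read as "at least 3 consecutive days", the threshold would be >= 3; a sketch of the adjusted final query (with this data it would also return shop b, which sold on exactly 3 consecutive days, 2020-02-08 to 2020-02-10):

select
distinct name
from
( select
name, diff,
count(1) days
from
(
select
*,
date_sub(ctime, rn) as diff
from
(
select
*,
row_number() over(partition by name order by ctime) rn
from
tb_sell) t1
) t2
group by name, diff
having days >= 3) t3;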

Case 2: Find the users who hit the groundhog 3 times in a row

I   Data source: write the data to a local file on Linux

Data source:
uid,seq,hit   (seq is the attempt number; hit: 1 = hit, 0 = miss)
u01,1,1
u01,2,0
u01,3,1
u01,4,1
u01,5,0
u01,6,1
u02,1,1
u02,2,1
u02,3,0
u02,4,1
u02,5,1
u02,6,0
u02,7,0
u02,8,1
u02,9,1
u03,1,1
u03,2,1
u03,3,1
u03,4,1
u03,5,1
u03,6,0

Write the data to a local file on Linux:
vi /root/hive/hitmouse.txt

II   Create the table, load the data into it, and check that the data loaded

Drop the table if it already exists:
drop table tb_hitmouse;

Create an ordinary managed (internal) table (for a managed table, dropping the table also removes the files under the table's directory):
create table tb_hitmouse(
uid string,
seq int,
hit int
)
row format delimited fields terminated by ",";

Load the data:
load data local inpath "/root/hive/hitmouse.txt" into table tb_hitmouse;

Check that the data loaded:
select * from tb_hitmouse;
+------------------+--------------------+------------------+
| tb_hitmouse.uid  | tb_hitmouse.seq    | tb_hitmouse.hit  |
+------------------+--------------------+------------------+
| u01              | 1                  | 1                |
| u01              | 2                  | 0                |
| u01              | 3                  | 1                |
| u01              | 4                  | 1                |
| u01              | 5                  | 0                |
| u01              | 6                  | 1                |
| u02              | 1                  | 1                |
| u02              | 2                  | 1                |
| u02              | 3                  | 0                |
| u02              | 4                  | 1                |
| u02              | 5                  | 1                |
| u02              | 6                  | 0                |
| u02              | 7                  | 0                |
| u02              | 8                  | 1                |
| u02              | 9                  | 1                |
| u03              | 1                  | 1                |
| u03              | 2                  | 1                |
| u03              | 3                  | 1                |
| u03              | 4                  | 1                |
| u03              | 5                  | 1                |
| u03              | 6                  | 0                |
+------------------+--------------------+------------------+
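
As an optional extra check (a sketch, not part of the original walkthrough; column names follow the table definition above), a per-user summary shows how many attempts and hits each user has before any filtering:

select uid, count(*) as attempts, sum(hit) as hits
from tb_hitmouse
group by uid;   -- expected from the sample data: u01 6/4, u02 9/6, u03 6/5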

III   Implementation approach and steps

1   Filter out the misses and keep only the hits (where hit=1), then partition by user and number each row within the partition with row_number()

select
* ,
row_number() over(partition by uid) rownumb
from
tb_hitmouse
where
hit=1;
+------------------+--------------------+------------------+----------------------+
| tb_hitmouse.uid  | tb_hitmouse.seq    | tb_hitmouse.hit  | rownumb              |
+------------------+--------------------+------------------+----------------------+
| u01              | 1                  | 1                | 1                    |
| u01              | 3                  | 1                | 2                    |
| u01              | 4                  | 1                | 3                    |
| u01              | 6                  | 1                | 4                    |
| u02              | 5                  | 1                | 1                    |
| u02              | 9                  | 1                | 2                    |
| u02              | 8                  | 1                | 3                    |
| u02              | 4                  | 1                | 4                    |
| u02              | 2                  | 1                | 5                    |
| u02              | 1                  | 1                | 6                    |
| u03              | 5                  | 1                | 1                    |
| u03              | 2                  | 1                | 2                    |
| u03              | 1                  | 1                | 3                    |
| u03              | 3                  | 1                | 4                    |
| u03              | 4                  | 1                | 5                    |
+------------------+--------------------+------------------+----------------------+

2   Building on the above, also order by the seq field within each partition

select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1;
+------------------+------------------+------------------+----------+
| tb_hitmouse.uid  | tb_hitmouse.seq  | tb_hitmouse.hit  | rownumb  |
+------------------+------------------+------------------+----------+
| u01              | 1                | 1                | 1        |
| u01              | 3                | 1                | 2        |
| u01              | 4                | 1                | 3        |
| u01              | 6                | 1                | 4        |
| u02              | 1                | 1                | 1        |
| u02              | 2                | 1                | 2        |
| u02              | 4                | 1                | 3        |
| u02              | 5                | 1                | 4        |
| u02              | 8                | 1                | 5        |
| u02              | 9                | 1                | 6        |
| u03              | 1                | 1                | 1        |
| u03              | 2                | 1                | 2        |
| u03              | 3                | 1                | 3        |
| u03              | 4                | 1                | 4        |
| u03              | 5                | 1                | 5        |
+------------------+------------------+------------------+----------+

3   Building on the above: seq and rownumb are both integers, so they can be subtracted directly; rows with the same diff form one unbroken run of consecutive hits

select
*,
(seq-rownumb)diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1)t;
+--------+--------+--------+------------+-------+
| t.uid  | t.seq  | t.hit  | t.rownumb  | diff  |
+--------+--------+--------+------------+-------+
| u01    | 1      | 1      | 1          | 0     |
| u01    | 3      | 1      | 2          | 1     |
| u01    | 4      | 1      | 3          | 1     |
| u01    | 6      | 1      | 4          | 2     |
| u02    | 1      | 1      | 1          | 0     |
| u02    | 2      | 1      | 2          | 0     |
| u02    | 4      | 1      | 3          | 1     |
| u02    | 5      | 1      | 4          | 1     |
| u02    | 8      | 1      | 5          | 3     |
| u02    | 9      | 1      | 6          | 3     |
| u03    | 1      | 1      | 1          | 0     |
| u03    | 2      | 1      | 2          | 0     |
| u03    | 3      | 1      | 3          | 0     |
| u03    | 4      | 1      | 4          | 0     |
| u03    | 5      | 1      | 5          | 0     |
+--------+--------+--------+------------+-------+

4   Building on the above, group by uid and diff, then count the rows in each group; the count cc is the length of that run of consecutive hits

select
uid,
diff,
count(1) cc
from
(select
*,
(seq-rownumb)diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1)t)t1
group by uid,diff
;
+------+-------+-----+
| uid  | diff  | cc  |
+------+-------+-----+
| u01  | 0     | 1   |
| u01  | 1     | 2   |
| u01  | 2     | 1   |
| u02  | 0     | 2   |
| u02  | 1     | 2   |
| u02  | 3     | 2   |
| u03  | 0     | 5   |
+------+-------+-----+
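
A side check that makes cc easier to read (a sketch, not part of the original walkthrough; it reuses the step-4 query inside a CTE, which needs Hive 0.13+): each user's longest streak of consecutive hits.

with runs as (
select
uid,
diff,
count(1) cc
from
(select
*,
(seq-rownumb) diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1) t) t1
group by uid, diff
)
select uid, max(cc) as longest_streak
from runs
group by uid;   -- expected from the table above: u01 -> 2, u02 -> 2, u03 -> 5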

5   Building on the above, filter on cc, keeping only users with cc > 3 (with this data, cc >= 3, i.e. at least 3 hits in a row, returns the same single user)

select
uid,
diff,
count(1) cc
from
(select
*,
(seq-rownumb)diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1)t)t1
group by uid,diff
having cc>3;
+------+-------+-----+
| uid  | diff  | cc  |
+------+-------+-----+
| u03  | 0     | 5   |
+------+-------+-----+

6   Deduplicate the users

select
distinct uid
from
(select
uid,
diff,
count(1) cc
from
(select
*,
(seq-rownumb)diff
from
(select
* ,
row_number() over(partition by uid order by seq) rownumb
from
tb_hitmouse
where
hit=1)t)t1
group by uid,diff
having cc>3)t2;
+------+
| uid  |
+------+
| u03  |
+------+
Requirement answered: u03 is the user who hit the groundhog 3 times in a row.
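
The same pipeline can also be written with CTEs, which keeps each step readable without deep nesting (a sketch under the assumption of Hive 0.13+ for WITH support; it uses >= 3 for "3 hits in a row", which returns the same result, u03, on this data):

with hits as (
select *, row_number() over(partition by uid order by seq) rownumb
from tb_hitmouse
where hit = 1
),
runs as (
select uid, count(1) cc
from hits
group by uid, (seq - rownumb)
)
select distinct uid
from runs
where cc >= 3;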
