文章目录

1.概述
- 1.1.窗口函数：
- 1.2.标准SQL语法
- 1.3.分析函数分类：
2.Top N:
3.案例

1.概述

转载：Clickhouse 分析函数 window functions 窗口函数

clickhouse 暂时么有提供标准SQL的分析函数，但是有计划添加此功能，预计2020年第四季度的早期实现。

在clickhouse 2020年第四季度之前可以使用clickhouse提供的函数来实现。

1.1.窗口函数：

Online Anallytical Processing，联机分析处理，可以对数据库数据进行实时分析处理

1.2.标准SQL语法

分析函数 over（partition by 列名 order by 列名 ）

1.3.分析函数分类：

聚合类

avg(列名)、sum(列名)、count(列名)、max(列名)、min(列名)

排名类

row_number() 按照值排序时产生一个自增编号，不会重复
rank() 按照值排序时产生一个自增编号，值相等时会重复，会产生空位
dense_rank() 按照值排序时产生一个自增编号，值相等时会重复，不会产生空位

注意：排名类分析函数不需要任何参数。

其他类

lag(列名,往前的行数,[行数为null时的默认值，不指定为null])
lead(列名,往后的行数,[行数为null时的默认值，不指定为null])
ntile(n) 用于将分组数据按照顺序切分成n片，返回当前切片值，如果切片不均匀，默认增加第一个切片的分布。

概述：
window function时 SQL：2003 新加入的标准。
常见的分析函数：

1.排名(rownumber,rank,dense_rank)2.sum() over()3.count() over()4.lead()/lag() over()5.ntile(n) over()

1.排名函数：

Clickhouse没有直接提供对应的开窗函数，需要使用clickhouse提供的函数来变相实现，主要涉及以下数组函数：

arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq

简单示例：

Clickhouse> with (select [100,200,300,200,400,500] ) as arr select arrayEnumerate(arr) as rownumber,arrayEnumerateDense(arr) as dense_rank,arrayEnumerateUniq(arr) as uniq_rank;WITH (SELECT [100, 200, 300, 200, 400, 500]) AS arr
SELECT arrayEnumerate(arr) AS rownumber, arrayEnumerateDense(arr) AS dense_rank, arrayEnumerateUniq(arr) AS uniq_rank┌─rownumber─────┬─dense_rank────┬─uniq_rank─────┐
│ [1,2,3,4,5,6] │ [1,2,3,2,4,5] │ [1,1,1,2,1,1] │
└───────────────┴───────────────┴───────────────┘1 rows in set. Elapsed: 0.004 sec.

arrayEnumerate 等同于row_number（）
arrayEnumerateDense 等同于dense_rank()
arrayEnumerateUniq 只返回了元素第一次出现的次数

数据准备：

CREATE TABLE t_data
ENGINE = Memory AS
WITH (SELECT ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', '90', '80', '90', '88', '90', '100', '77', '99', '90', '80']) AS dict
SELECT dict[(number % 10) + 1] AS id, dict[number - 10] AS score
FROM system.numbers
LIMIT 10SELECT *
FROM t_data┌─id─┬─score─┐
│ A  │ 90    │
│ A  │ 80    │
│ A  │ 90    │
│ A  │ 88    │
│ B  │ 90    │
│ B  │ 100   │
│ B  │ 77    │
│ C  │ 99    │
│ C  │ 90    │
│ C  │ 80    │
└────┴───────┘10 rows in set. Elapsed: 0.002 sec.

– 先查询：

SELECT id, groupArray(score) AS arr, arrayEnumerate(arr) AS rownumber, arrayEnumerateDense(arr) AS dense_rank, arrayEnumerateUniq(arr) AS uniq_rank
FROM
(SELECT *FROM t_dataORDER BY score DESC
)
GROUP BY id
ORDER BY id ASC┌─id─┬─arr───────────────────┬─rownumber─┬─dense_rank─┬─uniq_rank─┐
│ A  │ ['90','90','88','80'] │ [1,2,3,4] │ [1,1,2,3]  │ [1,2,1,1] │
│ B  │ ['90','77','100']     │ [1,2,3]   │ [1,2,3]    │ [1,1,1]   │
│ C  │ ['99','90','80']      │ [1,2,3]   │ [1,2,3]    │ [1,1,1]   │
└────┴───────────────────────┴───────────┴────────────┴───────────┘3 rows in set. Elapsed: 0.004 sec.

– 最终的查询：

SELECT id, score, rownumber, dense_rank, uniq_rank
FROM
(SELECT id, groupArray(score) AS arr, arrayEnumerate(arr) AS rownumber, arrayEnumerateDense(arr) AS dense_rank, arrayEnumerateUniq(arr) AS uniq_rankFROM (SELECT *FROM t_dataORDER BY score DESC)GROUP BY idORDER BY id ASC
)
ARRAY JOIN arr AS score, rownumber, dense_rank, uniq_rank
ORDER BY id ASC, rownumber ASC, dense_rank ASC┌─id─┬─score─┬─rownumber─┬─dense_rank─┬─uniq_rank─┐
│ A  │ 90    │         1 │          1 │         1 │
│ A  │ 90    │         2 │          1 │         2 │
│ A  │ 88    │         3 │          2 │         1 │
│ A  │ 80    │         4 │          3 │         1 │
│ B  │ 90    │         1 │          1 │         1 │
│ B  │ 77    │         2 │          2 │         1 │
│ B  │ 100   │         3 │          3 │         1 │
│ C  │ 99    │         1 │          1 │         1 │
│ C  │ 90    │         2 │          2 │         1 │
│ C  │ 80    │         3 │          3 │         1 │
└────┴───────┴───────────┴────────────┴───────────┘10 rows in set. Elapsed: 0.006 sec.

可以看到rank()函数的还是没有实现。

2.Top N:

取每个用户的Top 2 的分数：

可以看到id的分数可能有重复，需要使用distinct去重。

Clickhouse> select id,score,d_rank from (select id,score,d_rank from(select id,groupArray(score) as arr,arrayEnumerateDense(arr) AS d_rank from(select distinct id,score from t_data order by score desc) group by id) array join arr as score,d_rank order by id asc,d_rank asc) where d_rank <3;SELECT id, score, d_rank
FROM
(SELECT id, score, d_rankFROM (SELECT id, groupArray(score) AS arr, arrayEnumerateDense(arr) AS d_rankFROM (SELECT DISTINCT id, scoreFROM t_dataORDER BY score DESC)GROUP BY id)ARRAY JOIN arr AS score, d_rankORDER BY id ASC, d_rank ASC
)
WHERE d_rank < 3┌─id─┬─score─┬─d_rank─┐
│ A  │ 90    │      1 │
│ A  │ 88    │      2 │
│ B  │ 90    │      1 │
│ B  │ 77    │      2 │
│ C  │ 99    │      1 │
│ C  │ 90    │      2 │
└────┴───────┴────────┘6 rows in set. Elapsed: 0.014 sec.

3.同比环比计算：

同比：今年当月和去年的同一个月的数据，比如现在是2020年7月份，则需要和去年7月份的数据比较。英文名为Year-over-Year

环比：今年当月和今年的上一个月的数据比较，比如现在是2020年7月份则需要和2020年的6月份的数据比较，英文名为Month-over-Month

同比增长率= (本月数据-去年同月数据）/去年同月的数据环比增长率=（本月数据 - 上月数据)/上月数据

在clickhouse中没有提供标准SQL种的lead 和lag函数。我们可以借助于clickhouse的提供的neighbor函数来实现。

neighbor(column, offset[, default_value])

The result of the function depends on the affected data blocks and the order of data in the block.
If you make a subquery with ORDER BY and call the function from outside the subquery, you can get the expected result.

Parameters

column — A column name or scalar expression.
offset — The number of rows forwards or backwards from the current row of column. Int64.
default_value — Optional. The value to be returned if offset goes beyond the scope of the block. Type of data blocks affected.
Returned values

Value for column in offset distance from current row if offset value is not outside block bounds.
Default value for column if offset value is outside block bounds. If default_value is given, then it will be used.
Type: type of data blocks affected or default value type.

参考：
https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/
数据准备：

Clickhouse> with toDate('2019-01-01') as start_date select toStartOfMonth(start_date + (number*32)) month_start,(number+10)*100 amount from numbers(24);WITH toDate('2019-01-01') AS start_date
SELECT toStartOfMonth(start_date + (number * 32)) AS month_start, (number + 10) * 100 AS amount
FROM numbers(24)┌─month_start─┬─amount─┐
│  2019-01-01 │   1000 │
│  2019-02-01 │   1100 │
│  2019-03-01 │   1200 │
│  2019-04-01 │   1300 │
│  2019-05-01 │   1400 │
│  2019-06-01 │   1500 │
│  2019-07-01 │   1600 │
│  2019-08-01 │   1700 │
│  2019-09-01 │   1800 │
│  2019-10-01 │   1900 │
│  2019-11-01 │   2000 │
│  2019-12-01 │   2100 │
│  2020-01-01 │   2200 │
│  2020-02-01 │   2300 │
│  2020-03-01 │   2400 │
│  2020-04-01 │   2500 │
│  2020-05-01 │   2600 │
│  2020-06-01 │   2700 │
│  2020-07-01 │   2800 │
│  2020-08-01 │   2900 │
│  2020-10-01 │   3000 │
│  2020-11-01 │   3100 │
│  2020-12-01 │   3200 │
│  2021-01-01 │   3300 │
└─────────────┴────────┘24 rows in set. Elapsed: 0.003 sec. WITH toDate('2019-01-01') AS start_date
SELECT toStartOfMonth(start_date + (number * 32)) AS month_start, (number + 10) * 100 AS amount, neighbor(amount, -12) AS prev_year_amount, neighbor(amount, -1) AS prev_month_amount, if(prev_year_amount = 0, -999, round((amount - prev_year_amount) / prev_year_amount, 2)) AS year_over_year, if(prev_month_amount = 0, -999, round((amount - prev_month_amount) / prev_month_amount, 2)) AS month_over_month
FROM numbers(24)┌─month_start─┬─amount─┬─prev_year_amount─┬─prev_month_amount─┬─year_over_year─┬─month_over_month─┐
│  2019-01-01 │   1000 │                0 │                 0 │           -999 │             -999 │
│  2019-02-01 │   1100 │                0 │              1000 │           -999 │              0.1 │
│  2019-03-01 │   1200 │                0 │              1100 │           -999 │             0.09 │
│  2019-04-01 │   1300 │                0 │              1200 │           -999 │             0.08 │
│  2019-05-01 │   1400 │                0 │              1300 │           -999 │             0.08 │
│  2019-06-01 │   1500 │                0 │              1400 │           -999 │             0.07 │
│  2019-07-01 │   1600 │                0 │              1500 │           -999 │             0.07 │
│  2019-08-01 │   1700 │                0 │              1600 │           -999 │             0.06 │
│  2019-09-01 │   1800 │                0 │              1700 │           -999 │             0.06 │
│  2019-10-01 │   1900 │                0 │              1800 │           -999 │             0.06 │
│  2019-11-01 │   2000 │                0 │              1900 │           -999 │             0.05 │
│  2019-12-01 │   2100 │                0 │              2000 │           -999 │             0.05 │
│  2020-01-01 │   2200 │             1000 │              2100 │            1.2 │             0.05 │
│  2020-02-01 │   2300 │             1100 │              2200 │           1.09 │             0.05 │
│  2020-03-01 │   2400 │             1200 │              2300 │              1 │             0.04 │
│  2020-04-01 │   2500 │             1300 │              2400 │           0.92 │             0.04 │
│  2020-05-01 │   2600 │             1400 │              2500 │           0.86 │             0.04 │
│  2020-06-01 │   2700 │             1500 │              2600 │            0.8 │             0.04 │
│  2020-07-01 │   2800 │             1600 │              2700 │           0.75 │             0.04 │
│  2020-08-01 │   2900 │             1700 │              2800 │           0.71 │             0.04 │
│  2020-10-01 │   3000 │             1800 │              2900 │           0.67 │             0.03 │
│  2020-11-01 │   3100 │             1900 │              3000 │           0.63 │             0.03 │
│  2020-12-01 │   3200 │             2000 │              3100 │            0.6 │             0.03 │
│  2021-01-01 │   3300 │             2100 │              3200 │           0.57 │             0.03 │
└─────────────┴────────┴──────────────────┴───────────────────┴────────────────┴──────────────────┘24 rows in set. Elapsed: 0.003 sec.

结论：

3.案例

clickhouse数据模型之有序漏斗分析

参考：

参考：
https://en.wikipedia.org/wiki/SQL_window_function
https://en.wikipedia.org/wiki/SQL:2003

https://github.com/ClickHouse/ClickHouse/issues/9887
window view：
https://github.com/ClickHouse/ClickHouse/pull/8331

https://github.com/ClickHouse/ClickHouse/pull/5925
特殊函数：
sequenceMatch
windowfunnel

https://github.com/ClickHouse/ClickHouse/issues/5485
https://github.com/ClickHouse/ClickHouse/issues/5132
https://github.com/ClickHouse/ClickHouse/issues/1469

https://clickhouse.tech/docs/en/sql-reference/functions/other-functions/

【Clickhouse】Clickhouse 分析函数 window functions 窗口函数相关推荐

MariaDB Window Functions窗口函数分组取TOP N记录
窗口函数在MariaDB10.2版本里实现,其简化了复杂SQL的撰写,提高了可读性. 在某些方面,窗口函数类似于聚集函数, 但它不像聚集函数那样每组只返回一个值,窗口函数可以为每组返回多个值. 作为一 ...
postgresql Window Functions
Window Functions 窗口函数不会像非窗口聚合调用那样使行分组到单个输出行中.相反,这些行保留了它们单独的身份.不可以和GROUP BY. WHERE子句一起用.如果在执行窗口计算后需要过 ...
mysql 窗口函数最新一条_MySQL 8.0 窗口函数(window functions)
窗口函数(window functions)是数据库的标准功能之一,主流的数据库比如Oracle,PostgreSQL都支持窗口函数功能,MySQL 直到 8.0 版本才开始支持窗口函数. 窗口函数, ...
Flink 窗口函数（Window Functions）增量聚合函数
文章目录增量聚合函数(incremental aggregation functions) 归约函数(ReduceFunction) 聚合函数(AggregateFunction) 定义了窗口分配器 ...
clickhouse--Window Functions 窗口函数概念讲解及实际使用示例
1. clickhouse中的窗口函数 1.1. 功能实现情况 Window Functions 在clickhouse的需求和呼声很高,早期的版本需要借助array类函数[如groupArray, ...
3.5 Window Functions
3.5 窗口函数窗口函数针对当前行,对一组表行进行计算.这类似于使用聚合函数进行的计算.然而,窗口函数不会像非窗口函数那样将行分组聚合为一行输出,而是这些行保留各自的属性.而且,窗口函数能够不仅仅访 ...
Presto 文档学习之窗口函数（Window Functions）排名函数（Ranking Functions）
Hello!大家好,本人菜鸟一枚,最近在领导的要求下学习Prestodb,没办法只能打开官方文档来自我学习,对于英语渣的我来说无疑是痛苦的,看到的页面和天书一样谷歌翻译过来的吧很多句子看不懂想要具 ...
Flink 窗口函数（Window Functions）处理迟到数据
文章目录将迟到的数据放入侧输出流 Lambda架构:用一个流处理器,先快速的得到一个正确,近似正确的结果,然后在另外一层是一个批处理器,然后在它是一直等着的,等所有数据都到齐了,计算出一个最终准确的 ...
学习Oracle分析函数(Analytic Functions)
Oracle提供了一些功能很强大的分析函数,使用这些函数可以完成可能需要存储过程来实现的需求. 分析函数计算基于一组数据行的聚合值,它们不同于聚合函数的是,它们为每一组返回多行结果.分析函数是除ORD ...

【Clickhouse】Clickhouse 分析函数 window functions 窗口函数