hive经典面试题4--如何用分时数据计算分时累计数据？如何在分时累计数据为空的时候去补全数据？

背景：
在互联网公司经常会用每10分钟为维度去统计某一个指标，这种统计我们称为分时数据。例如10分钟内某个渠道的安装数据，通过这个数据可以实时查看这个渠道质量好坏，如果这个渠道质量不行，那么就停止投放，避免不必要的损失。有时候会用每10分钟累计一次（我们称为分时累计数据），通过折线图去看这个数据的走势。

分时累计数据： 例如，00:00-00:10累计一次，00:00-00:20累计一次，00:00-00:30累计一次，每隔10分钟累计一次，这就是我们所说的分时累计数据。

那么问题来了，如何用分时数据计算分时累计数据呢？

有如下分时数据（将时间戳格式化成HH:mm）(可以累加的pv数据，如果涉及到uv是不可以累加的)，怎么去统计每10分的累计数据呢？
说到累计，我们很容易就想到了sum() over()这个函数

apptypeid	hh	num
1001	00:10	8
1001	00:20	7
1001	00:40	6
1001	00:50	10
1001	01:00	12

计算逻辑如下：

spark-sql> with test_hh as > (select > '1001' as apptypeid,'00:10' as hh,8 as num> union all> select '1001' as apptypeid,'00:20' as hh,7 as num> union all> select '1001' as apptypeid,'00:40' as hh,6 as num> union all> select '1001' as apptypeid,'00:50' as hh,10 as num> union all> select '1001' as apptypeid,'01:00' as hh,12 as num> )> select >   apptypeid,>   hh,>   num,>   sum(num) over(distribute by apptypeid sort by hh) as num_total> from test_hh;
1001    00:10       8       8
1001    00:20       7       15
1001    00:40       6       21
1001    00:50       10      31
1001    01:00       12      43
Time taken: 2.864 seconds, Fetched 5 row(s)

通过上面的结果，我们可以看出num_total这一列值，是累加后的结果。但是这个结果在展示的时候，就会出现上面图中的情况，从00:20到00:40中间的点会断，就会出现直线向下的现象。

怎么去解决这个问题？

那我们是不是在00:30这个点去补一条数据就可以了，那怎么补呢？补什么数据呢？
如果我们把00:30这个点补一个0，去参与累加计算是不是就不会出现中间断点的情况了，而且不会影响最终的累加结果。

apptypeid	hh	num
1001	00:10	8
1001	00:20	7
1001	00:30	0
1001	00:40	6
1001	00:50	10
1001	01:00	12

那问题又来了，我们怎么知道要补哪个点呢？

我们出一个所有点的配置表，给每个点一个默认值0，再去关联我们的数据，如果关联不上的用我们的默认值给填充，上代码：
with的用法点击此处

spark-sql> with test_hh as > (select > '1001' as apptypeid,'00:10' as hh,8 as num> union all> select '1001' as apptypeid,'00:20' as hh,7 as num> union all> select '1001' as apptypeid,'00:40' as hh,6 as num> union all> select '1001' as apptypeid,'00:50' as hh,10 as num> union all> select '1001' as apptypeid,'01:00' as hh,12 as num> ),> --所有分时的配置表> test_hh_config as > (select > '1001' as apptypeid,'00:10' as hh> union all> select '1001' as apptypeid,'00:20' as hh> union all> select '1001' as apptypeid,'00:30' as hh> union all> select '1001' as apptypeid,'00:40' as hh> union all> select '1001' as apptypeid,'00:50' as hh> union all> select '1001' as apptypeid,'01:00' as hh> )> select >   apptypeid,>   hh,>   num,>   sum(num) over(distribute by apptypeid sort by hh) as num_total> from > (select >   apptypeid,>   hh,>   max(num) as num> from > (select >   apptypeid,>   hh,>   0 as num> from test_hh_config> > union all> select >   apptypeid,>   hh,>   num> from test_hh) t1> group by apptypeid,hh) t2;
1001    00:10       8       8
1001    00:20       7       15
1001    00:30       0       15
1001    00:40       6       21
1001    00:50       10      31
1001    01:00       12      43
Time taken: 14.713 seconds, Fetched 6 row(s)

通过分时的配置表给00:30这个点，填充了一个默认值0，正好符合我们不影响累加数据结果的要求。在页面上显示的时候，也会是趋于平滑的曲线。

注意：上面我们在关联的时候，用了union all，这个在大表关联跑不动或者内存溢出的时候，是非要有效的，也是面试经常要问的。可查看五种去重方式

问题延伸：

在实际线业务上，我们会有很多维度，例如渠道、版本、操作系统、新老用户等等一些列维度。如果去补这种多维度的分时累计的点怎么补呢？
这个时候我们就要将所有的分时数据出一张配置表，再将我们所有去重后的维度数据，跟这个分时的配置表去笛卡尔积，保证每个交叉维度，有所有的分时点，这样去关联实际数据的时候才不会出现丢失点的问题。
逻辑如下：

selectt1.apptypeid,t1.type,t1.os,t1.qid,t1.ver,t1.isnew,t2.hh,t1.dt
from
(select apptypeid,type,os,qid,ver,isnew,dt
from test.test_data_detail
where dt='vardate'
group by apptypeid,type,os,qid,ver,isnew,dt) t1cross join
(select hh
from config_center.config_hh_minute) t2;

小伙伴在实际生产中，有出现类似问题，可以给博主留言