一、Hive中collect_list和collect_set的区别

二、Hive中四种排序(order by、sort by、distribute by、cluster by)

2.1 order by
2.1 sort by
2.1 distribute by
2.1 cluster by

三、Hive的累加操作

四、Hive分析窗口函数

一、Hive中collect_list和collect_set的区别

Hive中collect相关的函数有collect_list和collect_set.

它们都是将分组中的某列转换为一个数组返回，不同的是collect_list不重而collect去重。

做简单的实验加深理解，创建一张实验用表，存放用户每天点播视频的记录：

create table t_visit_viedo (username string,video_name string
) partitioned by (day string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

我的本地数据目录路径：/home/hadoop/data，vi info.log：

john,PGD-585
sail,CJOD-160
john,PGD-876
sail,MIDD-799
sail,CJOD-160
fox,IPZ-218
fox,IPX-030
fox,KAWD-445

将数据加载到Hive表：

load data local inpath '/home/hadoop/data/info.log' overwrite into table t_visit_video partition(day='2019-09-18');

需求一：

按照用户分组，取出每个用户每天看过的所有视频的名字：
记住一句话：group by后面跟的词一定要在select中出现。

select username, collect_list(video_name) from t_visit_video
group by username;

这个SQL是要跑MapReduce作业的：

结果如下：

fox     ["IPZ-218","IPX-030","KAWD-445"]
john    ["PGD-585","PGD-876"]
sail    ["CJOD-160","MIDD-799","CJOD-160"]
Time taken: 224.447 seconds, Fetched: 3 row(s)

我们观察到上面的结果中，sail看的电影已经重复了；所以应该增加去重，使用collect_set，它与collect_list的区别就是增加了去重功能：

select username, collect_set(video_name) from t_visit_video
group by username;

输出结果：

fox     ["IPZ-218","IPX-030","KAWD-445"]
john    ["PGD-585","PGD-876"]
sail    ["CJOD-160","MIDD-799"]

sail的观看记录中CJOD-160只出现了一次，实现了去重效果。

突破group by限制：

我们还可以利用collect来突破group by的限制，Hive中在group by查询的时候要求出现在select后面的列都必须是出现在group by后面的，即select列必须是作为分组依据的列，但是有的时候我们想根据A进行分组然后随便取出每个分组中的一个B，代入到这个测试中就是按照用户进行分组，然后随便拿出一个他看过的视频名称即可：
video_name不是分组列，但我们以然能够取出这列中的数据

select username, collect_list(video_name)[0]
from t_visit_video
group by username;

取出分组中的第一列（下标也是从0开始的）：

fox     IPZ-218
john    PGD-585
sail    CJOD-160

二、Hive中四种排序(order by、sort by、distribute by、cluster by)

2.1 order by

语法：

order by //可以指定desc(降序)、asc(升序)

order by会对输入做全局排序，因此只有一个Reducer（多个Reducer无法保证全局有序），然而只有一个Reducer时，会导致输入规模较大时，消耗较长的计算时间。

实际举例：

1、Hive中创建表：

create table temperature (
year int,
temper float
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2、本地目录：/home/hadoop/data/下创建文件：

2019-09-09   32
2019-09-10  34
2019-09-11  28
2019-09-12  24
2019-09-13  18
2019-09-14  29
2019-09-15  32
2019-09-16  35
2019-09-17  27
2019-09-18  25
2019-09-19  30
2019-09-20  25

2.2 sort by

sort by 不是全局排序，其在数据进入reducer前完成排序，因此，如果使用sort by进行排序，并且设置mapred.reduce.tsaks>1,则sort by只会保证每个reducer的输出有序，并不能保证全局有序。
sort by不同于order by，它不受Hive.mapred.mode属性的影响，sort by的数据只能保证在同一个reduce中的数据可以按指定字段排序。使用sort by你可以指定执行的reduce个数（通过set.mapred.reduce.tasks=n来指定），对输出的数据再执行归并排序，即可得到全部结果。

//设置reduce的个数为3；
set mapred.reduce.tasks
//查询此次任务中reduce的个数：
set mapred.reduce.tasks;

SQL语句：

select * from temperature sort by year;

2.3 distribute by

distribute by是控制在map端如何拆分数据给reduce端的。hive会根据distribute by后面列，对应reduce的个数进行分发，默认采用hash算法。sort by为每一个reduce产生一个排序文件。在有些情况下，你需要控制某个特定的行应该到哪个reducer，这通常是为了进行后续的聚集操作。distribute by刚好可以做这件事情。因此，distribute by经常和sort by配合使用。

//根据年份和气温对气象数据进行排序，以确保所有具有相同年份的行最终都在一个分区reducer中

注意：DIstribute by和sort by的使用场景
1、Map输出的文件和大小不均
2、reduce输出文件大小不均
3、小文件多
4、文件大

三、Hive的累加解法

需求一：

给出一批数据，现要求出：

每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数。

数据准备：三个字段的意思分别是：用户名，月份，访问次数

A,2015-01,5
A,2015-01,15
B,2015-01,5
A,2015-01,8
B,2015-01,25
A,2015-01,5
A,2015-02,4
A,2015-02,6
B,2015-02,10
B,2015-02,5
A,2015-03,16
A,2015-03,22
B,2015-03,23
B,2015-03,10
B,2015-03,11

最后结果展示：

用户   月份      最大访问次数  总访问次数       当月访问次数
A   2015-01     33          33        33
A   2015-02     33          43        10
A   2015-03     38          81        38
B   2015-01     30          30        30
B   2015-02     30          45              15
B   2015-03     44          89        44

解答：

1、在Hive中创建一个user_visit表：

create table user_visit (
name string,
month string,
num int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

2、从本地导入数据到Hive中：

load data local inpath ‘/home/hadoop/data/test.txt’ into table user_visit;

3、求得当月访问次数

select name,month,sum(num) sn from user_visit
group by name,month;

输出结果：

+-------+----------+-----+--+
| name  |  month   | sn  |
+-------+----------+-----+--+
| A     | 2015-01  | 33  |
| A     | 2015-02  | 10  |
| A     | 2015-03  | 38  |
| B     | 2015-01  | 30  |
| B     | 2015-02  | 15  |
| B     | 2015-03  | 44  |
+-------+----------+-----+--+

4、创建一张临时表tmp_user_visit用于保存3中的数据：

create table tmp_user_visit
as
select name,month,sum(num) sn from user_visit
group by name,month;

5、根据tmp_user_visit表求出访问次数，将其连同临时表中的数据放入新表tmp_user_visitb表中

create table tmp_user_visitb
as
select name,month,sn
sum(sn) over (partition by name order by month) as zcs
from tmp_user_visit;

6、根据临时表b求出最大访问次数，将其连同tmp_user_visitb的数据放入最终表formal_user_visit中：

create table formal_user_visit
as
select name,month,max(sn) over (partition by name order by month) as maxfw,zcs,sn
from tmp_user_visitb;

四、Hive分析窗口函数（Hive做累计、分组、排序、层次等计算）

应用场景：

1）、用于分区排序
2）、动态Group By
3）、Top N
4）、累计计算
5）、层次查询

Hive中提供了越来越多的分析函数，用于完成负责的统计分析。大致可以分为以下四类：

Hive分析窗口函数（一）：SUM,AVG,MIN,MAX

This section introdeces the Hive QL enhancements for windowing and analytics functions. See “Windowing Specifications in HQL” for details. Hive-896 has more information,including links to earlier documentation in the initial comments.

All of the windowing and analytics functions operator as per the SQL standard.

The current release supports the following functions for windowing and analytics:
1、Windowing functions（窗口函数）
- LEAD
  - The number of rows to lead can optionally be specified.If the number of rows to lead is not specified, the lead is one row.
  - Returns null when the lead for the current row extends beyond the end of the window
- LAG
  - The number of rows to lag can optionally be specified. If the number of rows to lag is not specified, the lag is one row.
  - Returns null when the lag for the current row extends before the beginning of the window.
- FIRST_VALUE
  - This takes at most two parameters.The fist parameter is the column for which you want the first value, the second (optioned) parameter must be a boolean which is false by default. If set to true it skips null values.
  - LAST_VALUE
    - This takes at most two parameters, The first parameter is the column for which you want the last value, the second (optional) parameter must be a boolean which is false by default. If set to true it skips null values.

Hive中的常用函数相关推荐

Hive中String常用函数
1. 字符 ascii 码函数:ascii 语法: ascii(string str) 返回值: int 说明:返回字符串 str 中第一个字符的ascii 码举例: hive> select ...
Hive中的常用的日期函数
Hive中的日期函数to_date(string timestamp)**:返回时间字符串中的日期部分,如to_date('1970-01-01 00:00:00')='1970-01-01'curr ...
Python之pandas：pandas中数据处理常用函数(与空值相关/去重和替代)简介、具体案例、使用方法之详细攻略
Python之pandas:pandas中数据处理常用函数(与空值相关/去重和替代)简介.具体案例.使用方法之详细攻略目录 pandas中数据处理常用函数(isnull/dropna/fillna/ ...
c语言中math的库函数,C语言中math.h库中的常用函数
C语言中math.h库中的常用函数 int abs(int i) 返回整型参数i的绝对值 double cabs(struct complex znum) 返回复数znum的绝对值 double fa ...
php spl函数,PHP SPL标准库中的常用函数介绍
这篇文章主要介绍了PHP SPL标准库中的常用函数介绍,本文着重讲解了spl_autoload_extensions().spl_autoload_register().spl_autoload()三 ...
json pandas 内存溢出_pandas中的常用函数
Pandas 是基于NumPy 的一种工具,该工具是为了解决数据分析任务而创建的.Pandas 纳入了大量库和一些标准的数据模型,提供了高效地操作大型数据集所需的工具. 近年随着大数据时代的到来,产生 ...
hive 元数据自定义_如何在Hive中创建自定义函数UDF及如何直接通过Impala的同步元数据重用UDF的jar文件-阿里云开发者社区...
如何在Hive中创建自定义函数UDF及使用如何在Impala中使用Hive的自定义函数 UDF函数开发使用Intellij工具开发Hive的UDF函数,进行编译: 1.使用Intellij工具通过 ...
[C++基础]队列queue中的常用函数
[C++基础]队列queue中的常用函数本博客转载自:https://www.cnblogs.com/xuning/p/3321733.html 在C++中只要#include即可使用队列类,其中在 ...
Hive中的多维分析函数
Hive中的多维分析函数众所周知,Hive是一个OLAP数据库,拥有着很强大的数据分析能力. 但是当我们一张表的维度信息特别多,并且需要各种维度组合的统计分析结果时就很不方便.比如一张表有5个维度, ...

Hive中的常用函数

一、Hive中collect_list和collect_set的区别

需求一：

二、Hive中四种排序(order by、sort by、distribute by、cluster by)

2.1 order by

实际举例：

2.2 sort by

2.3 distribute by

三、Hive的累加解法

需求一：

最后结果展示：

解答：

四、Hive分析窗口函数（Hive做累计、分组、排序、层次等计算）

Hive中的常用函数相关推荐

最新文章

热门文章