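Hive Functions

I. Common built-in functions
- Viewing built-in functions
- Relational operators
- Arithmetic operators
- Logical operators
- Numeric functions
- Date functions
- Conditional functions
- String functions
- Aggregate and statistical functions
II. lateral view and explode, reflect, and window functions
- lateral view and explode
- Columns to rows
- Rows to columns
- The reflect function
- Window functions
III. User-defined functions
- Comparing UDF, UDAF, and UDTF
- Custom UDF
- Custom UDAF
- Custom UDTF
- Creating a temporary function
- Creating a permanent function

Hive functions fall roughly into the following three categories: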

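I. Common built-in functions

Viewing built-in functions

1) List the built-in functions:
hive> show functions;
2) Show the usage of a built-in function:
hive> desc function upper;
3) Show the detailed usage of a built-in function:
hive> desc function extended upper;

See the official Hive documentation for details.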

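Relational operators

1. Equality comparison: =
Syntax: A = B
Operand types: all primitive types
Description: TRUE if expression A equals expression B; otherwise FALSE.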

select 1 from tableName where 1=1;

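2. Inequality comparison: <>
Syntax: A <> B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is not equal to B; otherwise FALSE.

select 1 from tableName where 1 <> 2;

3. Less-than comparison: <
Syntax: A < B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is less than B; otherwise FALSE.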

select 1 from tableName where 1 < 2;

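4. Less-than-or-equal comparison: <=
Syntax: A <= B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is less than or equal to B; otherwise FALSE.

select 1 from tableName where 1 <= 1;

5. Greater-than comparison: >
Syntax: A > B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is greater than B; otherwise FALSE.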

select 1 from tableName where 2 > 1;

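6. Greater-than-or-equal comparison: >=
Syntax: A >= B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is greater than or equal to B; otherwise FALSE.

select 1 from tableName where 1 >= 1;

Note: take care when comparing strings, because they are compared character by character. For date and time values stored as strings, a common approach is to apply to_date to both sides before comparing.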

select * from tableName;
OK
2011111209 00:00:00 2011111209
select a, b, a<b, a>b, a=b from tableName;
2011111209 00:00:00 2011111209 false true false
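
The result above follows from plain lexicographic ordering: the longer string shares its prefix with the shorter one, so it sorts after it. A minimal Java sketch of the same comparison, assuming ASCII-only values (the class and method names are illustrative, not part of Hive):

```java
// Sketch: lexicographic string comparison, mirroring a<b, a>b, a=b on two STRING values.
class StringCompareDemo {
    // Returns {a < b, a > b, a = b} using String.compareTo (assumes ASCII-only input).
    static boolean[] compare(String a, String b) {
        int c = a.compareTo(b);
        return new boolean[] { c < 0, c > 0, c == 0 };
    }

    public static void main(String[] args) {
        boolean[] r = compare("2011111209 00:00:00", "2011111209");
        System.out.println(r[0] + " " + r[1] + " " + r[2]); // false true false
    }
}
```

Normalizing both sides to the same format before comparing avoids this kind of surprise.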

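7. NULL check: IS NULL
Syntax: A IS NULL
Operand types: all types
Description: TRUE if expression A is NULL; otherwise FALSE.

select 1 from tableName where null is null;

8. Non-NULL check: IS NOT NULL
Syntax: A IS NOT NULL
Operand types: all types
Description: FALSE if expression A is NULL; otherwise TRUE.

select 1 from tableName where 1 is not null;

9. LIKE comparison: LIKE
Syntax: A LIKE B
Operand types: strings
Description: NULL if string A or string B is NULL; TRUE if string A matches the simple SQL pattern B; otherwise FALSE. In B, the character "_" matches any single character and "%" matches any number of characters.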

select 1 from tableName where 'football' like 'foot%';
select 1 from tableName where 'football' like 'foot____';

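Note: to negate the comparison, use NOT A LIKE B.

select 1 from tableName where NOT 'football' like 'fff%';

10. Java regular-expression match: RLIKE
Syntax: A RLIKE B
Operand types: strings
Description: NULL if string A or string B is NULL; TRUE if string A matches the Java regular expression B; otherwise FALSE.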

select 1 from tableName where 'footbar' rlike '^f.*r$';
1

Note: to check whether a string consists only of digits:

select 1 from tableName where '123456' rlike '^\\d+$';
1
select 1 from tableName where '123456aa' rlike '^\\d+$';
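
Because RLIKE takes a Java regular expression, the digit check can be tried out directly in Java. The sketch below assumes RLIKE succeeds when the pattern is found anywhere in the string, which is why the ^ and $ anchors matter; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Sketch: emulating A RLIKE B with java.util.regex.
class RlikeDemo {
    // Assumption: RLIKE behaves like Matcher.find(), i.e. a partial match is enough.
    static boolean rlike(String a, String regex) {
        return Pattern.compile(regex).matcher(a).find();
    }

    public static void main(String[] args) {
        System.out.println(rlike("123456", "^\\d+$"));   // true: digits only
        System.out.println(rlike("123456aa", "^\\d+$")); // false: trailing letters
        System.out.println(rlike("123456aa", "\\d+"));   // true: without anchors a substring matches
    }
}
```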

11. REGEXP operator: REGEXP
Syntax: A REGEXP B
Operand types: strings
Description: Same as RLIKE.

select 1 from tableName where 'footbar' REGEXP '^f.*r$';
1

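Arithmetic operators

1. Addition: +
Syntax: A + B
Operand types: all numeric types
Description: Returns the sum of A and B. The result type is the smallest common parent type of the types of A and B (see the Hive data type hierarchy). For example, int + int generally yields int, while int + double generally yields double.

select 1 + 9 from tableName;
10

2. Subtraction: -
Syntax: A - B
Operand types: all numeric types
Description: Returns A minus B. The result type is the smallest common parent type of the types of A and B (see the Hive data type hierarchy). For example, int - int generally yields int, while int - double generally yields double.

select 10 - 5 from tableName;
5

3. Multiplication: *
Syntax: A * B
Operand types: all numeric types
Description: Returns the product of A and B. The result type is the smallest common parent type of the types of A and B (see the Hive data type hierarchy). Note that if the product exceeds the range of the default result type, use cast to move to a numeric type with a larger range.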

select 40 * 5 from tableName;
200

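4. Division: /
Syntax: A / B
Operand types: all numeric types
Description: Returns A divided by B. The result type is double.

select 40 / 5 from tableName;
8.0

Note: a double carries only about 15 to 17 significant decimal digits, so take particular care with division.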

select ceil(28.0/6.999999999999999999999) from tableName limit 1;
4
select ceil(28.0/6.99999999999999) from tableName limit 1;
5
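
The two results differ because the first divisor has more digits than a double can hold and is stored as exactly 7.0, while the second stays slightly below 7. A minimal Java sketch of the same arithmetic, assuming both literals are treated as doubles (the class and method names are illustrative):

```java
// Sketch: how double precision changes the result of ceil(a / b).
class DoubleDivisionDemo {
    static long ceilDiv(double a, double b) {
        return (long) Math.ceil(a / b);
    }

    public static void main(String[] args) {
        // 21 nines exceed double precision, so the divisor is stored as exactly 7.0.
        System.out.println(ceilDiv(28.0, 6.999999999999999999999)); // 4
        // 14 nines are representable, so the quotient is slightly above 4.
        System.out.println(ceilDiv(28.0, 6.99999999999999));        // 5
    }
}
```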

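5. Modulo: %
Syntax: A % B
Operand types: all numeric types
Description: Returns the remainder of A divided by B. The result type is the smallest common parent type of the types of A and B (see the Hive data type hierarchy).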

select 41 % 5 from tableName;
1
select 8.4 % 4 from tableName;
0.40000000000000036

Note: floating-point precision is a recurring problem in Hive; for operations like this it is best to fix the precision with round.

select round(8.4 % 4 , 2) from tableName;
0.4
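
The stray digits come from the binary representation of 8.4, not from Hive itself, and the same effect shows up in plain Java. The sketch below is only an illustration; roundTo is a hypothetical helper standing in for round(x, d).

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Sketch: floating-point remainder and rounding it to a fixed number of decimals.
class DoubleModDemo {
    // Hypothetical stand-in for round(v, d): rounds half up to d decimal places.
    static double roundTo(double v, int d) {
        return BigDecimal.valueOf(v).setScale(d, RoundingMode.HALF_UP).doubleValue();
    }

    public static void main(String[] args) {
        double r = 8.4 % 4;                 // slightly more than 0.4
        System.out.println(r == 0.4);       // false
        System.out.println(roundTo(r, 2));  // 0.4
    }
}
```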

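6. Bitwise AND: &
Syntax: A & B
Operand types: all numeric types
Description: Returns the bitwise AND of A and B. The result type is the smallest common parent type of the types of A and B (see the Hive data type hierarchy).

select 4 & 8 from tableName;
0
select 6 & 4 from tableName;
4

7. Bitwise OR: |
Syntax: A | B
Operand types: all numeric types
Description: Returns the bitwise OR of A and B. The result type is the smallest common parent type of the types of A and B (see the Hive data type hierarchy).

select 4 | 8 from tableName;
12
select 6 | 8 from tableName;
14

8. Bitwise XOR: ^
Syntax: A ^ B
Operand types: all numeric types
Description: Returns the bitwise XOR of A and B. The result type is the smallest common parent type of the types of A and B (see the Hive data type hierarchy).

select 4 ^ 8 from tableName;
12
select 6 ^ 4 from tableName;
2

9. Bitwise NOT: ~
Syntax: ~A
Operand types: all numeric types
Description: Returns the bitwise NOT of A. The result type is the type of A.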

select ~6 from tableName;
-7
select ~4 from tableName;
-5

Logical operators

1. Logical AND: AND
Syntax: A AND B
Operand types: boolean
Description: TRUE if both A and B are TRUE; otherwise FALSE. NULL if A or B is NULL.

select 1 from tableName where 1=1 and 2=2;
1

2. Logical OR: OR
Syntax: A OR B
Operand types: boolean
Description: TRUE if A is TRUE, B is TRUE, or both are TRUE; otherwise FALSE.

select 1 from tableName where 1=2 or 2=2;
1

3. Logical NOT: NOT
Syntax: NOT A
Operand types: boolean
Description: TRUE if A is FALSE; NULL if A is NULL; otherwise FALSE.

select 1 from tableName where not 1=2;
1

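Numeric functions

1. Rounding function: round ***
Syntax: round(double a)
Return type: BIGINT
Description: Returns the integer part of a double, rounded half up.

select round(3.1415926) from tableName;
3
select round(3.5) from tableName;
4

2. Rounding to a given precision: round ***
Syntax: round(double a, int d)
Return type: DOUBLE
Description: Returns a rounded to d decimal places, as a double.

select round(3.1415926,4) from tableName;
3.1416

3. Round-down function: floor ***
Syntax: floor(double a)
Return type: BIGINT
Description: Returns the largest integer less than or equal to a.

select floor(3.1415926) from tableName;
3

4. Round-up function: ceil ***
Syntax: ceil(double a)
Return type: BIGINT
Description: Returns the smallest integer greater than or equal to a.

select ceil(3.1415926) from tableName;
4

5. Round-up function: ceiling ***
Syntax: ceiling(double a)
Return type: BIGINT
Description: Same as ceil.

select ceiling(3.1415926) from tableName;
4

6. Random number function: rand ***
Syntax: rand(), rand(int seed)
Return type: double
Description: Returns a random number between 0 and 1. If a seed is given, the sequence of random numbers is reproducible.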

select rand() from tableName;
0.5577432776034763
select rand(100) from tableName;
0.7220096548596434

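7. Natural exponential function: exp
Syntax: exp(double a)
Return type: double
Description: Returns e, the base of the natural logarithm, raised to the power a.

select exp(2) from tableName;
7.38905609893065

Natural logarithm function: ln
Syntax: ln(double a)
Return type: double
Description: Returns the natural logarithm of a.

8. Base-10 logarithm function: log10
Syntax: log10(double a)
Return type: double
Description: Returns the base-10 logarithm of a.

select log10(100) from tableName;
2.0

9. Base-2 logarithm function: log2
Syntax: log2(double a)
Return type: double
Description: Returns the base-2 logarithm of a.

select log2(8) from tableName;
3.0

10. Logarithm function: log
Syntax: log(double base, double a)
Return type: double
Description: Returns the logarithm of a to the given base.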

select log(4,256) from tableName;
4.0

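11. Power function: pow
Syntax: pow(double a, double p)
Return type: double
Description: Returns a raised to the power p.

select pow(2,4) from tableName;
16.0

12. Power function: power
Syntax: power(double a, double p)
Return type: double
Description: Returns a raised to the power p; same as pow.

select power(2,4) from tableName;
16.0

13. Square root function: sqrt
Syntax: sqrt(double a)
Return type: double
Description: Returns the square root of a.

select sqrt(16) from tableName;
4.0

14. Binary function: bin
Syntax: bin(BIGINT a)
Return type: string
Description: Returns the binary representation of a.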

select bin(7) from tableName;
111

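15. Hexadecimal function: hex
Syntax: hex(BIGINT a)
Return type: string
Description: If the argument is an integer, returns its hexadecimal representation; if it is a string, returns the hexadecimal representation of the string's characters.

select hex(17) from tableName;
11
select hex('abc') from tableName;
616263

16. Inverse hexadecimal function: unhex
Syntax: unhex(string a)
Return type: string
Description: Returns the string represented by the hexadecimal string a.

select unhex('616263') from tableName;
abc
select unhex('11') from tableName;
-
select unhex(616263) from tableName;
abc

17. Base conversion function: conv
Syntax: conv(BIGINT num, int from_base, int to_base)
Return type: string
Description: Converts num from base from_base to base to_base.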

select conv(17,10,16) from tableName;
11
select conv(17,10,2) from tableName;
10001

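18. Absolute value function: abs
Syntax: abs(double a), abs(int a)
Return type: double, int
Description: Returns the absolute value of a.

select abs(-3.9) from tableName;
3.9

19. Positive modulo function: pmod
Syntax: pmod(int a, int b), pmod(double a, double b)
Return type: int, double
Description: Returns the positive remainder of a divided by b.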

select pmod(9,4) from tableName;
1
select pmod(-9,4) from tableName;
3
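
The difference between pmod and % only shows for negative operands. A minimal Java sketch, assuming pmod can be modeled as ((a % b) + b) % b for a positive b (the class and method names are illustrative):

```java
// Sketch: positive modulo versus the plain remainder operator.
class PmodDemo {
    // Assumed formula for pmod(a, b); for b > 0 it equals Math.floorMod(a, b).
    static int pmod(int a, int b) {
        return ((a % b) + b) % b;
    }

    public static void main(String[] args) {
        System.out.println(-9 % 4);      // -1: the remainder keeps the sign of the dividend
        System.out.println(pmod(-9, 4)); // 3
        System.out.println(pmod(9, 4));  // 1
    }
}
```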

20. Sine function: sin
Syntax: sin(double a)
Return type: double
Description: Returns the sine of a.

select sin(0.8) from tableName;
0.7173560908995228

21. Arcsine function: asin
Syntax: asin(double a)
Return type: double
Description: Returns the arcsine of a.

select asin(0.7173560908995228) from tableName;
0.8

22. Cosine function: cos
Syntax: cos(double a)
Return type: double
Description: Returns the cosine of a.

select cos(0.9) from tableName;
0.6216099682706644

23. Arccosine function: acos
Syntax: acos(double a)
Return type: double
Description: Returns the arccosine of a.

select acos(0.6216099682706644) from tableName;
0.9

24. positive function: positive
Syntax: positive(int a), positive(double a)
Return type: int, double
Description: Returns a unchanged.

select positive(-10) from tableName;
-10

25. negative function: negative
Syntax: negative(int a), negative(double a)
Return type: int, double
Description: Returns -a.

select negative(-5) from tableName;
5
select negative(8) from tableName;
-8

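Date functions

1. Unix timestamp to date: from_unixtime ***
Syntax: from_unixtime(bigint unixtime[, string format])
Return type: string
Description: Converts a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC) to a time string in the current time zone.

select from_unixtime(1323308943,'yyyyMMdd') from tableName;
20111208

2. Current Unix timestamp: unix_timestamp ***
Syntax: unix_timestamp()
Return type: bigint
Description: Returns the current Unix timestamp in seconds.

select unix_timestamp() from tableName;
1323309615

3. Date to Unix timestamp: unix_timestamp ***
Syntax: unix_timestamp(string date)
Return type: bigint
Description: Converts a date in the format "yyyy-MM-dd HH:mm:ss" to a Unix timestamp. Returns 0 if the conversion fails.

select unix_timestamp('2011-12-07 13:01:03') from tableName;
1323234063

4. Date in a given format to Unix timestamp: unix_timestamp ***
Syntax: unix_timestamp(string date, string pattern)
Return type: bigint
Description: Converts a date in the format given by pattern to a Unix timestamp. Returns 0 if the conversion fails.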

select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from tableName;
1323234063
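
The sample values above are consistent with a session time zone of UTC+8. The Java sketch below reproduces both directions of the conversion under that assumption; the fixed offset and the class and method names are assumptions made for illustration, not something Hive exposes.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch: converting between date strings and Unix timestamps.
class UnixTimeDemo {
    // Assumption: the session time zone is UTC+8, which matches the sample outputs.
    static final ZoneOffset ZONE = ZoneOffset.ofHours(8);

    // Counterpart of unix_timestamp(date, pattern).
    static long unixTimestamp(String date, String pattern) {
        return LocalDateTime.parse(date, DateTimeFormatter.ofPattern(pattern)).toEpochSecond(ZONE);
    }

    // Counterpart of from_unixtime(seconds, pattern).
    static String fromUnixtime(long seconds, String pattern) {
        return LocalDateTime.ofEpochSecond(seconds, 0, ZONE).format(DateTimeFormatter.ofPattern(pattern));
    }

    public static void main(String[] args) {
        System.out.println(unixTimestamp("2011-12-07 13:01:03", "yyyy-MM-dd HH:mm:ss")); // 1323234063
        System.out.println(unixTimestamp("20111207 13:01:03", "yyyyMMdd HH:mm:ss"));     // 1323234063
        System.out.println(fromUnixtime(1323308943L, "yyyyMMdd"));                       // 20111208
    }
}
```

With a different session time zone the same timestamp maps to a different wall-clock time, so the string results change accordingly.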

5. Date-time to date: to_date ***
Syntax: to_date(string timestamp)
Return type: string
Description: Returns the date part of a date-time value.

select to_date('2011-12-08 10:03:01') from tableName;
2011-12-08

6. Date to year: year ***
Syntax: year(string date)
Return type: int
Description: Returns the year of the date.

select year('2011-12-08 10:03:01') from tableName;
2011
select year('2012-12-08') from tableName;
2012

7. Date to month: month ***
Syntax: month(string date)
Return type: int
Description: Returns the month of the date.

select month('2011-12-08 10:03:01') from tableName;
12

8. Date to day: day ****
Syntax: day(string date)
Return type: int
Description: Returns the day of the month.

select day('2011-12-08 10:03:01') from tableName;
8

9. Date to hour: hour ***
Syntax: hour(string date)
Return type: int
Description: Returns the hour of the date-time.

select hour('2011-12-08 10:03:01') from tableName;
10

10. Date to minute: minute
Syntax: minute(string date)
Return type: int
Description: Returns the minute of the date-time.

select minute('2011-12-08 10:03:01') from tableName;
3

11. Date to second: second
Syntax: second(string date)
Return type: int
Description: Returns the second of the date-time.

select second('2011-12-08 10:03:01') from tableName;
1

12. Date to week number: weekofyear
Syntax: weekofyear(string date)
Return type: int
Description: Returns the week of the year in which the date falls.

select weekofyear('2011-12-08 10:03:01') from tableName;
49

13. Date difference function: datediff ***
Syntax: datediff(string enddate, string startdate)
Return type: int
Description: Returns the number of days from startdate to enddate.

select datediff('2012-12-08','2012-05-09') from tableName;
213

14. Date addition function: date_add ***
Syntax: date_add(string startdate, int days)
Return type: string
Description: Returns the date that is days days after startdate.

select date_add('2012-12-08',10) from tableName;
2012-12-18

15. Date subtraction function: date_sub ***
Syntax: date_sub(string startdate, int days)
Return type: string
Description: Returns the date that is days days before startdate.

select date_sub('2012-12-08',10) from tableName;
2012-11-28
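
The three date-arithmetic results above can be cross-checked with java.time. This is only a sketch of the calendar arithmetic, not of how Hive implements these functions; the class and method names are illustrative.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Sketch: calendar arithmetic matching datediff, date_add and date_sub on yyyy-MM-dd strings.
class DateArithmeticDemo {
    static long datediff(String endDate, String startDate) {
        return ChronoUnit.DAYS.between(LocalDate.parse(startDate), LocalDate.parse(endDate));
    }

    static String dateAdd(String startDate, int days) {
        return LocalDate.parse(startDate).plusDays(days).toString();
    }

    static String dateSub(String startDate, int days) {
        return LocalDate.parse(startDate).minusDays(days).toString();
    }

    public static void main(String[] args) {
        System.out.println(datediff("2012-12-08", "2012-05-09")); // 213
        System.out.println(dateAdd("2012-12-08", 10));            // 2012-12-18
        System.out.println(dateSub("2012-12-08", 10));            // 2012-11-28
    }
}
```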

Conditional functions

1. If function: if ***
Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Return type: T
Description: Returns valueTrue when testCondition is TRUE; otherwise returns valueFalseOrNull.

select if(1=2,100,200) from tableName;
200
select if(1=1,100,200) from tableName;
100

2. First non-NULL function: COALESCE
Syntax: COALESCE(T v1, T v2, ...)
Return type: T
Description: Returns the first non-NULL argument; returns NULL if all arguments are NULL.

select COALESCE(null,'100','50') from tableName;
100

3. Conditional expression: CASE ***
Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Return type: T
Description: Returns c if a equals b; returns e if a equals d; otherwise returns f.

Select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary
Select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
tim

4. Conditional expression: CASE ****
Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Return type: T
Description: Returns b if a is TRUE; returns d if c is TRUE; otherwise returns e.

select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary
select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
tom

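String functions

1. String length function: length
Syntax: length(string A)
Return type: int
Description: Returns the length of string A.

select length('abcedfg') from tableName;
7

2. String reverse function: reverse
Syntax: reverse(string A)
Return type: string
Description: Returns string A reversed.

select reverse('abcedfg') from tableName;
gfdecba

3. String concatenation function: concat ***
Syntax: concat(string A, string B...)
Return type: string
Description: Returns the concatenation of the input strings; any number of input strings is supported.

select concat('abc','def','gh') from tableName;
abcdefgh

4. Concatenation with separator: concat_ws ***
Syntax: concat_ws(string SEP, string A, string B...)
Return type: string
Description: Returns the concatenation of the input strings, with SEP as the separator between them.

select concat_ws(',','abc','def','gh')from tableName;
abc,def,gh

5. Substring function: substr, substring ****
Syntax: substr(string A, int start), substring(string A, int start)
Return type: string
Description: Returns the part of string A from position start to the end.

select substr('abcde',3) from tableName;
cde
select substring('abcde',3) from tableName;
cde
select substr('abcde',-1) from tableName; -- same as Oracle
e

6. Substring function: substr, substring ****
Syntax: substr(string A, int start, int len), substring(string A, int start, int len)
Return type: string
Description: Returns the part of string A that starts at position start and has length len.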

select substr('abcde',3,2) from tableName;
cd
select substring('abcde',3,2) from tableName;
cd
select substring('abcde',-2,2) from tableName;
de

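7. Uppercase function: upper, ucase ****
Syntax: upper(string A), ucase(string A)
Return type: string
Description: Returns string A in uppercase.

select upper('abSEd') from tableName;
ABSED
select ucase('abSEd') from tableName;
ABSED

8. Lowercase function: lower, lcase ***
Syntax: lower(string A), lcase(string A)
Return type: string
Description: Returns string A in lowercase.

select lower('abSEd') from tableName;
absed
select lcase('abSEd') from tableName;
absed

9. Trim function: trim ***
Syntax: trim(string A)
Return type: string
Description: Removes the spaces on both sides of the string.

select trim(' abc ') from tableName;
abc

10. Left trim function: ltrim
Syntax: ltrim(string A)
Return type: string
Description: Removes the spaces on the left side of the string.

select ltrim(' abc ') from tableName;
abc

11. Right trim function: rtrim
Syntax: rtrim(string A)
Return type: string
Description: Removes the spaces on the right side of the string.

select rtrim(' abc ') from tableName;
abc

12. Regular-expression replace function: regexp_replace
Syntax: regexp_replace(string A, string B, string C)
Return type: string
Description: Replaces the parts of string A that match the Java regular expression B with C. Note that escape characters are needed in some cases. Similar to the regexp_replace function in Oracle.

select regexp_replace('foobar', 'oo|ar', '') from tableName;
fb

13. Regular-expression extract function: regexp_extract
Syntax: regexp_extract(string subject, string pattern, int index)
Return type: string
Description: Matches subject against the regular expression pattern and returns the capture group selected by index.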

select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from tableName;
the
select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from tableName;
bar
select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from tableName;
foothebar
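
The index argument corresponds to a capture group of the Java regular expression, with 0 meaning the whole match. A small Java sketch of that behavior; returning an empty string when nothing matches is an assumption of this sketch, and the class and method names are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: capture-group extraction in the style of regexp_extract(subject, pattern, index).
class RegexpExtractDemo {
    static String regexpExtract(String subject, String pattern, int index) {
        Matcher m = Pattern.compile(pattern).matcher(subject);
        // Assumption: no match yields an empty string.
        return m.find() ? m.group(index) : "";
    }

    public static void main(String[] args) {
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 1)); // the
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 2)); // bar
        System.out.println(regexpExtract("foothebar", "foo(.*?)(bar)", 0)); // foothebar
    }
}
```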

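Note that escaping is needed in some cases. In the query below the equals sign is escaped with a double backslash, following Java regular-expression rules.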

select data_field,
regexp_extract(data_field,'.*?bgStart\\=([^&]+)',1) as aaa,
regexp_extract(data_field,'.*?contentLoaded_headStart\\=([^&]+)',1) as bbb,
regexp_extract(data_field,'.*?AppLoad2Req\\=([^&]+)',1) as ccc
from pt_nginx_loginlog_st
where pt = '2012-03-26' limit 2;

14. URL parsing function: parse_url ****
Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])
Return type: string
Description: Returns the specified part of a URL. Valid values for partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.

select parse_url ('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') from tableName;
www.tableName.com
select parse_url ('https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') from tableName;
v1
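
The HOST and QUERY parts map closely onto the components of java.net.URI, which makes the expected results easy to check. The sketch below is an illustration only, not Hive's implementation; the class and method names are hypothetical.

```java
import java.net.URI;

// Sketch: extracting the host and a single query parameter from a URL.
class ParseUrlDemo {
    static String host(String url) {
        return URI.create(url).getHost();
    }

    // Returns the value of the given query key, or null if the key is absent.
    static String queryValue(String url, String key) {
        String query = URI.create(url).getQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals(key)) {
                return pair.substring(eq + 1);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String url = "https://www.tableName.com/path1/p.php?k1=v1&k2=v2#Ref1";
        System.out.println(host(url));             // www.tableName.com
        System.out.println(queryValue(url, "k1")); // v1
    }
}
```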

15. JSON parsing function: get_json_object ****
Syntax: get_json_object(string json_string, string path)
Return type: string
Description: Parses the JSON string json_string and returns the content selected by path. Returns NULL if the input JSON string is invalid.

select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} },"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;
amy

16. Space string function: space
Syntax: space(int n)
Return type: string
Description: Returns a string of n spaces.

select space(10) from tableName;
select length(space(10)) from tableName;
10

17. Repeat function: repeat ***
Syntax: repeat(string str, int n)
Return type: string
Description: Returns str repeated n times.

select repeat('abc',5) from tableName;
abcabcabcabcabc

18. First-character ASCII function: ascii
Syntax: ascii(string str)
Return type: int
Description: Returns the ASCII code of the first character of str.

select ascii('abcde') from tableName;
97

19. Left pad function: lpad
Syntax: lpad(string str, int len, string pad)
Return type: string
Description: Pads str on the left with pad to a length of len.

select lpad('abc',10,'td') from tableName;
tdtdtdtabc

Note: unlike Greenplum and Oracle, the pad argument cannot be omitted.

20. Right pad function: rpad
Syntax: rpad(string str, int len, string pad)
Return type: string
Description: Pads str on the right with pad to a length of len.

select rpad('abc',10,'td') from tableName;
abctdtdtdt

21. Split function: split ****
Syntax: split(string str, string pat)
Return type: array
Description: Splits str around matches of pat and returns the resulting array of strings.

select split('abtcdtef','t') from tableName;
["ab","cd","ef"]

22. Set lookup function: find_in_set
Syntax: find_in_set(string str, string strList)
Return type: int
Description: Returns the position of the first occurrence of str in strList, where strList is a comma-separated string. Returns 0 if str is not found.

select find_in_set('ab','ef,ab,de') from tableName;
2
select find_in_set('at','ef,ab,de') from tableName;
0

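Aggregate and statistical functions

1. Count function: count ***
Syntax: count(*), count(expr), count(DISTINCT expr[, expr...])
Return type: bigint
Description: count(*) returns the number of retrieved rows, including rows that contain NULL values; count(expr) returns the number of rows in which expr is not NULL; count(DISTINCT expr[, expr...]) returns the number of distinct non-NULL values of the given fields.

select count(*) from tableName;
20
select count(distinct t) from tableName;
10

2. Sum function: sum ***
Syntax: sum(col), sum(DISTINCT col)
Return type: double
Description: sum(col) returns the sum of col over the result set; sum(DISTINCT col) returns the sum of the distinct values of col.

select sum(t) from tableName;
select sum(distinct t) from tableName;

3. Average function: avg ***
Syntax: avg(col), avg(DISTINCT col)
Return type: double
Description: avg(col) returns the average of col over the result set; avg(DISTINCT col) returns the average of the distinct values of col.

select avg(t) from tableName;
select avg(distinct t) from tableName;

4. Minimum function: min ***
Syntax: min(col)
Return type: double
Description: Returns the minimum value of col in the result set.

select min(t) from tableName;

5. Maximum function: max ***
Syntax: max(col)
Return type: double
Description: Returns the maximum value of col in the result set.

select max(t) from tableName;

6. Population variance function: var_pop
Syntax: var_pop(col)
Return type: double
Description: Returns the population variance of the non-NULL values of col (NULLs are ignored).

7. Sample variance function: var_samp
Syntax: var_samp(col)
Return type: double
Description: Returns the sample variance of the non-NULL values of col (NULLs are ignored).

8. Population standard deviation function: stddev_pop
Syntax: stddev_pop(col)
Return type: double
Description: Returns the population standard deviation, which equals the square root of the population variance returned by var_pop.

9. Sample standard deviation function: stddev_samp
Syntax: stddev_samp(col)
Return type: double
Description: Returns the sample standard deviation.

10. Percentile function: percentile
Syntax: percentile(BIGINT col, p)
Return type: double
Description: Returns the exact p-th percentile; p must be between 0 and 1. The col field currently supports only integer types, not floating-point types. With p = 0.5 the result is the median.

11. Percentile function: percentile
Syntax: percentile(BIGINT col, array(p1 [, p2]...))
Return type: array
Description: Same as above, but accepts several percentiles and returns an array with the corresponding values.

select percentile(score, array(0.2,0.4)) from tableName; -- values at the 0.2 and 0.4 percentiles

12. Approximate percentile function: percentile_approx
Syntax: percentile_approx(DOUBLE col, p [, B])
Return type: double
Description: Returns an approximate p-th percentile; p must be between 0 and 1. The result is a double, and col may be a floating-point type. The parameter B trades memory for accuracy: the larger B is, the more accurate the result. The default is 10,000. When the number of distinct values in col is smaller than B, the result is the exact percentile.

13. Approximate percentile function: percentile_approx
Syntax: percentile_approx(DOUBLE col, array(p1 [, p2]...) [, B])
Return type: array
Description: Same as above, but accepts several percentiles and returns an array with the corresponding values.

14. Histogram: histogram_numeric
Syntax: histogram_numeric(col, b)
Return type: array<struct {'x','y'}>
Description: Computes a histogram of col using b bins.

select histogram_numeric(100,5) from tableName;
[{"x":100.0,"y":1.0}]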

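II. lateral view and explode, reflect, and window functions

lateral view and explode

lateral view is used together with UDTFs such as explode (often combined with split) to turn one row into several rows, which can then be aggregated. lateral view first applies the UDTF to each row of the base table, the UDTF expands that row into one or more rows, and lateral view then combines the results into a virtual table that supports a table alias.
explode can also expand a complex array or map column in Hive into multiple rows.

Data preparation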

Marry sing#dance height:165cm#weight:55kg [{"province":"ShangHai","city":"ShangHai"},{"province":"ZheJiang","city":"HangZhou"}]
Bob read#binge-watching height:175cm#weight:65kg [{"province":"JangSu","city":"SuZhou"},{"province":"ShangHai","city":"ShangHai"}]

Requirement: split every element of hoppy into its own row, and split the keys and values of the map into separate key and value columns.

drop table if exists student_info;
create table if not exists student_info (
stu_name string comment 'name',
hoppy array<string> comment 'hobbies',
base_info map<string,string> comment 'basic info',
address string comment 'address'
)
comment 'student info table'
row format delimited fields terminated by ' '
collection items terminated by '#'
map keys terminated by ':'
stored as textFile;
load data local inpath '/home/huangwei/input/students_base_info.txt' into table student_info;

Using explode on an array

select explode(hoppy) as hoppies from student_info;
+-----------------+
|     hoppies     |
+-----------------+
| sing            |
| dance           |
| read            |
| binge-watching  |
+-----------------+

If the column was not declared as an array when the table was created, split it first:

select explode(split(hoppy,'#')) as hoppies from student_info;

Using explode on a map

select explode(base_info) as (info_key,info_value) from student_info;
+-----------+-------------+
| info_key  | info_value  |
+-----------+-------------+
| height    | 165cm       |
| weight    | 55kg        |
| height    | 175cm       |
| weight    | 65kg        |
+-----------+-------------+

Using explode on JSON
Among the built-in functions there are two for parsing JSON: get_json_object and json_tuple.

select get_json_object('{"province":"ShangHai","city":"ShangHai"}', '$.province');
+-----------+
|    _c0    |
+-----------+
| ShangHai  |
+-----------+

select json_tuple('{"province":"ShangHai","city":"ShangHai"}', 'province','city');
+-----------+-----------+
|    c0     |    c1     |
+-----------+-----------+
| ShangHai  | ShangHai  |
+-----------+-----------+

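The advantage of json_tuple over get_json_object is that it can extract several JSON fields in one call. Neither function, however, handles a JSON array well; get_json_object has only limited support for arrays. Nesting explode inside them to split the array does not work either, because a UDTF cannot be nested in another expression: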

select get_json_object(explode(split(regexp_replace(regexp_replace(address,'\\[\\{',''),'}]',''),'},\\{')),'$.province') as address_info from student_info;
Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)

select json_tuple(explode(split(regexp_replace(regexp_replace(address,'\\[\\{',''),'}]',''),'},\\{')),'province','city') as address_info from student_info;
Error: Error while compiling statement: FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions (state=42000,code=10081)

Using explode together with LATERAL VIEW
Querying two fields

select hoppies,address from student_info lateral view explode(hoppy)hoppy as hoppies;
+-----------------+---------------------------------------------------------------------------------------+
|     hoppies     |                                        address                                        |
+-----------------+---------------------------------------------------------------------------------------+
| sing            | [{"province":"ShangHai","city":'ShangHai'},{"province":"ZheJiang","city":'HangZhou'}] |
| dance           | [{"province":"ShangHai","city":'ShangHai'},{"province":"ZheJiang","city":'HangZhou'}] |
| read            | [{"province":"JangSu","city":'SuZhou'},{"province":"ShangHai","city":'ShangHai'}]     |
| binge-watching  | [{"province":"JangSu","city":'SuZhou'},{"province":"ShangHai","city":'ShangHai'}]     |
+-----------------+---------------------------------------------------------------------------------------+

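lateral view explode(hoppy) hoppy as hoppies acts as a virtual table that is joined to the source table student_info: each source row is combined with the rows that explode produces from it.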
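
The row expansion can be pictured as a nested loop over each source row and its array elements. The Java sketch below mimics that on the two sample students; it is a conceptual illustration only, and the class, the method names and the in-memory map are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: what lateral view explode does conceptually, one output row per (row, element) pair.
class LateralViewDemo {
    // Hypothetical sample data: student name mapped to the hoppy array.
    static Map<String, List<String>> sample() {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("Marry", Arrays.asList("sing", "dance"));
        rows.put("Bob", Arrays.asList("read", "binge-watching"));
        return rows;
    }

    // Combines every source row with each element of its own array.
    static List<String> explode(Map<String, List<String>> rows) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            for (String hobby : row.getValue()) {
                out.add(row.getKey() + "\t" + hobby);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        explode(sample()).forEach(System.out::println); // four rows, two per student
    }
}
```
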
Parsing JSON

select get_json_object(concat('{',address_info,'}'),'$.province') as province,
get_json_object(concat('{',address_info,'}'),'$.city') as city from student_info
lateral view explode(split(regexp_replace(regexp_replace(address,'\\[\\{',''),'}]',''),'},\\{'))address as address_info;
+-----------+-----------+
| province  |   city    |
+-----------+-----------+
| ShangHai  | ShangHai  |
| ZheJiang  | HangZhou  |
| JangSu    | SuZhou    |
| ShangHai  | ShangHai  |
+-----------+-----------+

select json_tuple(address_info, 'province', 'city') from
(select explode(split(regexp_replace(regexp_replace(address,'\\[|\\]',''),'\\}\\,\\{','\\}\\;\\{'),'\\;')) as address_info from student_info) tmpview ;
+-----------+-----------+
|    c0     |    c1     |
+-----------+-----------+
| ShangHai  | ShangHai  |
| ZheJiang  | HangZhou  |
| JangSu    | SuZhou    |
| ShangHai  | ShangHai  |
+-----------+-----------+

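explode accepts only array or map values, and split returns an array.
The inner regexp_replace removes the square brackets around the JSON array.
The outer regexp_replace replaces the commas between the array elements with semicolons, so that split can separate the elements.

Parsing a JSON array with a custom UDF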

import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.json.JSONArray;
import org.json.JSONException;

import java.util.ArrayList;

@Description(name = "json_array",
        value = "_FUNC_(array_string) - Convert a string of a JSON-encoded array to a Hive array of strings.")
public class UDFJsonAsArray extends UDF {

    // Returns each element of the JSON array as a string,
    // or null if the input is null or is not a valid JSON array.
    public ArrayList<String> evaluate(String jsonString) {
        if (jsonString == null) {
            return null;
        }
        try {
            JSONArray extractObject = new JSONArray(jsonString);
            ArrayList<String> result = new ArrayList<String>();
            for (int ii = 0; ii < extractObject.length(); ++ii) {
                result.add(extractObject.get(ii).toString());
            }
            return result;
        } catch (JSONException e) {
            return null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}

Columns to rows

Rows to columns

Data preparation:

F,13    LiSa|Mali
M,15    Marry|Bob
M,17    KangKang|Lon

Create the table and load the data

drop table if exists people_base;
create table if not exists people_base (
base string,
name_list string
)
row format delimited fields terminated by '\t';
load data local inpath '/home/huangwei/input/people_base.txt' into table people_base;

Requirement: split the names in each row into separate rows

select tmp_view.base_info[0] as gender,tmp_view.base_info[1] as age,name from (select name_list,split(base,',') as base_info from people_base) tmp_view lateral view explode(split(name_list,'\\|')) test as name;
+---------+------+-----------+
| gender  | age  |   name    |
+---------+------+-----------+
| F       | 13   | LiSa      |
| F       | 13   | Mali      |
| M       | 15   | Marry     |
| M       | 15   | Bob       |
| M       | 17   | KangKang  |
| M       | 17   | Lon       |
+---------+------+-----------+

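The reflect function

The reflect function calls methods of Java's built-in classes directly from SQL, which can often replace a simple UDF.

select reflect('java.lang.Math','max',2,3);

Window functions

Hive also provides many window and analytic functions, used mainly in the following scenarios:

(1) Sorting within partitions
(2) Dynamic GROUP BY
(3) Top N
(4) Cumulative calculations
(5) Hierarchical queries

Data preparation: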

zhangsan,1,new,67.1,2
lisi,2,old,43.32,1
wagner,3,new,88.88,3
liliu,4,new,66.0,1
qiuba,5,new,54.32,1
wangshi,6,old,77.77,2
liwei,7,old,88.44,3
wutong,8,new,56.55,6
lilisi,9,new,88.88,5
qishili,10,new,66.66,5

Create the table and load the data:

drop table if exists order_detail;
create table if not exists order_detail (
user_id string,
device_id string,
user_type string,
price double,
sales int
)
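row format delimited fields terminated by ',';
load data local inpath '/home/huangwei/input/order_info.txt' into table order_detail;

Window functions

FIRST_VALUE: the first value in the partition after sorting, up to the current row
LAST_VALUE: the last value in the partition after sorting, up to the current row
LEAD(col,n,DEFAULT): the value of col n rows after the current row within the window. The first argument is the column, the second is the number of rows to look ahead (optional, default 1), and the third is the default value used when the row n rows ahead is NULL (NULL if not specified)
LAG(col,n,DEFAULT): the opposite of LEAD, the value of col n rows before the current row within the window. The first argument is the column, the second is the number of rows to look back (optional, default 1), and the third is the default value used when the row n rows back is NULL (NULL if not specified)

The OVER clause

1. With the standard aggregate functions COUNT, SUM, MIN, MAX, and AVG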

select user_id,sum(sales) over() from order_detail;
+-----------+---------------+
|  user_id  | sum_window_0  |
+-----------+---------------+
| qishili   | 29            |
| lilisi    | 29            |
| wutong    | 29            |
| liwei     | 29            |
| wangshi   | 29            |
| qiuba     | 29            |
| liliu     | 29            |
| wagner    | 29            |
| lisi      | 29            |
| zhangsan  | 29            |
+-----------+---------------+

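2. With a PARTITION BY clause on one or more columns of primitive data types
This is also called the query partition clause; the function before over() is evaluated within each partition.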

select user_id,user_type,sum(sales) over(partition by user_type) from order_detail;
+-----------+------------+---------------+
|  user_id  | user_type  | sum_window_0  |
+-----------+------------+---------------+
| qishili   | new        | 23            |
| lilisi    | new        | 23            |
| wutong    | new        | 23            |
| qiuba     | new        | 23            |
| liliu     | new        | 23            |
| wagner    | new        | 23            |
| zhangsan  | new        | 23            |
| liwei     | old        | 6             |
| wangshi   | old        | 6             |
| lisi      | old        | 6             |
+-----------+------------+---------------+

3. With PARTITION BY and ORDER BY, using one or more partitioning or ordering columns of any data type

select user_id,user_type,sales,sum(sales) over(partition by user_type order by sales) from order_detail;
+-----------+------------+--------+---------------+
|  user_id  | user_type  | sales  | sum_window_0  |
+-----------+------------+--------+---------------+
| qiuba     | new        | 1      | 2             |
| liliu     | new        | 1      | 2             |
| zhangsan  | new        | 2      | 4             |
| wagner    | new        | 3      | 7             |
| qishili   | new        | 5      | 17            |
| lilisi    | new        | 5      | 17            |
| wutong    | new        | 6      | 23            |
| lisi      | old        | 1      | 1             |
| wangshi   | old        | 2      | 3             |
| liwei     | old        | 3      | 6             |
+-----------+------------+--------+---------------+

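4. With a window specification. The following formats are supported:

(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)

(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)

(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
When ORDER BY is given without a window clause, the window specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
The OVER clause supports the following functions, but they cannot be combined with a window specification.

Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.

PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no bound (UNBOUNDED PRECEDING starts at the first row of the partition; UNBOUNDED FOLLOWING ends at the last row)

select user_id,user_type,sales,
sum(sales) over() sample1, -- sum over all rows
sum(sales) over(partition by user_type) sample2, -- sum per user_type
sum(sales) over(partition by user_type order by sales) sample3, -- running sum per user_type; rows with equal sales share one value
sum(sales) over(partition by user_type order by sales rows between unbounded preceding and current row) sample4, -- running sum row by row
sum(sales) over(partition by user_type order by sales rows between 1 preceding and current row) sample5, -- current row plus the previous row
sum(sales) over(partition by user_type order by sales rows between 1 preceding and 1 following) sample6, -- previous, current, and next row
sum(sales) over(partition by user_type order by sales rows between current row and unbounded following) sample7 -- current row through the end of the partition
from order_detail;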
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+
|  user_id  | user_type  | sales  | sample1  | sample2  | sample3  | sample4  | sample5  | sample6  | sample7  |
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+
| qiuba     | new        | 1      | 29       | 23       | 2        | 1        | 1        | 2        | 23       |
| liliu     | new        | 1      | 29       | 23       | 2        | 2        | 2        | 4        | 22       |
| zhangsan  | new        | 2      | 29       | 23       | 4        | 4        | 3        | 6        | 21       |
| wagner    | new        | 3      | 29       | 23       | 7        | 7        | 5        | 10       | 19       |
| qishili   | new        | 5      | 29       | 23       | 17       | 12       | 8        | 13       | 16       |
| lilisi    | new        | 5      | 29       | 23       | 17       | 17       | 10       | 16       | 11       |
| wutong    | new        | 6      | 29       | 23       | 23       | 23       | 11       | 11       | 6        |
| lisi      | old        | 1      | 29       | 6        | 1        | 1        | 1        | 3        | 6        |
| wangshi   | old        | 2      | 29       | 6        | 3        | 3        | 3        | 6        | 5        |
| liwei     | old        | 3      | 29       | 6        | 6        | 6        | 5        | 5        | 3        |
+-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+
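
The sample3 and sample4 columns differ only where sales values tie: the default RANGE frame includes every row with the same ordering value, while a ROWS frame stops at the current row. A minimal Java sketch of the two running sums over the sorted sales of the new partition (the class and method names are illustrative):

```java
import java.util.Arrays;

// Sketch: running sums under a ROWS frame versus the default RANGE frame.
class RunningSumDemo {
    // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: sum up to the current position.
    static int[] rowsSum(int[] sorted) {
        int[] out = new int[sorted.length];
        int acc = 0;
        for (int i = 0; i < sorted.length; i++) {
            acc += sorted[i];
            out[i] = acc;
        }
        return out;
    }

    // RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: rows with equal values are peers
    // and all receive the sum up to the last peer.
    static int[] rangeSum(int[] sorted) {
        int[] rows = rowsSum(sorted);
        int[] out = new int[sorted.length];
        for (int i = sorted.length - 1; i >= 0; i--) {
            boolean sameAsNext = i + 1 < sorted.length && sorted[i + 1] == sorted[i];
            out[i] = sameAsNext ? out[i + 1] : rows[i];
        }
        return out;
    }

    public static void main(String[] args) {
        int[] sales = {1, 1, 2, 3, 5, 5, 6}; // sales of user_type = new, sorted
        System.out.println(Arrays.toString(rangeSum(sales))); // [2, 2, 4, 7, 17, 17, 23]
        System.out.println(Arrays.toString(rowsSum(sales)));  // [1, 2, 4, 7, 12, 17, 23]
    }
}
```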

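5. row_number(), rank(), and dense_rank()
row_number() numbers the rows consecutively, ignoring ties
rank() gives equal values the same rank and skips the ranks that follow
dense_rank() gives equal values the same rank without skipping ranks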

select user_id,user_type,sales,
row_number() over(partition by user_type order by sales),
rank() over(partition by user_type order by sales),
dense_rank() over(partition by user_type order by sales)
from order_detail;
+-----------+------------+--------+----------------------+----------------+----------------------+
|  user_id  | user_type  | sales  | row_number_window_0  | rank_window_1  | dense_rank_window_2  |
+-----------+------------+--------+----------------------+----------------+----------------------+
| qiuba     | new        | 1      | 1                    | 1              | 1                    |
| liliu     | new        | 1      | 2                    | 1              | 1                    |
| zhangsan  | new        | 2      | 3                    | 3              | 2                    |
| wagner    | new        | 3      | 4                    | 4              | 3                    |
| qishili   | new        | 5      | 5                    | 5              | 4                    |
| lilisi    | new        | 5      | 6                    | 5              | 4                    |
| wutong    | new        | 6      | 7                    | 7              | 5                    |
| lisi      | old        | 1      | 1                    | 1              | 1                    |
| wangshi   | old        | 2      | 2                    | 2              | 2                    |
| liwei     | old        | 3      | 3                    | 3              | 3                    |
+-----------+------------+--------+----------------------+----------------+----------------------+
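The difference between the three functions is easiest to see when they are computed by hand. The following is a minimal plain-Java sketch, not Hive code: the class and method names are made up for illustration, and it assumes the sales values of a single partition have already been sorted in ascending order.

```java
import java.util.Arrays;

// Plain-Java imitation of row_number(), rank() and dense_rank() for ONE
// partition whose values are already sorted ascending (hypothetical helper).
class RankingSketch {

    // row_number(): consecutive numbers, ties get no special treatment.
    static int[] rowNumber(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            out[i] = i + 1;
        }
        return out;
    }

    // rank(): tied rows share a rank; the next distinct value takes its row position.
    static int[] rank(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            boolean tie = i > 0 && sorted[i] == sorted[i - 1];
            out[i] = tie ? out[i - 1] : i + 1;
        }
        return out;
    }

    // dense_rank(): tied rows share a rank; the next distinct value takes previous rank + 1.
    static int[] denseRank(int[] sorted) {
        int[] out = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            if (i == 0) {
                out[i] = 1;
            } else {
                out[i] = sorted[i] == sorted[i - 1] ? out[i - 1] : out[i - 1] + 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] sales = {1, 1, 2, 3, 5, 5, 6}; // the "new" partition from the table above
        System.out.println(Arrays.toString(rowNumber(sales)));
        System.out.println(Arrays.toString(rank(sales)));
        System.out.println(Arrays.toString(denseRank(sales)));
    }
}
```

For the sales values of the new partition this produces the same three columns as the query output above.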

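6. lead() and lag()
lag() returns the value of the given column from the row N rows before the current row within the window. If that row does not exist, it returns the default value, or NULL when no default is given.
The first argument is the column, the second is the number of rows to look back, and the third is the default value.
lead() works the same way but looks N rows ahead.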

select user_id,user_type,sales,
lag(user_id,3,'userID') over(partition by user_type order by sales),
lead(user_id,3,'userID') over(partition by user_type order by sales)
from order_detail;
+-----------+------------+--------+---------------+----------------+
|  user_id  | user_type  | sales  | lag_window_0  | lead_window_1  |
+-----------+------------+--------+---------------+----------------+
| qiuba     | new        | 1      | userID        | wagner         |
| liliu     | new        | 1      | userID        | qishili        |
| zhangsan  | new        | 2      | userID        | lilisi         |
| wagner    | new        | 3      | qiuba         | wutong         |
| qishili   | new        | 5      | liliu         | userID         |
| lilisi    | new        | 5      | zhangsan      | userID         |
| wutong    | new        | 6      | wagner        | userID         |
| lisi      | old        | 1      | userID        | userID         |
| wangshi   | old        | 2      | userID        | userID         |
| liwei     | old        | 3      | userID        | userID         |
+-----------+------------+--------+---------------+----------------+
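The offset-and-default behavior can be imitated in a few lines of plain Java. This is only a sketch under assumptions: the class and method names are invented, and the list stands for the rows of one partition in window order.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java imitation of lag()/lead() over one ordered partition (hypothetical helper).
class LagLeadSketch {

    // Returns the row at the given index, or the default when the index is outside the partition.
    static String valueAt(List<String> rows, int index, String dflt) {
        return (index >= 0 && index < rows.size()) ? rows.get(index) : dflt;
    }

    // lag(col, n, dflt): value n rows before the current row.
    static String lag(List<String> rows, int current, int n, String dflt) {
        return valueAt(rows, current - n, dflt);
    }

    // lead(col, n, dflt): value n rows after the current row.
    static String lead(List<String> rows, int current, int n, String dflt) {
        return valueAt(rows, current + n, dflt);
    }

    public static void main(String[] args) {
        // user_id values of the "new" partition, in the order shown in the table above
        List<String> users = Arrays.asList(
                "qiuba", "liliu", "zhangsan", "wagner", "qishili", "lilisi", "wutong");
        for (int i = 0; i < users.size(); i++) {
            System.out.println(users.get(i) + "\t"
                    + lag(users, i, 3, "userID") + "\t"
                    + lead(users, i, 3, "userID"));
        }
    }
}
```

The old partition has only three rows, so an offset of 3 always falls outside it, which is why every row there shows the default value in both columns.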

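7. first_value() and last_value()
first_value() returns the first value in the window frame, and last_value() returns the last value in the window frame. When the over clause has an order by but no explicit frame, the frame defaults to range between unbounded preceding and current row, so last_value() returns the last row that ties with the current row rather than the last row of the partition, as the output below shows. To get the last value of the whole partition, add rows between unbounded preceding and unbounded following to the over clause.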

select user_id,user_type,sales,
first_value(user_id) over(partition by user_type order by sales),
last_value(user_id) over(partition by user_type order by sales)
from order_detail;
+-----------+------------+--------+-----------------------+----------------------+
|  user_id  | user_type  | sales  | first_value_window_0  | last_value_window_1  |
+-----------+------------+--------+-----------------------+----------------------+
| qiuba     | new        | 1      | qiuba                 | liliu                |
| liliu     | new        | 1      | qiuba                 | liliu                |
| zhangsan  | new        | 2      | qiuba                 | zhangsan             |
| wagner    | new        | 3      | qiuba                 | wagner               |
| qishili   | new        | 5      | qiuba                 | lilisi               |
| lilisi    | new        | 5      | qiuba                 | lilisi               |
| wutong    | new        | 6      | qiuba                 | wutong               |
| lisi      | old        | 1      | lisi                  | lisi                 |
| wangshi   | old        | 2      | lisi                  | wangshi              |
| liwei     | old        | 3      | lisi                  | liwei                |
+-----------+------------+--------+-----------------------+----------------------+

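III. User-Defined Functions

Why write your own functions?
Sometimes Hive's built-in functions do not cover what is needed, and a user-defined function fills the gap.

UDF, UDAF, and UDTF compared

UDF: operates on a single row and produces a single row as output. Most functions, such as the math and string functions, fall into this category. One in, one out.
UDAF: accepts multiple input rows and produces a single output row. Aggregate functions such as COUNT and MAX belong here. Many in, one out.
UDTF: operates on a single row and produces multiple rows (a table) as output, for example explode() used with lateral view. One in, many out.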

Prepare the data

2,1,2109,Marry,13,sing#dance
3,5,3507,Lili,14,run#shopping
1,9,1915,Bob,12,read#binge-watching

Create the table

create table if not exists students (
grade_id string comment 'grade',
class_id string comment 'class',
stu_id string comment 'student ID',
stu_name string comment 'name',
stu_age string comment 'age',
hoppy string comment 'hobbies'
)
comment 'student information'
row format delimited fields terminated by ',' lines terminated by '\n'
;
-- load the data
load data local inpath '/home/huangwei/input/students.txt' into table students;

pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hive-udf</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <hadoop.version>2.7.3</hadoop.version>
        <hive.version>2.3.6</hive.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>2.7.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>2.3.6</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>2.3.6</version>
        </dependency>
        <dependency>
            <groupId>log4j</groupId>
            <artifactId>log4j</artifactId>
            <version>1.2.17</version>
        </dependency>
    </dependencies>

    <build>
        <finalName>hive-udf</finalName>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <mainClass></mainClass>
                        </manifest>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <!-- With this execution, plain mvn package or mvn install builds the assembly -->
                <!-- Without it, run mvn package assembly:single -->
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

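Custom UDF

Writing a UDF takes the following steps:

Extend org.apache.hadoop.hive.ql.exec.UDF.
Implement evaluate(). This method is not declared by an interface, because the number and types of its arguments are not fixed. Hive inspects the UDF class to find an evaluate() method that matches the function call.

Goal: Base64 encoding and decoding

Functions that need more control, for example over complex argument types, extend the abstract class org.apache.hadoop.hive.ql.udf.generic.GenericUDF instead. The two examples here use the simpler UDF base class.
Base64 encoding UDF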

package udf;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import sun.misc.BASE64Encoder;

import java.nio.charset.StandardCharsets;

/**
 * @Author: H.w
 * @Date: 2021/1/7 3:45 PM
 * @Description: Base64 encoding UDF
 **/
public class Base64Encrypt extends UDF {

    public String evaluate(String message) throws Exception {
        // Return an empty string for null or blank input
        if (StringUtils.isBlank(message)) {
            return "";
        }
        // Base64-encode the UTF-8 bytes of the input
        byte[] bt = message.getBytes(StandardCharsets.UTF_8);
        String newMsg = new BASE64Encoder().encode(bt);
        // The encoder may wrap long output across lines; strip every line break
        return newMsg.replace("\r", "").replace("\n", "");
    }
}

Base64 decoding UDF

package udf;

import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import sun.misc.BASE64Decoder;

/**
 * @Author: H.w
 * @Date: 2021/1/7 3:56 PM
 * @Description: Base64 decoding UDF
 **/
public class Base64Decrypt extends UDF {

    public String evaluate(String msg) throws Exception {
        // Return an empty string for null or blank input
        if (StringUtils.isBlank(msg)) {
            return "";
        }
        // Base64-decode the input and read the bytes as UTF-8
        String result = null;
        try {
            byte[] bt = new BASE64Decoder().decodeBuffer(msg);
            result = new String(bt, "utf-8");
        } catch (Exception e) {
            // Decoding failed: print the error and return null
            e.printStackTrace();
        }
        return result;
    }
}
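The two UDFs above rely on sun.misc.BASE64Encoder and sun.misc.BASE64Decoder, which are internal JDK classes and are no longer available from JDK 9 onward. The sketch below shows the same encode and decode logic with the public java.util.Base64 API (available since Java 8). It is plain Java with a made-up class name, not a Hive UDF; the assumption is that the two static methods could serve as the bodies of the evaluate() methods.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Same encode/decode logic as the UDFs above, written against java.util.Base64.
// Base64Sketch is a hypothetical name used only for this illustration.
class Base64Sketch {

    // Encodes the UTF-8 bytes of the message; blank input yields an empty string.
    static String encode(String message) {
        if (message == null || message.trim().isEmpty()) {
            return "";
        }
        // The basic encoder emits no line breaks, so nothing has to be stripped.
        return Base64.getEncoder().encodeToString(message.getBytes(StandardCharsets.UTF_8));
    }

    // Decodes Base64 text to a UTF-8 string; returns null when the input is not valid Base64.
    static String decode(String msg) {
        if (msg == null || msg.trim().isEmpty()) {
            return "";
        }
        try {
            return new String(Base64.getDecoder().decode(msg), StandardCharsets.UTF_8);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String encoded = encode("hello");
        System.out.println(encoded);
        System.out.println(decode(encoded));
    }
}
```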

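Custom UDAF

A user-defined aggregate function (UDAF) takes many input rows and produces one output row, like count, sum, and max. With the older UDAF API, writing one takes the following steps:

The function class must extend org.apache.hadoop.hive.ql.exec.UDAF and contain one or more nested static classes that implement org.apache.hadoop.hive.ql.exec.UDAFEvaluator.
Each nested evaluator class must implement the init, iterate, terminatePartial, merge, and terminate functions.

Function	Description
init	Implements init() of the UDAFEvaluator interface and initializes the internal state
iterate	Called for every new value to aggregate; updates the internal state with that value
terminatePartial	Takes no arguments; called when iterate has finished and returns the partial aggregation state
merge	Receives the result of terminatePartial and merges it into the current state; in the older API its return type is boolean
terminate	Returns the final result of the aggregation

The example below uses the generic API instead (AbstractGenericUDAFResolver with a GenericUDAFEvaluator), which goes through the same stages.

Goal: compute the average of a column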

package udaf;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.*;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.DoubleObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.LongObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.util.StringUtils;

import java.util.ArrayList;

/**
 * @Author: H.w
 * @Date: 2021/1/8 10:13 AM
 * @Description: computes the average of a column
 **/
public class GenericUDAFAverage extends AbstractGenericUDAFResolver {

    static final Log log = LogFactory.getLog(GenericUDAFAverage.class.getName());

    /**
     * Validates the argument types and, when they are acceptable,
     * returns the evaluator that performs the aggregation.
     */
    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] info) throws SemanticException {
        if (info.length != 1) {
            throw new UDFArgumentTypeException(info.length - 1, "Exactly one argument is expected.");
        }
        // Hive uses ObjectInspectors to describe the structure of rows and columns
        if (info[0].getCategory() != ObjectInspector.Category.PRIMITIVE) {
            throw new UDFArgumentTypeException(0,
                    "Only primitive type arguments are accepted but " + info[0].getTypeName() + " is passed.");
        }
        // Dispatch on the PrimitiveCategory enum
        switch (((PrimitiveTypeInfo) info[0]).getPrimitiveCategory()) {
            case BYTE:
            case SHORT:
            case INT:
            case LONG:
            case FLOAT:
            case DOUBLE:
            case STRING:
            case TIMESTAMP:
                return new GenericUDAFAverageEvaluator();
            case BOOLEAN:
            default:
                throw new UDFArgumentTypeException(0,
                        "Only numeric or string type arguments are accepted but " + info[0].getTypeName() + " is passed.");
        }
    }

    /**
     * GenericUDAFAverageEvaluator.
     * Static nested class that does the actual work; extends GenericUDAFEvaluator.
     */
    public static class GenericUDAFAverageEvaluator extends GenericUDAFEvaluator {

        // 1.1 ObjectInspectors (OIs) used to read the input and output data
        // input for PARTIAL1 and COMPLETE
        PrimitiveObjectInspector inputOI;
        // input for PARTIAL2 and FINAL, output for PARTIAL1 and PARTIAL2
        StructObjectInspector soi;
        StructField countField;
        StructField sumField;
        LongObjectInspector countFieldOI;
        DoubleObjectInspector sumFieldOI;

        // 1.2 Objects that hold the actual output data
        // output for PARTIAL1 and PARTIAL2
        Object[] partialResult;
        // output for FINAL and COMPLETE
        DoubleWritable result;

        /*
         * Initialization: runs once for every mode. It records the OI of the
         * input and returns the OI of the output.
         * 1. Input (parameters):
         *    1.1 PARTIAL1 and COMPLETE receive the raw data (a single value), so
         *        the argument of iterate() is read through a PrimitiveObjectInspector.
         *    1.2 PARTIAL2 and FINAL receive partial aggregates (count and sum), so
         *        the argument of merge() is read through a StructObjectInspector
         *        (a StandardStructObjectInspector instance).
         * 2. Returned OI:
         *    2.1 PARTIAL1 and PARTIAL2: the OI of the value returned by
         *        terminatePartial(), a StandardStructObjectInspector.
         *    2.2 FINAL and COMPLETE: the OI of the value returned by terminate(),
         *        a WritableDoubleObjectInspector.
         */
        @Override
        public ObjectInspector init(Mode mode, ObjectInspector[] parameters) throws HiveException {
            assert (parameters.length == 1);
            super.init(mode, parameters);

            // init input
            if (mode == Mode.PARTIAL1 || mode == Mode.COMPLETE) {
                inputOI = (PrimitiveObjectInspector) parameters[0];
            } else {
                // Partial aggregates arrive as a struct; keep the struct OI to read them
                soi = (StructObjectInspector) parameters[0];
                countField = soi.getStructFieldRef("count");
                sumField = soi.getStructFieldRef("sum");
                // Each field of the struct is read through its own primitive OI
                countFieldOI = (LongObjectInspector) countField.getFieldObjectInspector();
                sumFieldOI = (DoubleObjectInspector) sumField.getFieldObjectInspector();
            }

            // init output
            if (mode == Mode.PARTIAL1 || mode == Mode.PARTIAL2) {
                // The output of a partial aggregation is a struct containing
                // a "long" count and a "double" sum, held here in an array.
                partialResult = new Object[2];
                partialResult[0] = new LongWritable(0);
                partialResult[1] = new DoubleWritable(0);
                // Build the struct OI from a list of field names and a list of field OIs
                ArrayList<String> fname = new ArrayList<String>();
                fname.add("count");
                fname.add("sum");
                ArrayList<ObjectInspector> foi = new ArrayList<ObjectInspector>();
                // These two OIs describe the two elements of partialResult, so the types must match
                foi.add(PrimitiveObjectInspectorFactory.writableLongObjectInspector);
                foi.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
                return ObjectInspectorFactory.getStandardStructObjectInspector(fname, foi);
            } else {
                // FINAL and COMPLETE: the result is a single number described by a primitive OI
                result = new DoubleWritable(0);
                return PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
            }
        }

        /*
         * Buffer that holds the running aggregate
         */
        static class AverageAgg implements AggregationBuffer {
            long count;
            double sum;
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            AverageAgg result = new AverageAgg();
            reset(result);
            return result;
        }

        @Override
        public void reset(AggregationBuffer agg) throws HiveException {
            AverageAgg myagg = (AverageAgg) agg;
            myagg.count = 0;
            myagg.sum = 0;
        }

        boolean warned = false;

        /*
         * Processes one raw input row
         */
        @Override
        public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException {
            assert (parameters.length == 1);
            Object p = parameters[0];
            if (p != null) {
                AverageAgg myagg = (AverageAgg) agg;
                try {
                    // Read the value of p through the primitive input OI
                    double v = PrimitiveObjectInspectorUtils.getDouble(p, inputOI);
                    myagg.count++;
                    myagg.sum += v;
                } catch (NumberFormatException e) {
                    if (!warned) {
                        warned = true;
                        log.warn(getClass().getSimpleName() + " " + StringUtils.stringifyException(e));
                        log.warn(getClass().getSimpleName() + " ignoring similar exceptions.");
                    }
                }
            }
        }

        /*
         * Returns the partial aggregate
         */
        @Override
        public Object terminatePartial(AggregationBuffer agg) throws HiveException {
            AverageAgg myagg = (AverageAgg) agg;
            ((LongWritable) partialResult[0]).set(myagg.count);
            ((DoubleWritable) partialResult[1]).set(myagg.sum);
            return partialResult;
        }

        /*
         * Merges a partial aggregate into the buffer.
         * Note: Object[] is a subtype of Object; partial is an Object[] here.
         */
        @Override
        public void merge(AggregationBuffer agg, Object partial) throws HiveException {
            if (partial != null) {
                AverageAgg myagg = (AverageAgg) agg;
                // Use the struct OI to pull the two fields out of partial
                Object partialCount = soi.getStructFieldData(partial, countField);
                Object partialSum = soi.getStructFieldData(partial, sumField);
                // Read each field through its primitive OI
                myagg.count += countFieldOI.get(partialCount);
                myagg.sum += sumFieldOI.get(partialSum);
            }
        }

        /*
         * Returns the final result
         */
        @Override
        public Object terminate(AggregationBuffer agg) throws HiveException {
            AverageAgg myagg = (AverageAgg) agg;
            if (myagg.count == 0) {
                return null;
            } else {
                result.set(myagg.sum / myagg.count);
                return result;
            }
        }
    }
}
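The order in which the evaluator's methods are called is easier to follow without the ObjectInspector plumbing. The sketch below imitates it in plain Java under simplifying assumptions: there are no Hive classes, the names are invented for illustration, each inner array stands for the rows seen by one map task, and a single reduce-side buffer merges the partial results.

```java
// Plain-Java imitation of the UDAF call sequence for an average:
// iterate -> terminatePartial on each map side, then merge -> terminate on the reduce side.
// All names here are hypothetical; no Hive classes are involved.
class AverageLifecycleSketch {

    // Plays the role of the aggregation buffer: a running count and sum.
    static final class Buffer {
        long count;
        double sum;
    }

    // Called once per input value on the map side.
    static void iterate(Buffer b, double v) {
        b.count++;
        b.sum += v;
    }

    // Hands the partial state (count, sum) to the next stage.
    static Object[] terminatePartial(Buffer b) {
        return new Object[]{b.count, b.sum};
    }

    // Folds one partial state into the reduce-side buffer.
    static void merge(Buffer b, Object[] partial) {
        b.count += (Long) partial[0];
        b.sum += (Double) partial[1];
    }

    // Produces the final value, or null when no rows were seen.
    static Double terminate(Buffer b) {
        if (b.count == 0) {
            return null;
        }
        return b.sum / b.count;
    }

    // Drives the whole sequence; each inner array is the input of one map task.
    static Double average(double[][] splits) {
        Buffer reduceBuffer = new Buffer();
        for (double[] split : splits) {
            Buffer mapBuffer = new Buffer();
            for (double v : split) {
                iterate(mapBuffer, v);
            }
            merge(reduceBuffer, terminatePartial(mapBuffer));
        }
        return terminate(reduceBuffer);
    }

    public static void main(String[] args) {
        // The three stu_age values from the sample data, split across two imaginary map tasks
        System.out.println(average(new double[][]{{13, 14}, {12}}));
    }
}
```

Passing (count, sum) between stages instead of a partial average is what makes the merge correct when the map tasks see different numbers of rows.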

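Custom UDTF

A UDTF (User-Defined Table-Generating Function) handles the case where one input row produces multiple output rows.

To write a UDTF, extend the GenericUDTF class and override the initialize, process, and close methods.

Function	Description
initialize	Declares the names and types of the output columns
process	Called for each input row; emits one or more output rows by calling forward()
close	Called once after all rows have been processed, for any final cleanup

Goal: split one column into several columns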

package udtf;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

/**
 * @Author: H.w
 * @Date: 2021/1/8 11:35 AM
 * @Description: UDTF
 **/
public class GenericUDTFColumns extends GenericUDTF {

    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Names of the output columns; add one entry per column
        ArrayList<String> columns = new ArrayList<>();
        columns.add("hoppy1");
        columns.add("hoppy2");
        // Types of the output columns, in the same order
        ArrayList<ObjectInspector> columnType = new ArrayList<ObjectInspector>();
        columnType.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        columnType.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(columns, columnType);
    }

    @Override
    public void process(Object[] objects) throws HiveException {
        // The sample data separates the hobbies with '#', e.g. sing#dance
        forward(objects[0].toString().split("#"));
    }

    @Override
    public void close() throws HiveException {
    }
}

Create temporary functions

Package the project (mvn package)

Add the jar to the Hive session

hive> add jar /home/huangwei/IdeaProjects/hive-udf/target/hive-udf-jar-with-dependencies.jar;
Added [/home/huangwei/IdeaProjects/hive-udf/target/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [/home/huangwei/IdeaProjects/hive-udf/target/hive-udf-jar-with-dependencies.jar]

Register the temporary functions

hive> create temporary function base_en as 'udf.Base64Encrypt';
OK
Time taken: 0.281 seconds
hive> create temporary function base_de as 'udf.Base64Decrypt';
OK
Time taken: 0.004 seconds
hive> create temporary function generic_avg as 'udaf.GenericUDAFAverage';
OK
Time taken: 0.052 seconds
hive> create temporary function generic_col as 'udtf.GenericUDTFColumns';
OK
Time taken: 0.003 seconds

Note: these are temporary functions. They exist only in the current session and are gone once Hive is restarted.

Test the UDFs

hive> select base_en('hello');
OK
aGVsbG8=
Time taken: 0.462 seconds, Fetched: 1 row(s)
hive> select base_de('aGVsbG8=');
OK
hello
Time taken: 0.078 seconds, Fetched: 1 row(s)

Test the UDAF

hive> select generic_avg(stu_age) from students;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210108155656_c7f018f3-c163-410a-9784-5b1c3e0f6913
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:set mapreduce.job.reduces=<number>
Starting Job = job_1610070707037_0001, Tracking URL = http://localhost:8088/proxy/application_1610070707037_0001/
Kill Command = /usr/local/hadoop-2.7.3/bin/hadoop job  -kill job_1610070707037_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-01-08 15:57:04,666 Stage-1 map = 0%,  reduce = 0%
2021-01-08 15:57:07,834 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.02 sec
2021-01-08 15:57:12,983 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.04 sec
MapReduce Total cumulative CPU time: 3 seconds 40 msec
Ended Job = job_1610070707037_0001
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.04 sec   HDFS Read: 8783 HDFS Write: 104 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 40 msec
OK
13.0
Time taken: 17.124 seconds, Fetched: 1 row(s)

Test the UDTF

hive> select generic_col(hoppy) from students;
OK
sing    dance
run shopping
read    binge-watching
Time taken: 1.017 seconds, Fetched: 3 row(s)

Create permanent functions

Upload the jar to HDFS

[root@localhost lib]# hdfs dfs -mkdir -p /hive/lib;
[root@localhost lib]# hdfs dfs -put /home/huangwei/IdeaProjects/hive-udf/target/hive-udf-jar-with-dependencies.jar /hive/lib

Create the permanent UDFs

hive> create function base64en as 'udf.Base64Encrypt' using jar 'hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar';
Added [/opt/hive/tmp/217fe48a-a64c-42d4-8696-e07c57e6ee13_resources/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar]
OK
Time taken: 0.371 seconds
hive> create function base64de as 'udf.Base64Decrypt' using jar 'hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar';
Added [/opt/hive/tmp/689060e4-4c5e-4126-a685-c9d008ea75cb_resources/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar]
OK
Time taken: 2.665 seconds

Verify:

hive> select base64en('hello');
Added [/opt/hive/tmp/580da8d8-3cfa-4b2f-a862-03c37fc66ff4_resources/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar]
OK
aGVsbG8=
Time taken: 2.86 seconds, Fetched: 1 row(s)
hive> select base64de('aGVsbG8=');
Added [/opt/hive/tmp/580da8d8-3cfa-4b2f-a862-03c37fc66ff4_resources/hive-udf-jar-with-dependencies.jar] to class path
Added resources: [hdfs://localhost:9000/hive/lib/hive-udf-jar-with-dependencies.jar]
OK
hello
Time taken: 0.141 seconds, Fetched: 1 row(s)

Check the metastore in MySQL

mysql> SELECT * FROM FUNCS;
+---------+-------------------+-------------+-------+-----------+-----------+------------+------------+
| FUNC_ID | CLASS_NAME        | CREATE_TIME | DB_ID | FUNC_NAME | FUNC_TYPE | OWNER_NAME | OWNER_TYPE |
+---------+-------------------+-------------+-------+-----------+-----------+------------+------------+
|       1 | udf.Base64Encrypt |  1610011833 |     1 | base64en  |         1 | NULL       | USER       |
|       6 | udf.Base64Decrypt |  1610011914 |     1 | base64de  |         1 | NULL       | USER       |
+---------+-------------------+-------------+-------+-----------+-----------+------------+------------+
2 rows in set (0.00 sec)

The two functions are now registered in the metastore.
