Big Data Development with Hive, Part 12: Hive Regular Expressions
Note:
Hive version 2.1.1
Table of Contents
- 1. Overview of Hive Regular Expressions
- 1.1 Character Classes
- 1.2 Boundary Matchers
- 1.3 Quantifiers
- 1.4 Combining Operators
- 1.5 Matching Operators
- 1.6 The Escape Operator
- 2. Hive Regular Expression Examples
- 2.1 regexp
- 2.2 regexp_replace
- 2.2.1 Extracting the Chinese characters from a string
- 2.2.2 Extracting the digits from a string
- 2.2.3 Extracting the letters from a string
- 2.2.4 Extracting the letters and digits from a string
- 2.3 regexp_extract
- References
1. Overview of Hive Regular Expressions
Hive's regular expression support is not as rich as that of a relational database, but it still solves many of the problems that come up in day-to-day HQL development, and data practitioners rely on regular expressions constantly. This post is a short summary of Hive regular expressions; all of the code below has been tested and runs correctly.
Hive provides the following three regular-expression functions:
- regexp
- regexp_extract
- regexp_replace
1.1 Character Classes

Character | Matches |
---|---|
\d | any digit 0-9 |
\D | any non-digit character |
\w | any word character: A-Z, a-z, 0-9, and underscore |
\W | any non-word character |
\s | any whitespace character: tab, newline, carriage return, form feed, and vertical tab |
\S | any non-whitespace character |
. | any character |

Named character classes: Hive regular expressions follow Java regex syntax, where these are written \p{...}: \p{Alpha} (any letter), \p{Digit} (any digit), \p{Alnum} (any letter or digit), \p{Space} (any whitespace), \p{Upper} (any uppercase letter), \p{Lower} (any lowercase letter), \p{Punct} (any punctuation), \p{XDigit} (any hexadecimal digit, equivalent to [0-9a-fA-F]). Note that the POSIX bracket forms such as [[:alpha:]] are not supported by Java regex, and therefore not by Hive.
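Since Hive delegates pattern matching to Java's regex engine, the predefined classes above behave as in most regex flavors. A minimal sketch, using Python's `re` module purely as an illustration (its semantics match Java's for these classes on ASCII input; remember that in a HiveQL string literal each backslash must be doubled, e.g. '\\d'):

```python
import re

# \d, \D, \w, \S behave the same in Java (Hive) and Python regex.
assert re.findall(r'\d', 'a1b22') == ['1', '2', '2']   # digits 0-9
assert re.findall(r'\D', 'a1b2') == ['a', 'b']         # non-digits
assert re.findall(r'\w', 'a_1-') == ['a', '_', '1']    # word characters
assert re.findall(r'\S', 'a b') == ['a', 'b']          # non-whitespace
```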
1.2 Boundary Matchers

Character | Description |
---|---|
^ | start of each line; in single-line mode, equivalent to the start of the string |
$ | end of each line; in single-line mode, equivalent to the end of the string |
1.3 Quantifiers
Greedy quantifiers match as many characters as possible; reluctant (non-greedy) quantifiers match as few as possible.

Greedy | Reluctant | Description |
---|---|---|
* | *? | zero or more times |
? | ?? | zero or one time |
+ | +? | one or more times |
{m} | {m}? | exactly m times (greedy and reluctant behave identically) |
{m,} | {m,}? | at least m times |
{m,n} | {m,n}? | at least m and at most n times |
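The greedy/reluctant distinction is easiest to see on a concrete string. A small sketch using Python's `re` (quantifier semantics match Java's here):

```python
import re

s = '<a><b>'
# Greedy .* consumes as much as possible, so <.*> swallows both tags.
assert re.search(r'<.*>', s).group() == '<a><b>'
# Reluctant .*? stops at the first closing >, matching only one tag.
assert re.search(r'<.*?>', s).group() == '<a>'
# {2,} vs {2,}?: at least two digits, as many vs as few as possible.
assert re.search(r'\d{2,}', 'id=12345x').group() == '12345'
assert re.search(r'\d{2,}?', 'id=12345x').group() == '12'
```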
1.4 Combining Operators
Precedence, from highest to lowest: parentheses > quantifiers > concatenation > alternation.

Operator | Description |
---|---|
[…] | matches any single character from the set inside the brackets |
[^…] | when ^ is the first character inside the brackets, matches any character not in the set |
(…) | parentheses: group a complex expression so it can be treated as a single unit |
…\|… | alternation (or) |
abc | concatenation (and): simply write the characters one after another |
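The precedence rules mean that ab|cd parses as (ab)|(cd), not a(b|c)d. A quick sketch of the difference (Python's `re`; Java parses these the same way):

```python
import re

# Concatenation binds tighter than alternation: ab|cd = (ab)|(cd).
assert re.fullmatch(r'ab|cd', 'ab') is not None
assert re.fullmatch(r'ab|cd', 'ad') is None
# Parentheses override the default grouping: a(b|c)d = abd or acd.
assert re.fullmatch(r'a(b|c)d', 'abd') is not None
assert re.fullmatch(r'a(b|c)d', 'acd') is not None
# Quantifiers bind tighter than concatenation: ab* = a, then zero+ b's.
assert re.fullmatch(r'ab*', 'abbb') is not None
assert re.fullmatch(r'ab*', 'abab') is None
```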
1.5 Matching Operators

Operator | Description |
---|---|
\n | a backreference, with n from 1 to 9: matches the text captured by the n-th parenthesized group, counting groups from left to right. In regexp_replace, backreferences can be used in the pattern; in the replacement string, Hive follows Java's convention and group references are written $n |
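A sketch of backreferences using Python's `re`. The pattern-side \1 syntax is identical in Java; on the replacement side Python happens to use \1 where Hive's Java-based regexp_replace uses $1, so the Hive equivalent of the swap below would be regexp_replace('12-34','(\\d+)-(\\d+)','$2-$1'):

```python
import re

# In the pattern, \1 re-matches exactly what group 1 captured:
# here it finds the first doubled letter.
assert re.search(r'(\w)\1', 'hello').group() == 'll'
# In the replacement, group references swap the two numbers.
# (Python writes them \1, \2; Hive/Java would use $1, $2.)
assert re.sub(r'(\d+)-(\d+)', r'\2-\1', '12-34') == '34-12'
```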
1.6 The Escape Operator

Operator | Description |
---|---|
\ | treats the operator character that follows it as a literal. For example, abc*def matches abdef, abcccdef, and so on, but not the literal string "abc*def"; to match the latter you need abc\*def |
2. Hive Regular Expression Examples
2.1 regexp
Syntax:
A REGEXP B
Equivalent to RLIKE.
Returns NULL if A or B is NULL, TRUE if any substring of A matches the Java regular expression B, and FALSE otherwise. (The wildcards "_" and "%" belong to the LIKE operator, not to REGEXP; in a regular expression they are ordinary characters.)
Code:

```sql
-- match strings containing 8 consecutive digits
with tmp1 as
(
select '11145678abc' as rn
union all
select '111456789abc'
union all
select 'd111456789abc'
)
select rn
from tmp1
where rn regexp '\\d{8}';

-- match strings beginning with 8 or more consecutive digits
with tmp1 as
(
select '11145678abc' as rn
union all
select '111456789abc'
union all
select 'd111456789abc'
)
select rn
from tmp1
where rn regexp '^\\d{8}';

-- match strings beginning with exactly 8 consecutive digits (followed by a non-digit)
with tmp1 as
(
select '11145678abc' as rn
union all
select '111456789abc'
union all
select 'd111456789abc'
)
select rn
from tmp1
where rn regexp '^\\d{8}\\D';
```
Test run:
hive> -- match strings containing 8 consecutive digits
hive> with tmp1 as
    > (
    > select '11145678abc' as rn
    > union all
    > select '111456789abc'
    > union all
    > select 'd111456789abc'
    > )
    > select rn
    > from tmp1
    > where rn regexp '\\d{8}';
Query ID = root_20201217151846_3f0eebcd-6f5a-455d-b8c9-afd5cddbc358
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0295, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0295/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0295
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-17 15:18:52,063 Stage-1 map = 0%, reduce = 0%
2020-12-17 15:18:58,242 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.73 sec
MapReduce Total cumulative CPU time: 1 seconds 730 msec
Ended Job = job_1606698967173_0295
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 1.73 sec HDFS Read: 5150 HDFS Write: 162 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 730 msec
OK
rn
11145678abc
111456789abc
d111456789abc
Time taken: 13.291 seconds, Fetched: 3 row(s)
hive> -- match strings beginning with 8 or more consecutive digits
hive> with tmp1 as
    > (
    > select '11145678abc' as rn
    > union all
    > select '111456789abc'
    > union all
    > select 'd111456789abc'
    > )
    > select rn
    > from tmp1
    > where rn regexp '^\\d{8}';
Query ID = root_20201217151946_c5102f51-5e70-4a80-afa6-3678f92091f0
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0296, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0296/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0296
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:19:54,003 Stage-1 map = 0%, reduce = 0%
2020-12-17 15:20:00,174 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.85 sec
MapReduce Total cumulative CPU time: 3 seconds 850 msec
Ended Job = job_1606698967173_0296
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 3.85 sec HDFS Read: 10856 HDFS Write: 223 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 850 msec
OK
rn
11145678abc
111456789abc
Time taken: 15.275 seconds, Fetched: 2 row(s)
hive> -- match strings beginning with exactly 8 consecutive digits
hive> with tmp1 as
    > (
    > select '11145678abc' as rn
    > union all
    > select '111456789abc'
    > union all
    > select 'd111456789abc'
    > )
    > select rn
    > from tmp1
    > where rn regexp '^\\d{8}\\D';
Query ID = root_20201217152016_c920f47c-663c-4d4f-a9df-e9c77d20a126
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0297, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0297/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0297
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:20:22,628 Stage-1 map = 0%, reduce = 0%
2020-12-17 15:20:28,804 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.61 sec
MapReduce Total cumulative CPU time: 3 seconds 610 msec
Ended Job = job_1606698967173_0297
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 3.61 sec HDFS Read: 10990 HDFS Write: 198 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 610 msec
OK
rn
11145678abc
Time taken: 13.633 seconds, Fetched: 1 row(s)
hive>
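The three results above follow from the fact that REGEXP is a "contains" test: the pattern only needs to match some substring, unless it is anchored with ^ or $. The same filtering logic sketched outside Hive (Python's `re.search` has the same substring semantics):

```python
import re

rows = ['11145678abc', '111456789abc', 'd111456789abc']
# Unanchored: any 8 consecutive digits anywhere in the string.
assert [r for r in rows if re.search(r'\d{8}', r)] == rows
# Anchored at the start: 8+ digits must begin the string.
assert [r for r in rows if re.search(r'^\d{8}', r)] == \
    ['11145678abc', '111456789abc']
# Exactly 8 leading digits: the 9th character must be a non-digit.
assert [r for r in rows if re.search(r'^\d{8}\D', r)] == ['11145678abc']
```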
2.2 regexp_replace
Syntax:
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
Returns the string produced by replacing every substring of INITIAL_STRING that matches the Java regular expression PATTERN with REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb'. Note that some care is needed with predefined character classes: passing '\s' as the second argument matches the letter s, while '\\s' is needed to match whitespace, and so on.
Test data:

```sql
create table test_reg(id int,str string);
insert into test_reg values (1,'我在学习Hive,大数据。');
insert into test_reg values (2,'Hive,我来了,Coming666。');
insert into test_reg values (3,'666,Hive居然拥有关系型数据库的诸多特性。');
insert into test_reg values (4,'wuwuwu,Hive学习起来还是存在一定难度。');
insert into test_reg values (5,'Hive数据仓库,6666。');
```
2.2.1 Extracting the Chinese characters from a string
Code:

```sql
select id,regexp_replace(str,'([^\\u4E00-\\u9FA5]+)','') new_str
from test_reg;
```
Test run:
hive> select id,regexp_replace(str,'([^\\u4E00-\\u9FA5]+)','') new_str
    > from test_reg;
Query ID = root_20201217153549_3a7163ba-365b-4f65-adad-930d42ec385d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0303, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0303/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0303
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:35:56,518 Stage-1 map = 0%, reduce = 0%
2020-12-17 15:36:02,702 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.32 sec
MapReduce Total cumulative CPU time: 6 seconds 320 msec
Ended Job = job_1606698967173_0303
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.32 sec HDFS Read: 8951 HDFS Write: 495 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 320 msec
OK
id new_str
3 居然拥有关系型数据库的诸多特性
4 学习起来还是存在一定难度
5 数据仓库
1 我在学习大数据
2 我来了
Time taken: 14.196 seconds, Fetched: 5 row(s)
hive>
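The idea behind the query is to invert the character class: [^\u4E00-\u9FA5]+ matches every run of non-Chinese characters, and replacing those runs with the empty string leaves only the Chinese text. The same pattern reproduced with Python's `re` as an illustration (the \uXXXX range works identically in Java):

```python
import re

# Delete every run of characters outside the CJK range U+4E00..U+9FA5.
strip_non_cjk = lambda s: re.sub(r'[^\u4e00-\u9fa5]+', '', s)

assert strip_non_cjk('我在学习Hive,大数据。') == '我在学习大数据'
assert strip_non_cjk('Hive数据仓库,6666。') == '数据仓库'
```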
2.2.2 Extracting the digits from a string
Code:

```sql
select id,regexp_replace(str,'([^0-9]+)','') new_str
from test_reg;
```
Test run:
hive> select id,regexp_replace(str,'([^0-9]+)','') new_str
    > from test_reg;
Query ID = root_20201217153822_cf2389b9-8533-4b82-b472-57df3a1da418
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0304, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0304/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0304
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:38:30,220 Stage-1 map = 0%, reduce = 0%
2020-12-17 15:38:37,427 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.19 sec
MapReduce Total cumulative CPU time: 6 seconds 190 msec
Ended Job = job_1606698967173_0304
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.19 sec HDFS Read: 9139 HDFS Write: 259 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 190 msec
OK
id new_str
3 666
4
5 6666
1
2 666
Time taken: 16.038 seconds, Fetched: 5 row(s)
hive>
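As above, [^0-9]+ deletes every run of non-digit characters; rows with no digits at all come back as an empty string (ids 1 and 4 in the output). The same transformation sketched with Python's `re`:

```python
import re

# Delete every run of non-digit characters, keeping only 0-9.
strip_non_digits = lambda s: re.sub(r'[^0-9]+', '', s)

assert strip_non_digits('666,Hive居然拥有关系型数据库的诸多特性。') == '666'
assert strip_non_digits('Hive数据仓库,6666。') == '6666'
# A string with no digits collapses to the empty string.
assert strip_non_digits('我在学习Hive,大数据。') == ''
```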
2.2.3 Extracting the letters from a string
Code:

```sql
select id,regexp_replace(str,'([^a-zA-Z]+)','') new_str
from test_reg;
```
Test run:
hive> select id,regexp_replace(str,'([^a-zA-Z]+)','') new_str
    > from test_reg;
Query ID = root_20201217154102_65e60007-0a8a-45d2-94de-bb0bfe11f9e8
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0305, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0305/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0305
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:41:08,696 Stage-1 map = 0%, reduce = 0%
2020-12-17 15:41:14,864 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.02 sec
2020-12-17 15:41:15,891 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.13 sec
MapReduce Total cumulative CPU time: 6 seconds 130 msec
Ended Job = job_1606698967173_0305
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.13 sec HDFS Read: 9145 HDFS Write: 281 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 130 msec
OK
id new_str
3 Hive
4 wuwuwuHive
5 Hive
1 Hive
2 HiveComing
Time taken: 14.205 seconds, Fetched: 5 row(s)
hive>
2.2.4 Extracting the letters and digits from a string
Code:

```sql
select id,regexp_replace(str,'([^a-zA-Z0-9]+)','') new_str
from test_reg;
```
Test run:
hive> select id,regexp_replace(str,'([^a-zA-Z0-9]+)','') new_str
    > from test_reg;
Query ID = root_20201217154722_738edcd7-4c2a-4aa1-a66a-acac268f199b
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0310, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0310/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0310
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2020-12-17 15:47:29,743 Stage-1 map = 0%, reduce = 0%
2020-12-17 15:47:36,950 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 6.13 sec
MapReduce Total cumulative CPU time: 6 seconds 130 msec
Ended Job = job_1606698967173_0310
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Cumulative CPU: 6.13 sec HDFS Read: 9123 HDFS Write: 291 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 130 msec
OK
id new_str
3 666Hive
4 wuwuwuHive
5 Hive6666
1 Hive
2 HiveComing666
Time taken: 15.686 seconds, Fetched: 5 row(s)
2.3 regexp_extract
Syntax:
regexp_extract(string subject, string pattern, int index)
Returns the part of subject captured by group index of the Java regular expression pattern (index 0 returns the whole match). For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'. The same caution about predefined character classes applies: '\s' as the second argument matches the letter s, while '\\s' is needed to match whitespace.
Test JSON data:

```sql
create table test_reg2(id int,str string);
insert into test_reg2 values (1,'{"filtertype":"29","filtername":"成人Node","filtertitle":"成人Group","filtersubtype":"","filterid":"29|1","filterValue":"1|4"}');
insert into test_reg2 values (2,'{"filtertitle":"钻级","filtertype":"16","filtersubtype":"","filtername":"四钻/高档","filterid":"16|4",}');
```
Code:

```sql
select id
      ,regexp_extract(str,'(filtertype"\\:")(\\d+)(",)',2) as filtertype
      ,regexp_extract(str,'(filtername"\\:")((\\W*\\w*)|(\\W*))(",)',2) as filtername
      ,regexp_extract(str,'(filtertitle"\\:")((\\W*\\w*)|(\\W*))(",)',2) as filtertitle
      ,regexp_extract(str,'(filterid"\\:")(\\d+\\|\\d+)(",)',2) as filterid
from test_reg2;
```
Test run:
hive> select id
    > ,regexp_extract(str,'(filtertype"\\:")(\\d+)(",)',2) as filtertype
    > ,regexp_extract(str,'(filtername"\\:")((\\W*\\w*)|(\\W*))(",)',2) as filtername
    > ,regexp_extract(str,'(filtertitle"\\:")((\\W*\\w*)|(\\W*))(",)',2) as filtertitle
    > ,regexp_extract(str,'(filterid"\\:")(\\d+\\|\\d+)(",)',2) as filterid
    > from test_reg2;
Query ID = root_20201217164358_dbda76fc-2bc1-41d6-b140-72f590b05502
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1606698967173_0314, Tracking URL = http://hp1:8088/proxy/application_1606698967173_0314/
Kill Command = /opt/cloudera/parcels/CDH-6.3.1-1.cdh6.3.1.p0.1470567/lib/hadoop/bin/hadoop job -kill job_1606698967173_0314
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2020-12-17 16:44:05,767 Stage-1 map = 0%, reduce = 0%
2020-12-17 16:44:12,987 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.93 sec
MapReduce Total cumulative CPU time: 2 seconds 930 msec
Ended Job = job_1606698967173_0314
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Cumulative CPU: 2.93 sec HDFS Read: 5038 HDFS Write: 205 HDFS EC Read: 0 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 930 msec
OK
id filtertype filtername filtertitle filterid
1 29 成人Node 成人Group 29|1
2 16 四钻/高档 钻级 16|4
Time taken: 15.265 seconds, Fetched: 2 row(s)
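The same group-2 extraction can be sketched with Python's `re`; the helper below is our minimal stand-in for regexp_extract, not a Hive API. For well-formed JSON, Hive's get_json_object is usually a more robust choice than regular expressions, but the second sample row here is not valid JSON (it has a trailing comma), which is exactly where the regex approach still works:

```python
import re

def regexp_extract(subject, pattern, idx):
    # Minimal stand-in for Hive's regexp_extract: return the text
    # captured by group idx, or '' when the pattern does not match.
    m = re.search(pattern, subject)
    return m.group(idx) if m else ''

s = ('{"filtertype":"29","filtername":"成人Node","filtertitle":"成人Group",'
     '"filtersubtype":"","filterid":"29|1","filterValue":"1|4"}')

assert regexp_extract(s, r'(filtertype":")(\d+)(",)', 2) == '29'
assert regexp_extract(s, r'(filterid":")(\d+\|\d+)(",)', 2) == '29|1'
```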
References:
1.https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-MathematicalFunctions
2.https://blog.csdn.net/weixin_37536446/article/details/81053172
3.https://www.cnblogs.com/db-record/p/11454325.html