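In Hive, queries that join on fuzzy-match conditions are usually very inefficient, for example joins whose on clause chains conditions with or, or joins based on like pattern matching. Problems of this kind generally call for a targeted optimization.

For joins whose on clause contains several or conditions, the optimization described here rewrites the query as a union or union all.

1. Requirement

One day a colleague sitting next to me who does data analysis sent me a SQL statement. It had been running for more than 15 minutes with the progress bar not moving at all, and they asked me to help optimize it. The statement was: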

select list.u_type as `黑名单拉黑维度`,list.val as `拉黑的筛选值`,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`,reg.order_no as `订单号`,reg.hos_name as `医院名称`,reg.first_dept_name as `一级科室`,reg.second_dept_name as `二级科室`,reg.doctor_name as `医生名称`,split_part(split_part(reg.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`,reg.order_create_time as `挂号时间`,reg.treatment_dt as `就诊时间`,reg.order_status as `订单状态`,reg.product_price as `订单金额`,reg.patient_name as `就诊卡姓名`,reg.patient_cred_no as `身份证号`,reg.patient_phone as `预留手机号`,reg.order_ip as `ip`
from
(
select order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
) reg -- 15837934 rows
left join
(select u_type,val,op_log
from dw.fact_black_list
where id is not null
) list -- 402422 rows
on case when list.u_type='PHONE' then val else '非' end=reg.patient_phone or
case when list.u_type='IP' then val else '非' end=reg.order_ip or
case when list.u_type ='CARD_RISK' then val else '非' end=reg.patient_card_no or
case when list.u_type ='UID' then val else '非' end=reg.user_id
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;

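2. Problem analysis

Looking at this logic, my first impression was that the join itself was what hurt performance: the on clause chains several conditions in the form join on ... or ... or.

So I ran a few experiments to test this hypothesis:

(1) First I removed every condition after the first or, keeping only the single on match. The query finished quickly;

(2) With the on match plus one or condition, results came back noticeably slower than before;

(3) With the on match plus two or conditions, the query would not finish at all, and the progress bar simply stopped advancing.

This confirmed the guess: the or conditions in the join on ... or ... or match are what make the query so slow. The next step is to optimize the Hive query.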
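One way to see why the or conditions hurt is to compare execution plans. An equality condition in on gives the engine a join key to partition and hash rows on, whereas a disjunction over different columns offers no single key, so the engine generally has to fall back to comparing row pairs. The sketch below was not part of the original investigation; it assumes an engine version that accepts or inside the on clause, and the exact plan text varies by version and settings.

```sql
-- Plan for a single equality condition: expect an ordinary keyed join.
explain
select reg.order_no, list.u_type
from dw.aggr_reg_entity reg
left join dw.fact_black_list list
  on list.val = reg.patient_phone;

-- Plan with two different keys combined by or: look for a join operator
-- that has no join keys (a cross-product or nested-loop style join).
explain
select reg.order_no, list.u_type
from dw.aggr_reg_entity reg
left join dw.fact_black_list list
  on list.val = reg.patient_phone
  or list.val = reg.order_ip;
```

explain only prints the plan and does not run the query, so it is cheap to try even on the full tables.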

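3. Optimization

Idea: avoid the join on ... or ... or form by splitting it into separate joins, and keep the case ... when expressions out of the on clause wherever possible.

After splitting the join and combining the pieces with union, the query returned its data in 57 seconds.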

with base_entity as( -- 15837934 rows
select order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
),base_list as          -- 402422 rows
(select u_type,val,op_log ,case when u_type='PHONE' then val else '非' end aa,case when u_type='IP' then val else '非' end bb,case when u_type ='CARD_RISK' then val else '非' end cc,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list
where id is not null
)select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.aa=entity.patient_phone
union
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.bb=entity.order_ip
union
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.cc=entity.patient_card_no
union
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.dd=entity.user_id;

After splitting the join and combining the pieces with union all instead, the query finished in only 11 seconds.

with base_entity as( -- 15837934 rows
select order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
),base_list as          -- 402422 rows
(select u_type,val,op_log ,case when u_type='PHONE' then val else '非' end aa,case when u_type='IP' then val else '非' end bb,case when u_type ='CARD_RISK' then val else '非' end cc,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list
where id is not null
)select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.aa=entity.patient_phone
union all
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.bb=entity.order_ip
union all
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.cc=entity.patient_card_no
union all
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.dd=entity.user_id
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;
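
The gap between the two timings is consistent with what the two operators do: union removes duplicate rows from the combined result, which needs an extra deduplication pass over all output rows, while union all simply concatenates the branches. A minimal illustration on literal rows (not part of the original query):

```sql
-- union all keeps every row from every branch: this returns two rows.
select 1 as x
union all
select 1 as x;

-- union (the same as union distinct) deduplicates: this returns one row.
select 1 as x
union
select 1 as x;
```

Because the two forms can return different row counts, switching from union to union all is only safe when duplicates across branches are acceptable or are removed in a later step.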

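However, this approach is clearly verbose. If many or conditions need matching, say n of them, the same logic has to be repeated with union all n-1 times; the code becomes quite cumbersome, and performance suffers as well.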
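
One way to avoid repeating the full select list in every branch, offered here as a sketch rather than as the approach used in this post, is to unpivot each order's keys into (u_type, val) rows and join once on both columns with plain equality. The example runs on inline toy rows; the sample values are invented, and lateral view explode(map(...)) is Hive syntax that other engines may not accept.

```sql
with base_entity as (        -- toy stand-in for the order table
  select 'o1' as order_no, '13800000000' as patient_phone, '10.0.0.1' as order_ip
  union all
  select 'o2' as order_no, '13900000000' as patient_phone, '10.0.0.2' as order_ip
),
base_list as (               -- toy stand-in for the blacklist
  select 'PHONE' as u_type, '13800000000' as val
  union all
  select 'IP' as u_type, '10.0.0.2' as val
),
entity_keys as (
  -- One output row per (order, key type): the map keys reuse the
  -- blacklist's u_type codes so both sides can be compared directly.
  select e.order_no, t.u_type, t.val
  from base_entity e
  lateral view explode(map('PHONE', e.patient_phone,
                           'IP',    e.order_ip)) t as u_type, val
)
-- A single equi-join on (u_type, val) replaces the chain of or conditions.
-- Expected matches on the toy rows: o1 via PHONE, o2 via IP.
select k.order_no, l.u_type, l.val
from entity_keys k
join base_list l
  on l.u_type = k.u_type
 and l.val = k.val;
```

Adding another match type then means adding one key/value pair to the map. Note that this sketch uses an inner join, so orders with no blacklist hit are not returned.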

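For this problem there is also a more elegant option. Joining with or simply means that a row matches as soon as any one of the fields in base_entity is found in base_list. For this kind of match-on-any-hit join, Hive's locate() function naturally comes to mind as an efficient way to implement it. For details on the function, see the following article:
https://blog.csdn.net/godlovedaniel/article/details/125126193

String search functions instr and locate in Hive (奔跑者-辉's blog, CSDN)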
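
The post does not show the locate()-based query itself, so the following is only a guess at how the idea might be applied, on invented inline rows. locate(substr, str) returns the 1-based position of substr in str, or 0 if it is absent; here each order's fields are concatenated into one delimited string, and a blacklist value counts as a hit if it occurs in that string. The sketch assumes the engine permits cartesian products.

```sql
-- locate returns a 1-based position, or 0 when the substring is absent.
select locate('bar', 'foobarbar') as pos_found,    -- 4
       locate('xyz', 'foobarbar') as pos_missing;  -- 0

with base_entity as (        -- toy stand-in for the order table
  select 'o1' as order_no, '13800000000' as patient_phone, '10.0.0.1' as order_ip
  union all
  select 'o2' as order_no, '13900000000' as patient_phone, '10.0.0.2' as order_ip
),
base_list as (               -- toy stand-in for the blacklist
  select 'PHONE' as u_type, '13800000000' as val
  union all
  select 'IP' as u_type, '10.0.0.2' as val
)
select e.order_no, l.u_type, l.val
from base_entity e
cross join base_list l
-- Delimiters on both sides keep a value from matching inside a longer one.
where locate(concat('|', l.val, '|'),
             concat('|', e.patient_phone, '|', e.order_ip, '|')) > 0;
```

Unlike the case ... when conditions, this sketch does not check u_type, so a value blacklisted under one type would also hit if it appeared in a different field. It still compares every order against every blacklist row, so whether it is actually faster should be measured on the real data.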

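4. Summary

To avoid or-based fuzzy-match joins in Hive, you can rewrite them with union all, or use locate()-based fuzzy matching. The locate() approach keeps the code concise and elegant and is widely useful in Hive, so it is well worth mastering.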
