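1. Requirement

One day a colleague who does data analysis sent me a SQL statement. It had been running for more than 15 minutes with the query progress bar stuck, and they asked me to help optimize it. The statement is as follows: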

select list.u_type as `黑名单拉黑维度`,list.val as `拉黑的筛选值`,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`,reg.order_no as `订单号`,reg.hos_name as `医院名称`,reg.first_dept_name as `一级科室`,reg.second_dept_name as `二级科室`,reg.doctor_name as `医生名称`,split_part(split_part(reg.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`,reg.order_create_time as `挂号时间`,reg.treatment_dt as `就诊时间`,reg.order_status as `订单状态`,reg.product_price as `订单金额`,reg.patient_name as `就诊卡姓名`,reg.patient_cred_no as `身份证号`,reg.patient_phone as `预留手机号`,reg.order_ip as `ip`
from
(
select order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
) reg -- 15,837,934 rows
left join
(select u_type,val,op_log
from dw.fact_black_list
where id is not null
) list -- 402,422 rows
on case when list.u_type='PHONE' then val else '非' end=reg.patient_phone or
case when list.u_type='IP' then val else '非' end=reg.order_ip or
case when list.u_type ='CARD_RISK' then val else '非' end=reg.patient_card_no or
case when list.u_type ='UID' then val else '非' end=reg.user_id
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;

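2. Problem analysis

Looking at this logic, my first impression was that the join itself was hurting query performance, since the join condition chains several predicates in the form join on ... or ... or.

So I ran a few experiments to test this hypothesis:

(1) First, I removed every condition after the first or and kept only the plain on match; the query finished very quickly.

(2) With the on match plus one or condition, results came back noticeably slower than before.

(3) With the on match plus two or conditions, the query could not get anywhere; the progress bar simply stalled.

This confirmed the hypothesis: the or conditions in the join on ... or ... or match were what made the query so slow. The next step was to optimize the Hive query.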
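
A rough way to see why the or conditions hurt so much: a join condition built from or'd predicates has no single key to hash or sort on, so the engine may have to fall back to comparing row pairs. The sketch below only does the arithmetic on the row counts quoted in the query comments. It assumes a worst-case cross product; the actual plan depends on the Hive version and settings, and `candidate_pairs` is a made-up helper name.

```python
# Row counts quoted in the comments of the original query.
REG_ROWS = 15_837_934   # filtered dw.aggr_reg_entity
LIST_ROWS = 402_422     # filtered dw.fact_black_list


def candidate_pairs(left_rows: int, right_rows: int) -> int:
    """Row pairs a cross-product style join would have to evaluate."""
    return left_rows * right_rows


pairs = candidate_pairs(REG_ROWS, LIST_ROWS)
# An equi-join on one key only needs to touch each input row roughly once.
equi_join_rows = REG_ROWS + LIST_ROWS

print(f"cross product: {pairs:,} pairs")
print(f"equi-join inputs: {equi_join_rows:,} rows")
```

On these numbers the worst case is in the trillions of comparisons, which is consistent with a query that never seems to progress.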

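3. Optimization

Approach: avoid the join on ... or ... or pattern by splitting it into separate joins, and keep the case ... when expressions out of the on clause where possible.

After splitting the query and combining the parts with union, it finished in 57 seconds.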

with base_entity as ( -- 15,837,934 rows
select order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
),base_list as ( -- 402,422 rows
select u_type,val,op_log
,case when u_type='PHONE' then val else '非' end aa
,case when u_type='IP' then val else '非' end bb
,case when u_type ='CARD_RISK' then val else '非' end cc
,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list
where id is not null
)
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.aa=entity.patient_phone
union
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.bb=entity.order_ip
union
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.cc=entity.patient_card_no
union
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.dd=entity.user_id;
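
To check that splitting an or'd join into per-key joins keeps the same matches, the idea can be tried on a toy dataset. The sketch below uses Python's built-in sqlite3 as a stand-in for Hive; the tables `reg` and `black_list`, their rows, and the helper `rows` are invented for illustration, and the join conditions filter on u_type directly instead of using the '非' sentinel. It also shows one thing to watch for: with left join in every branch, each branch emits the orders it did not match, so the unioned result can contain extra NULL rows that the single or'd left join does not.

```python
import sqlite3

# Toy stand-ins for the two tables; names, columns and rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table reg (order_no text, patient_phone text, order_ip text);
    create table black_list (u_type text, val text);
    insert into reg values
        ('o1', '13800000001', '10.0.0.1'),
        ('o2', '13800000002', '10.0.0.2'),
        ('o3', '13800000003', '10.0.0.3');
    insert into black_list values
        ('PHONE', '13800000001'),
        ('IP',    '10.0.0.2');
""")

OR_CONDITION = """
       (b.u_type = 'PHONE' and b.val = r.patient_phone)
    or (b.u_type = 'IP'    and b.val = r.order_ip)
"""


def rows(sql):
    """Run a query and return its rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# Inner join: one or'd condition versus one equi-join per key, unioned.
inner_or = rows(
    f"select r.order_no, b.u_type from reg r join black_list b on {OR_CONDITION}"
)
inner_union = rows("""
    select r.order_no, b.u_type from reg r
    join black_list b on b.val = r.patient_phone where b.u_type = 'PHONE'
    union
    select r.order_no, b.u_type from reg r
    join black_list b on b.val = r.order_ip where b.u_type = 'IP'
""")

# Left join: every branch also emits the orders it failed to match.
left_or = rows(
    f"select r.order_no, b.u_type from reg r left join black_list b on {OR_CONDITION}"
)
left_union = rows("""
    select r.order_no, b.u_type from reg r
    left join (select * from black_list where u_type = 'PHONE') b
      on b.val = r.patient_phone
    union
    select r.order_no, b.u_type from reg r
    left join (select * from black_list where u_type = 'IP') b
      on b.val = r.order_ip
""")

print("inner join results equal:", set(inner_or) == set(inner_union))
print("left join row counts:", len(left_or), "vs", len(left_union))
```

On this toy data the inner-join versions agree, while the left-join union returns more rows than the or'd left join. Whether those extra NULL rows matter depends on how the result is consumed, so it is worth comparing row counts between the old and new query before switching.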

After splitting the query and combining the parts with union all instead, it finished in just 11 seconds.

with base_entity as ( -- 15,837,934 rows
select order_no,hos_name,first_dept_name,second_dept_name,doctor_name,pay_flow_info,order_create_time,treatment_dt,order_status,product_price,patient_name,patient_cred_no,patient_phone,order_ip,patient_card_no,user_id,wx_openid
from
dw.aggr_reg_entity
where month>='2022-01' and order_status='TOKEN'
),base_list as ( -- 402,422 rows
select u_type,val,op_log
,case when u_type='PHONE' then val else '非' end aa
,case when u_type='IP' then val else '非' end bb
,case when u_type ='CARD_RISK' then val else '非' end cc
,case when u_type ='UID' then val else '非' end dd
from dw.fact_black_list
where id is not null
)
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.aa=entity.patient_phone
union all
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.bb=entity.order_ip
union all
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.cc=entity.patient_card_no
union all
select list.u_type as `黑名单拉黑维度`
,list.val as `拉黑的筛选值`
,split_part(split_part(list.op_log,'"desc":"',2),'"',1) as `拉黑原因`
,entity.order_no as `订单号`
,entity.hos_name as `医院名称`
,entity.first_dept_name as `一级科室`
,entity.second_dept_name as `二级科室`
,entity.doctor_name as `医生名称`
,split_part(split_part(entity.pay_flow_info,'"WECHAT_OPEN_ID":"',2),'"',1) as `支付open_id`
,entity.order_create_time as `挂号时间`
,entity.treatment_dt as `就诊时间`
,entity.order_status as `订单状态`
,entity.product_price as `订单金额`
,entity.patient_name as `就诊卡姓名`
,entity.patient_cred_no as `身份证号`
,entity.patient_phone as `预留手机号`
,entity.order_ip as `ip`
from base_entity entity
left join base_list list
on list.dd=entity.user_id
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
;
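
The two variants differ in more than speed: union removes duplicate rows across all branches, while union all keeps every row. A group by written after the last branch is also part of that last select only, under standard SQL scoping, so it does not deduplicate the whole result unless the union is wrapped in a subquery. The sketch below illustrates this with sqlite3 as a stand-in; the tables `a` and `b` are invented, and it is worth confirming the same behavior on your Hive version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table a (x integer);
    create table b (x integer);
    insert into a values (1), (1), (2);
    insert into b values (2), (3), (3);
""")


def rows(sql):
    """Run a query and return its rows as a list of tuples."""
    return conn.execute(sql).fetchall()


# union deduplicates across both branches.
union_rows = rows("select x from a union select x from b")

# union all simply concatenates the branches.
union_all_rows = rows("select x from a union all select x from b")

# A trailing group by belongs to the last select only:
# rows from a are kept as-is, only rows from b are grouped.
trailing_group = rows("select x from a union all select x from b group by x")

# Wrapping the union in a subquery groups the combined result.
wrapped_group = rows(
    "select x from (select x from a union all select x from b) t group by x"
)

for name, result in [
    ("union", union_rows),
    ("union all", union_all_rows),
    ("union all + trailing group by", trailing_group),
    ("group by over wrapped union all", wrapped_group),
]:
    print(f"{name}: {len(result)} rows")
```

If the goal is a fully deduplicated result, the wrapped form is the one that matches what union produces; the extra shuffle it needs is part of why union is slower than a bare union all.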

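However, this approach is clearly verbose. If many or conditions need to be matched, say n of them, the same logic has to be repeated and joined with union all n-1 times, which makes the code tedious to read, and performance suffers as well.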
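
One way to tame the repetition, when the branches differ only in the key being matched, is to generate the SQL text from a small mapping. This is only a sketch: `KEY_COLUMNS`, `BRANCH_TEMPLATE` and `build_union_all` are hypothetical names, the select list is shortened to three columns, and each branch uses an inner join with a u_type filter rather than the left join and '非' sentinel from the queries above.

```python
# Black-list dimension -> entity column it is matched against,
# taken from the join conditions in the queries above.
KEY_COLUMNS = {
    "PHONE": "patient_phone",
    "IP": "order_ip",
    "CARD_RISK": "patient_card_no",
    "UID": "user_id",
}

# One branch of the union; only the u_type and the column vary.
BRANCH_TEMPLATE = """select list.u_type, list.val, entity.order_no
from base_entity entity
join base_list list
  on list.val = entity.{column}
where list.u_type = '{u_type}'"""


def build_union_all(key_columns: dict) -> str:
    """Render one equi-join branch per key and glue them with union all."""
    branches = [
        BRANCH_TEMPLATE.format(u_type=u_type, column=column)
        for u_type, column in key_columns.items()
    ]
    return "\nunion all\n".join(branches)


sql = build_union_all(KEY_COLUMNS)
print(sql)
```

Adding a fifth matching dimension then means adding one entry to the mapping instead of copying another twenty-line select.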

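There is also a more elegant way to handle this. With or-connected conditions, a row counts as matched as soon as any of the base_entity fields is found in base_list. For this kind of match-if-found join, an efficient option in Hive is the locate() function. For details on this function, see the following article:
https://blog.csdn.net/godlovedaniel/article/details/125126193 (Hive string search functions instr and locate, CSDN blog by 奔跑者-辉)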
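
The article does not show how locate() would be applied to this query, so the sketch below only illustrates the function's semantics, not the author's implementation. It mimics Hive's locate(substr, str[, pos]) in Python: a 1-based position of the first occurrence, or 0 when the substring is absent. The Python function `locate`, the `|` separator and the sample values are assumptions; NULL handling is not modeled and pos is assumed to be at least 1.

```python
def locate(substr: str, s: str, pos: int = 1) -> int:
    """Mimic Hive's locate(substr, str[, pos]) for pos >= 1.

    Returns the 1-based index of the first occurrence of substr in s
    at or after position pos, or 0 if substr does not occur.
    """
    return s.find(substr, pos - 1) + 1


# An order's matchable fields joined into one string (made-up values).
order_keys = "|".join(["13800000002", "10.0.0.21"])

phone_hit = locate("13800000002", order_keys) > 0   # exact field present
ip_hit = locate("10.0.0.2", order_keys) > 0         # only a prefix of 10.0.0.21
miss = locate("10.9.9.9", order_keys) > 0

print(phone_hit, ip_hit, miss)
```

Note that `ip_hit` is true even though the order's IP is 10.0.0.21, not 10.0.0.2: substring search is a looser test than equality, so a locate()-based match may need delimiters around each value or a follow-up equality check to avoid false positives.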

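4. Summary

To avoid or-based matching in Hive join conditions, you can rewrite the join with union all, or use locate()-based matching. The latter keeps the code concise, is widely useful in Hive, and is well worth mastering.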
