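11. Hive SQL Optimization

This chapter covers optimization only at the HQL level: the points to watch for in day-to-day HQL development. It does not cover Hadoop-level parameters or configuration tuning.

Most of the material comes from blog posts I published earlier, collected and tidied up here.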

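11.1 Use partition pruning and column pruning

In a SELECT, fetch only the columns you need, filter on the partition column whenever the table is partitioned, and avoid SELECT *.

With partition pruning and outer joins, be careful where the filter on the secondary (right-hand) table goes. If it is written in the WHERE clause, the full table is joined first and filtered afterwards. A WHERE condition on b also discards the rows of a that have no match, so the result is no longer a true left outer join. For example:

SELECT a.id
FROM lxw1234_a a
LEFT OUTER JOIN t_lxw1234_partitioned b
ON (a.id = b.url)
WHERE b.day = '2015-05-10';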

The correct approach is to put the condition in the ON clause:

SELECT a.id
FROM lxw1234_a a
LEFT OUTER JOIN t_lxw1234_partitioned b
ON (a.id = b.url AND b.day = '2015-05-10');

Or write it as a subquery:

SELECT a.id
FROM lxw1234_a a
LEFT OUTER JOIN (SELECT url FROM t_lxw1234_partitioned WHERE day = '2015-05-10') b
ON (a.id = b.url);
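One way to check whether a partition filter was actually used for pruning is to inspect the query's inputs before running it. The sketch below assumes a Hive version that supports EXPLAIN DEPENDENCY, whose output lists the input tables and input partitions of a query. If every partition of t_lxw1234_partitioned appears in that list, the filter did not prune anything.

```sql
-- Sketch: list the tables and partitions this query would read.
-- Assumes EXPLAIN DEPENDENCY is available in the Hive version in use.
EXPLAIN DEPENDENCY
SELECT a.id
FROM lxw1234_a a
LEFT OUTER JOIN t_lxw1234_partitioned b
ON (a.id = b.url AND b.day = '2015-05-10');
```

Running the same statement against the WHERE version of the query makes the difference between the two forms visible without launching a job.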

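11.2 Use COUNT DISTINCT sparingly

With small data volumes it does not matter. With large volumes, the COUNT DISTINCT work lands on a single Reduce task (or, when there is a GROUP BY, all rows of one group land on the same Reduce task). When that Reduce task has too much data to process, the whole job struggles to finish. COUNT DISTINCT is therefore usually replaced by a GROUP BY followed by a COUNT: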

SELECT day,
COUNT(DISTINCT id) AS uv
FROM lxw1234
GROUP BY day;

can be rewritten as:

SELECT day,
COUNT(id) AS uv
FROM (SELECT day, id FROM lxw1234 GROUP BY day, id) a
GROUP BY day;

This takes one extra job, but with large data volumes the cost is well worth it.

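11.3 Check for many-to-many joins

Whenever tables are joined, check whether the join is many-to-many. At a minimum, make sure the join key is unique in at least one of the tables or result sets.

If one join key value has a very large number of records, the Reduce task that key is assigned to receives a very large amount of data, and the job struggles to finish or never completes at all.

Avoid Cartesian products for the same reason: if one key carries a very large amount of data, the job is also hard to complete.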
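A quick way to do this check is to look for join-key values that occur more than once on each side of the join. The query below is a minimal sketch against lxw1234_a from the earlier examples, with id assumed to be the join key. Run the same check against the other table: if both sides return rows for the same key, the join is many-to-many on that key.

```sql
-- Sketch: find join-key values that are duplicated in lxw1234_a,
-- most frequent first. An empty result means id is unique in this table.
SELECT id,
       COUNT(*) AS cnt
FROM lxw1234_a
GROUP BY id
HAVING COUNT(*) > 1
ORDER BY cnt DESC
LIMIT 20;
```

The same output also shows which keys are heavy enough to cause skew, because the largest counts are listed first.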

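11.4 Use MapJoin appropriately

For the principle and mechanics of MapJoin, see part 10 of the [一起学Hive] series.

The size threshold for the small table in a MapJoin can be adjusted with a parameter.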
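As a rough illustration, the settings below let Hive convert a join to a MapJoin automatically when the small table is under a size threshold. The table lxw1234_small and its name column are hypothetical, and the threshold value is only an example. Depending on the Hive version and on hive.ignore.mapjoin.hint, the explicit MAPJOIN hint may be ignored in favor of automatic conversion.

```sql
-- Let Hive convert eligible joins to MapJoin automatically.
SET hive.auto.convert.join=true;
-- Size threshold in bytes below which a table counts as "small"
-- (example value; tune it to the memory available to each task).
SET hive.mapjoin.smalltable.filesize=25000000;

-- Explicit hint form. lxw1234_small is a hypothetical small dimension table.
SELECT /*+ MAPJOIN(b) */ a.id, b.name
FROM lxw1234_a a
JOIN lxw1234_small b
ON (a.id = b.id);
```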

11.5 Use UNION ALL appropriately

In my tests, a UNION ALL over the same table was much faster than a multi-insert.

The reason is that Hive optimizes this kind of UNION ALL so that the source table is scanned only once (see http://www.apacheserver.net/How-is-Union-All-optimized-in-Hive-at229466.htm). A multi-insert also scans the source only once, but because it has to insert into multiple partitions it does a lot of additional work, which makes it take much longer.

I encourage you to test and experiment during development.

The test table lxw_test3 has about 1.2 billion records.

UNION ALL took about 7 minutes:

create table lxw_test5 as
select type, popt_id, login_date
from (
  select 'm3_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01'
  union all
  select 'mn_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09'
  union all
  select 'm3_g_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='1'
  union all
  select 'm3_l_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='2'
  union all
  select 'm3_s_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='3'
  union all
  select 'm3_o_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='4'
  union all
  select 'mn_g_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='1'
  union all
  select 'mn_l_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='2'
  union all
  select 'mn_s_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='3'
  union all
  select 'mn_o_login' as type, popt_id, login_date
  from lxw_test3
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='4'
) x;

The multi-insert took about 25 minutes:

from lxw_test3
insert overwrite table lxw_test6 partition (flag = '1')
  select 'm3_login' as type, popt_id, login_date
  where login_date>='2012-02-01' and login_date<'2012-05-01'
insert overwrite table lxw_test6 partition (flag = '2')
  select 'mn_login' as type, popt_id, login_date
  where login_date>='2012-05-01' and login_date<='2012-05-09'
insert overwrite table lxw_test6 partition (flag = '3')
  select 'm3_g_login' as type, popt_id, login_date
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='1'
insert overwrite table lxw_test6 partition (flag = '4')
  select 'm3_l_login' as type, popt_id, login_date
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='2'
insert overwrite table lxw_test6 partition (flag = '5')
  select 'm3_s_login' as type, popt_id, login_date
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='3'
insert overwrite table lxw_test6 partition (flag = '6')
  select 'm3_o_login' as type, popt_id, login_date
  where login_date>='2012-02-01' and login_date<'2012-05-01' and apptypeid='4'
insert overwrite table lxw_test6 partition (flag = '7')
  select 'mn_g_login' as type, popt_id, login_date
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='1'
insert overwrite table lxw_test6 partition (flag = '8')
  select 'mn_l_login' as type, popt_id, login_date
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='2'
insert overwrite table lxw_test6 partition (flag = '9')
  select 'mn_s_login' as type, popt_id, login_date
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='3'
insert overwrite table lxw_test6 partition (flag = '10')
  select 'mn_o_login' as type, popt_id, login_date
  where login_date>='2012-05-01' and login_date<='2012-05-09' and apptypeid='4';

11.6 Run jobs in parallel

Anyone who has used Oracle RAC will know what parallel is for.

Parallel execution can greatly speed up a task, but it does not reduce the resources the task consumes.

Hive also has options for parallel execution. For details, see:

http://superlxw1234.iteye.com/blog/1703713

-- Turn on parallel execution of jobs.
set hive.exec.parallel=true;
-- Maximum degree of parallelism allowed for one SQL statement; the default is 8.
set hive.exec.parallel.thread.number=16;

Jobs generated from the same SQL statement are started in parallel when they do not depend on each other. For example:

from (
  select phone, to_phone, substr(to_phone,-1) as key
  from youni_contact4_lxw
  where youni_id='1'
  and length(to_phone) = 11
  and substr(to_phone,1,2) IN ('13','14','15','18')
  group by phone, to_phone, substr(to_phone,-1)
) t
insert overwrite table youni_contact41_lxw partition(pt='0')
  select phone, to_phone where key='0'
insert overwrite table youni_contact41_lxw partition(pt='1')
  select phone, to_phone where key='1'
insert overwrite table youni_contact41_lxw partition(pt='2')
  select phone, to_phone where key='2'
insert overwrite table youni_contact41_lxw partition(pt='3')
  select phone, to_phone where key='3'
insert overwrite table youni_contact41_lxw partition(pt='4')
  select phone, to_phone where key='4'
insert overwrite table youni_contact41_lxw partition(pt='5')
  select phone, to_phone where key='5'
insert overwrite table youni_contact41_lxw partition(pt='6')
  select phone, to_phone where key='6'
insert overwrite table youni_contact41_lxw partition(pt='7')
  select phone, to_phone where key='7'
insert overwrite table youni_contact41_lxw partition(pt='8')
  select phone, to_phone where key='8'
insert overwrite table youni_contact41_lxw partition(pt='9')
  select phone, to_phone where key='9';

This SQL generates 11 jobs. The first job builds the temporary result and all later jobs depend on it, so nothing runs in parallel at that stage. Once the first job completes, the remaining jobs are all started in parallel.

Run time comparison:

Parallel execution off: 35 minutes
Parallelism of 8: 10 minutes
Parallelism of 16: 6 minutes

Of course, the advantage only shows when the system has spare resources. Without free resources, the parallel jobs cannot start.

11.7 Use local MR

If the SQL run in Hive touches only a small amount of data, running it as local MR is much faster than submitting it to the Hadoop cluster. For details, see:

http://superlxw1234.iteye.com/blog/1703546

For example, submitted to the cluster:

hive> select 1 from dual;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201208151631_2040444, Tracking URL = http://jt.dc.sh-wgq.sdo.com:50030/jobdetails.jsp?jobid=job_201208151631_2040444
Kill Command = /home/hdfs/hadoop-current/bin/hadoop job  -Dmapred.job.tracker=10.133.10.103:50020 -kill job_201208151631_2040444
2012-10-23 10:55:17,646 Stage-1 map = 0%,  reduce = 0%
2012-10-23 10:55:27,807 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_201208151631_2040444
OK
1
Time taken: 17.853 seconds

The settings for local MR are:

-- Turn on local MR.
set hive.exec.mode.local.auto=true;
-- Maximum input size for local MR; local MR is used when the input is smaller than this value.
set hive.exec.mode.local.auto.inputbytes.max=50000000;
-- Maximum number of input files for local MR; local MR is used when the number of input files is smaller than this value.
set hive.exec.mode.local.auto.tasks.max=10;

Local MR is used only when all three conditions hold at the same time. In later Hive releases the last setting was replaced by hive.exec.mode.local.auto.input.files.max.

The same query run with local MR:

hive> select 1 from dual;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/liuxiaowen/liuxiaowen_20121023105757_31c966be-ee79-4c23-a467-648290b338ac.log
Job running in-process (local Hadoop)
2012-10-23 10:58:03,728 null map = 100%,  reduce = 0%
Ended Job = job_local_0001
OK
1
Time taken: 4.842 seconds

11.8 Use dynamic partitions appropriately

See part 6 of the [一起学Hive] series, on Hive dynamic partitions:

http://lxw1234.com/archives/2015/06/286.htm
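For readers who do not have the linked article at hand, a dynamic-partition insert looks roughly like the sketch below. The tables lxw1234_dp_src and lxw1234_dp_target and their columns are hypothetical, and the limit values are examples rather than recommendations. The partition column must come last in the SELECT list.

```sql
-- Enable dynamic partitioning; nonstrict mode allows every partition
-- column to be determined dynamically.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- Upper bounds on the number of partitions one statement may create
-- (example values).
SET hive.exec.max.dynamic.partitions=1000;
SET hive.exec.max.dynamic.partitions.pernode=100;

-- Hypothetical tables: lxw1234_dp_target is partitioned by day.
INSERT OVERWRITE TABLE lxw1234_dp_target PARTITION (day)
SELECT id,
       url,
       day
FROM lxw1234_dp_src
WHERE day >= '2015-05-01';
```

Keeping the WHERE range narrow limits how many partitions and output files a single statement creates, which is usually the main thing to watch with dynamic partitions.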

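11.9 Avoid data skew

Data skew is one of the biggest performance killers in Hive development.

Symptoms:

Job progress stays at 99% (or 100%) for a long time;

The job monitoring page shows that only a few reduce subtasks (one or several) have not finished;

The amount of data read and written locally is very large.

Operations that cause data skew:

GROUP BY, COUNT DISTINCT, JOIN

Causes:

Unevenly distributed keys

Characteristics of the business data itself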

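Some commonly used remedies for data skew are listed here.

Skew caused by COUNT DISTINCT and GROUP BY:

When there are many empty or NULL values, or one value has an exceptionally large number of records, filter that value out first and handle it separately at the end:

SELECT CAST(COUNT(DISTINCT imei) + 1 AS bigint)
FROM lxw1234
WHERE pt = '2012-05-28'
AND imei <> 'lxw1234';

For example, if on a given day the IMEI value 'lxw1234' is extremely frequent and I want the total number of distinct IMEIs, I can count the ones that are not 'lxw1234' and then add 1.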

Multiple COUNT DISTINCTs:

These are usually worked around with UNION ALL + ROW_NUMBER() + SUM + GROUP BY.
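The combination named above can be sketched as follows. This is one possible shape of the rewrite, not necessarily the exact form the author uses. It assumes that lxw1234 has both an imei and an id column alongside pt, and that the Hive version supports window functions. Each column to be counted becomes one branch of the UNION ALL, ROW_NUMBER() marks the first occurrence of each value, and SUM counts only those first occurrences.

```sql
-- Sketch: replace COUNT(DISTINCT imei), COUNT(DISTINCT id) per pt.
-- Assumes lxw1234 has columns pt, imei and id (hypothetical schema).
SELECT pt,
       SUM(CASE WHEN kind = 'imei' AND rn = 1 THEN 1 ELSE 0 END) AS imei_cnt,
       SUM(CASE WHEN kind = 'id'   AND rn = 1 THEN 1 ELSE 0 END) AS id_cnt
FROM (
    SELECT pt,
           kind,
           val,
           -- rn = 1 marks the first row of each distinct (pt, kind, val).
           ROW_NUMBER() OVER (PARTITION BY pt, kind, val ORDER BY val) AS rn
    FROM (
        -- One branch per column; cast so both branches have the same type.
        SELECT pt, 'imei' AS kind, CAST(imei AS STRING) AS val
        FROM lxw1234
        WHERE imei IS NOT NULL
        UNION ALL
        SELECT pt, 'id' AS kind, CAST(id AS STRING) AS val
        FROM lxw1234
        WHERE id IS NOT NULL
    ) u
) r
GROUP BY pt;
```

NULLs are filtered out in each branch because COUNT(DISTINCT ...) ignores them as well. The window function spreads the deduplication across many reducers by (pt, kind, val) instead of concentrating it by pt alone.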

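Skew caused by JOIN:

The join key contains many empty values or one special value, such as "NULL":

Handle the empty values separately and keep them out of the join;

Or append a random number to the empty or special value and use the result as the join key.

Joining columns of different data types:

Convert them to the same data type before joining.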
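The random-key and type-conversion remedies can be combined in one query. The sketch below uses hypothetical tables lxw1234_log and lxw1234_user joined on user_id, and it assumes that no real user_id starts with 'null_'. NULL keys are replaced by distinct random strings so they spread across reducers instead of piling onto one, and both sides are cast to STRING so the join compares the same data type.

```sql
-- Sketch: spread NULL join keys over many reducers and align key types.
-- lxw1234_log and lxw1234_user are hypothetical tables.
SELECT a.id,
       a.user_id,
       b.user_name
FROM (
    SELECT id,
           user_id,
           CASE WHEN user_id IS NULL
                -- Random placeholder; it never matches a real user_id.
                THEN CONCAT('null_', CAST(RAND() AS STRING))
                ELSE CAST(user_id AS STRING)
           END AS join_key
    FROM lxw1234_log
) a
LEFT OUTER JOIN lxw1234_user b
ON (a.join_key = CAST(b.user_id AS STRING));
```

Because the placeholder keys match nothing in lxw1234_user, the rows with a NULL user_id still come back with NULL in user_name, which is the same result the plain left outer join would give.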

11.10 Control the number of Map and Reduce tasks

See http://lxw1234.com/archives/2015/04/15.htm
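As a rough orientation for the linked article, these are settings commonly used to influence the number of Map and Reduce tasks on MapReduce-based Hive. The values are illustrative only, and the property names are the older mapred.* forms, which may differ on newer Hadoop versions.

```sql
-- Reduce side: Hive estimates the reducer count as
-- input size / bytes.per.reducer, capped by reducers.max.
SET hive.exec.reducers.bytes.per.reducer=256000000;
SET hive.exec.reducers.max=999;
-- Or force a fixed number of reducers for the next query:
-- SET mapred.reduce.tasks=50;

-- Map side: combine small files into larger splits so that
-- fewer Map tasks are launched.
SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
SET mapred.max.split.size=256000000;
```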

11.11 Compress intermediate results

See http://superlxw1234.iteye.com/blog/1741103

The codec class names below are the ones from the original post; the LZO class available on a cluster depends on the LZO library installed there.

LZO for intermediate data, Gzip for the final output:

set mapred.output.compress = true;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;
set mapred.output.compression.type = BLOCK;
set mapred.compress.map.output = true;
set mapred.map.output.compression.codec = org.apache.hadoop.io.compress.LzoCodec;
set hive.exec.compress.output = true;
set hive.exec.compress.intermediate = true;
set hive.intermediate.compression.codec = org.apache.hadoop.io.compress.LzoCodec;

LZO for intermediate data, final output not compressed:

set mapred.output.compress = true;
set mapred.output.compression.codec = org.apache.hadoop.io.compress.LzoCodec;
set mapred.output.compression.type = BLOCK;
set mapred.compress.map.output = true;
set mapred.map.output.compression.codec = org.apache.hadoop.io.compress.LzoCodec;
set hive.exec.compress.intermediate = true;
set hive.intermediate.compression.codec = org.apache.hadoop.io.compress.LzoCodec;

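11.12 Other

Watch how Hive jobs execute in the MapReduce web UI;
Understand how HQL is translated into MapReduce;
HQL optimization is ultimately MapReduce optimization. MapReduce is a distributed computing model, and the key to getting the most out of it is keeping the data evenly distributed across nodes. Otherwise a single unevenly loaded node drags the whole job down.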
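A simple way to see how a statement maps onto MapReduce is to run it through EXPLAIN before executing it. The example below reuses the COUNT DISTINCT query from section 11.2. The plan lists the stages, their dependencies, and the operators on the map and reduce sides of each stage.

```sql
-- Show the execution plan instead of running the query.
EXPLAIN
SELECT day,
       COUNT(DISTINCT id) AS uv
FROM lxw1234
GROUP BY day;
```

Comparing this plan with the plan of the GROUP BY rewrite from section 11.2 shows the extra stage that the rewrite introduces.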
