Hive Optimization

The core idea of Hive optimization: treat Hive SQL as a MapReduce program and optimize it as such.

The following SQL will not be converted to MapReduce for execution:

SELECT statements that read only the table's own columns

WHERE clauses that filter only on the table's own columns

EXPLAIN shows the execution plan: EXPLAIN [EXTENDED] query

hive> explain extended select * from student;
OK
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: student
          Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: NONE
          GatherStats: false
          Select Operator
            expressions: id (type: int), name (type: string), likes (type: array<string>), address (type: map<string,string>)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 0.231 seconds, Fetched: 18 row(s)
========================================================
hive> explain extended select count(*) from student;
OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: student
            Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: COMPLETE
            GatherStats: false
            Select Operator
              Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: COMPLETE
              Group By Operator
                aggregations: count()
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  null sort order:
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  tag: -1
                  value expressions: _col0 (type: bigint)
                  auto parallelism: false
      Path -> Alias:
        hdfs://mycluster/user/hive/warehouse/student [student]
      Path -> Partition:
        hdfs://mycluster/user/hive/warehouse/student
          Partition
            base file name: student
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              bucket_count -1
              colelction.delim -
              column.name.delimiter ,
              columns id,name,likes,address
              columns.comments
              columns.types int:string:array<string>:map<string,string>
              field.delim ,
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://mycluster/user/hive/warehouse/student
              mapkey.delim :
              name default.student
              numFiles 1
              numRows 0
              rawDataSize 0
              serialization.ddl struct student { i32 id, string name, list<string> likes, map<string,string> address}
              serialization.format ,
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 618
              transient_lastDdlTime 1624695643
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                colelction.delim -
                column.name.delimiter ,
                columns id,name,likes,address
                columns.comments
                columns.types int:string:array<string>:map<string,string>
                field.delim ,
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://mycluster/user/hive/warehouse/student
                mapkey.delim :
                name default.student
                numFiles 1
                numRows 0
                rawDataSize 0
                serialization.ddl struct student { i32 id, string name, list<string> likes, map<string,string> address}
                serialization.format ,
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 618
                transient_lastDdlTime 1624695643
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.student
            name: default.student
      Truncated Path -> Alias:
        /student [student]
      Needs Tagging: false
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
          File Output Operator
            compressed: false
            GlobalTableId: 0
            directory: hdfs://mycluster/tmp/hive/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-26-39_045_671258483012844310-1/-mr-10001/.hive-staging_hive_2021-07-03_09-26-39_045_671258483012844310-1/-ext-10002
            NumFilesPerFileSink: 1
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
            Stats Publishing Key Prefix: hdfs://mycluster/tmp/hive/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-26-39_045_671258483012844310-1/-mr-10001/.hive-staging_hive_2021-07-03_09-26-39_045_671258483012844310-1/-ext-10002/
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                properties:
                  columns _col0
                  columns.types bigint
                  escape.delim \
                  hive.serialization.extend.additional.nesting.levels true
                  serialization.escape.crlf true
                  serialization.format 1
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            TotalFiles: 1
            GatherStats: false
            MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.183 seconds, Fetched: 121 row(s)

Hive Fetch Strategy

For certain queries, Hive can simply fetch the data and does not need to run a MapReduce computation.

set hive.fetch.task.conversion=none/more;

1. Hive's default fetch strategy is more; if the strategy is set to none, every query must run as a MapReduce job
hive> set hive.fetch.task.conversion;
hive.fetch.task.conversion=more
hive> set hive.fetch.task.conversion=none;
hive> select id,name from student;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703091722_537a3295-d844-49a4-93d7-9160cb501ba9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1625270715754_0005, Tracking URL = http://node03:8088/proxy/application_1625270715754_0005/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-07-03 09:17:34,420 Stage-1 map = 0%,  reduce = 0%
2021-07-03 09:17:44,689 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.46 sec
MapReduce Total cumulative CPU time: 1 seconds 460 msec
Ended Job = job_1625270715754_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.46 sec   HDFS Read: 4829 HDFS Write: 369 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 460 msec
OK
id  name
1   小红1
2   小红2
3   小红3
4   小红4
5   小红5
6   小红6
7   小红7
8   小红8
9   小红9
10  小红10
Time taken: 23.477 seconds, Fetched: 10 row(s)

Hive execution modes: local mode and cluster mode (default)

1. Local mode:

Enable local mode: set hive.exec.mode.local.auto=true;

Note: hive.exec.mode.local.auto.inputbytes.max defaults to 128 MB and is the maximum input size eligible for local mode; if the input exceeds this value, the query still runs in cluster mode!

1. Check the Hive execution mode (cluster by default)
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive>
2. Check the local-mode maximum input size (default 128 MB)
hive> set hive.exec.mode.local.auto.inputbytes.max;
hive.exec.mode.local.auto.inputbytes.max=134217728
hive>
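A minimal sketch of actually enabling local mode for a small query (the session above only inspects the defaults; the threshold value shown is the 128 MB default):

-- run small jobs in-process instead of submitting to the cluster
set hive.exec.mode.local.auto=true;
-- inputs larger than this many bytes still run in cluster mode
set hive.exec.mode.local.auto.inputbytes.max=134217728;
select id, name from student;  -- small input, so this should execute locally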

Parallel Execution

Enable parallel execution with: set hive.exec.parallel=true;

Note: hive.exec.parallel.thread.number is the maximum number of jobs allowed to run in parallel within one SQL computation

1. Parallel execution is off by default; the default number of parallel threads is 8
hive> set hive.exec.parallel;
hive.exec.parallel=false
hive> set hive.exec.parallel.thread.number;
hive.exec.parallel.thread.number=8
hive>
2. Running the query without parallel execution (the cartesian-product strict check is relaxed first)
hive> set hive.strict.checks.cartesian.product;
hive.strict.checks.cartesian.product=true
hive> set hive.strict.checks.cartesian.product=false;
hive> select t1.c1,t2.age from (select count(name) c1 from student) t1,(select count(*) age from student) t2;
Warning: Map Join MAPJOIN[21][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
Warning: Map Join MAPJOIN[22][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
Warning: Shuffle Join JOIN[14][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703094655_d4c2384a-17b5-48c7-b997-74bfda966179
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1625270715754_0006, Tracking URL = http://node03:8088/proxy/application_1625270715754_0006/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-07-03 09:47:07,363 Stage-1 map = 0%,  reduce = 0%
2021-07-03 09:47:17,905 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.8 sec
2021-07-03 09:47:28,908 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.32 sec
MapReduce Total cumulative CPU time: 3 seconds 320 msec
Ended Job = job_1625270715754_0006
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1625270715754_0007, Tracking URL = http://node03:8088/proxy/application_1625270715754_0007/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0007
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2021-07-03 09:47:44,525 Stage-3 map = 0%,  reduce = 0%
2021-07-03 09:47:54,739 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.07 sec
2021-07-03 09:48:02,938 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.56 sec
MapReduce Total cumulative CPU time: 3 seconds 560 msec
Ended Job = job_1625270715754_0007
Stage-7 is selected by condition resolver.
Stage-8 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
2021-07-03 09:48:15 Starting to launch local task to process map join;  maximum memory = 518979584
2021-07-03 09:48:17 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-46-55_555_9027395853760823729-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2021-07-03 09:48:17 Uploaded 1 File to: file:/tmp/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-46-55_555_9027395853760823729-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (278 bytes)
2021-07-03 09:48:17   End of local task; Time Taken: 1.544 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1625270715754_0008, Tracking URL = http://node03:8088/proxy/application_1625270715754_0008/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0008
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2021-07-03 09:48:28,131 Stage-4 map = 0%,  reduce = 0%
2021-07-03 09:48:35,279 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_1625270715754_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.32 sec   HDFS Read: 8497 HDFS Write: 114 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 3.56 sec   HDFS Read: 8383 HDFS Write: 114 SUCCESS
Stage-Stage-4: Map: 1   Cumulative CPU: 1.36 sec   HDFS Read: 4258 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 240 msec
OK
t1.c1   t2.age
10  10
Time taken: 100.836 seconds, Fetched: 1 row(s)
3. Enable parallel execution and rerun
set hive.exec.parallel=true;
hive> select t1.c1,t2.age from (select count(name) c1 from student) t1,(select count(*) age from student) t2;
Warning: Map Join MAPJOIN[21][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
Warning: Map Join MAPJOIN[22][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
Warning: Shuffle Join JOIN[14][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703095117_dc872c31-34d2-4f84-a7ae-d082e8bfa55d
Total jobs = 5
Launching Job 1 out of 5
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1625270715754_0010, Tracking URL = http://node03:8088/proxy/application_1625270715754_0010/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0010
Starting Job = job_1625270715754_0009, Tracking URL = http://node03:8088/proxy/application_1625270715754_0009/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2021-07-03 09:51:31,266 Stage-3 map = 0%,  reduce = 0%
2021-07-03 09:51:39,621 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.35 sec
2021-07-03 09:51:47,881 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 2.69 sec
MapReduce Total cumulative CPU time: 2 seconds 690 msec
Ended Job = job_1625270715754_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-07-03 09:52:02,620 Stage-1 map = 0%,  reduce = 0%
2021-07-03 09:52:10,052 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.47 sec
2021-07-03 09:52:18,494 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1625270715754_0010
Stage-7 is selected by condition resolver.
Stage-8 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
2021-07-03 09:52:31 Starting to launch local task to process map join;  maximum memory = 518979584
2021-07-03 09:52:33 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/bcba216b-18cc-4360-8df6-28741fc1277f/hive_2021-07-03_09-51-17_847_4787078123986205503-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2021-07-03 09:52:33 Uploaded 1 File to: file:/tmp/root/bcba216b-18cc-4360-8df6-28741fc1277f/hive_2021-07-03_09-51-17_847_4787078123986205503-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (278 bytes)
2021-07-03 09:52:33   End of local task; Time Taken: 1.538 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1625270715754_0011, Tracking URL = http://node03:8088/proxy/application_1625270715754_0011/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0011
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2021-07-03 09:52:43,157 Stage-4 map = 0%,  reduce = 0%
2021-07-03 09:52:50,333 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 1.34 sec
MapReduce Total cumulative CPU time: 1 seconds 340 msec
Ended Job = job_1625270715754_0011
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 2.69 sec   HDFS Read: 8383 HDFS Write: 114 SUCCESS
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.91 sec   HDFS Read: 8497 HDFS Write: 114 SUCCESS
Stage-Stage-4: Map: 1   Cumulative CPU: 1.34 sec   HDFS Read: 4258 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 940 msec
OK
t1.c1   t2.age
10  10
Time taken: 94.155 seconds, Fetched: 1 row(s)

Strict Mode

Enable strict mode with: set hive.mapred.mode=strict; (the default is nonstrict)

With strict mode enabled, the following restrictions apply to queries:

1. Queries on a partitioned table must include a WHERE filter on the partition column;

2. An ORDER BY statement must include a LIMIT clause;

3. Cartesian-product queries are restricted.

1. Strict mode is not enabled by default
hive> set hive.mapred.mode;
hive.mapred.mode is undefined
hive> set hive.mapred.mode=strict;
hive>
2. Queries on a partitioned table must include a WHERE filter on the partition column
hive> select * from student_dynamic_partition;
FAILED: SemanticException Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.large.query to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features. No partition predicate for Alias "student_dynamic_partition" Table "student_dynamic_partition"
hive> select * from student_dynamic_partition where age is not null;
OK
student_dynamic_partition.id    student_dynamic_partition.name  student_dynamic_partition.likes student_dynamic_partition.address   student_dynamic_partition.age   student_dynamic_partition.gender
3   小兰  ["吃鸡","book","movie"] {"chongqing":"renminglu","shenzheng":"futian"}  21  female
8   肚皮  ["walking","book","movie"]    {"nanchang":"renminglu","guangzhou":"niwan"}    21  man
1   小红  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
7   蓝宝  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
9   狗蛋  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
2   明明  ["王者","book","movie"] {"modu":"renminglu","xizhang":"lasha"}  22  man
5   悟空  ["walking","book","movie"]    {"modu":"renminglu","shenzheng":"futian"}   22  man
6   和尚  ["王者","book","movie"] {"nanchang":"renminglu","shenzheng":"futian"}   26  female
4   花花  ["王者","book","movie"] {"modu":"renminglu","dongguang":"changan"}  28  female
10  赵二  ["王者","book","movie"] {"shanghai":"renminglu","shenzheng":"futian"}   28  female
Time taken: 1.583 seconds, Fetched: 10 row(s)
hive>
3. An ORDER BY statement must include a LIMIT clause
hive> select * from student_dynamic_partition where age is not null order by age desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703100542_32ded58c-12e2-4e8d-8754-ae7b75cd51a3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
.........................................
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.75 sec   HDFS Read: 13632 HDFS Write: 554 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 750 msec
OK
student_dynamic_partition.id    student_dynamic_partition.name  student_dynamic_partition.likes student_dynamic_partition.address   student_dynamic_partition.age   student_dynamic_partition.gender
10  赵二  ["王者","book","movie"] {"shanghai":"renminglu","shenzheng":"futian"}   28  female
4   花花  ["王者","book","movie"] {"modu":"renminglu","dongguang":"changan"}  28  female
6   和尚  ["王者","book","movie"] {"nanchang":"renminglu","shenzheng":"futian"}   26  female
5   悟空  ["walking","book","movie"]    {"modu":"renminglu","shenzheng":"futian"}   22  man
9   狗蛋  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
Time taken: 32.284 seconds, Fetched: 5 row(s)

Hive Sorting

1. ORDER BY - a total ordering of the query result, handled by a single reducer. (Use with caution on large data sets; under strict mode it must be combined with LIMIT.) Not recommended.

2. SORT BY - sorts the data within each individual reducer.

3. DISTRIBUTE BY - partitions rows across reducers, and is often combined with SORT BY (each reducer's output is sorted on its own, but merging the partitions does not yield a global order, so DISTRIBUTE BY is added to control which rows each reducer receives).

4. CLUSTER BY - equivalent to DISTRIBUTE BY + SORT BY on the same column (CLUSTER BY cannot take ASC/DESC; use the distribute by column sort by column asc|desc form instead). Not recommended.

1. Using the distribute by column sort by column asc|desc form
hive>  select * from student_dynamic_partition where age is not null distribute by name sort by age desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703102044_a44490a9-46f7-4575-bfca-037a4be915f5
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
..........................................
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.03 sec   HDFS Read: 12365 HDFS Write: 578 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.18 sec   HDFS Read: 7954 HDFS Write: 554 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 210 msec
OK
student_dynamic_partition.id    student_dynamic_partition.name  student_dynamic_partition.likes student_dynamic_partition.address   student_dynamic_partition.age   student_dynamic_partition.gender
4   花花  ["王者","book","movie"] {"modu":"renminglu","dongguang":"changan"}  28  female
10  赵二  ["王者","book","movie"] {"shanghai":"renminglu","shenzheng":"futian"}   28  female
6   和尚  ["王者","book","movie"] {"nanchang":"renminglu","shenzheng":"futian"}   26  female
9   狗蛋  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
5   悟空  ["walking","book","movie"]    {"modu":"renminglu","shenzheng":"futian"}   22  man
Time taken: 62.222 seconds, Fetched: 5 row(s)

Hive Join (see the official documentation)

1. When computing a join, put the small (driving) table on the left side of the join.

2. Map join: the join completes on the map side. The small table (small in data size, not merely row count) is loaded into memory, and the large table is then streamed against it, so no join work remains for the reduce side. Two ways to use it:

A. SQL style: add a MapJoin hint to the statement

Syntax:
SELECT /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value
FROM smallTable JOIN bigTable ON smallTable.key = bigTable.key;

B. Automatic map join:

Enable automatic map join with: set hive.auto.convert.join=true; (when true, Hive gathers statistics on the left table and, if it qualifies as small, loads it into memory and uses a map join)

Other related parameters (a configuration sketch follows this list):

hive.mapjoin.smalltable.filesize - the threshold for telling small tables from large ones; a table below this size is loaded into memory

hive.ignore.mapjoin.hint - default true; whether to ignore the MapJoin hint

hive.auto.convert.join.noconditionaltask - default true; when converting common joins into map joins, whether to merge multiple map joins into a single one

hive.auto.convert.join.noconditionaltask.size - the maximum combined table size when merging multiple map joins into one
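A hedged sketch of a session that leans on these parameters; the threshold values are illustrative, and the tables orders and dim_city are hypothetical stand-ins for a big and a small table:

set hive.auto.convert.join=true;                             -- let Hive choose map joins automatically
set hive.mapjoin.smalltable.filesize=25000000;               -- tables under ~25 MB count as "small" (illustrative)
set hive.auto.convert.join.noconditionaltask=true;           -- allow merging several map joins into one
set hive.auto.convert.join.noconditionaltask.size=10000000;  -- combined size limit for the merged small tables (illustrative)
-- dim_city should be hashed into memory and joined on the map side
select o.id, c.city_name
from orders o join dim_city c on o.city_id = c.id;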

3. Use the same join key wherever possible (joins sharing a key are compiled into a single MapReduce job), for example:
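Here both joins share the key b.key1, so Hive compiles them into one MapReduce job; if the second join used a different key, a second job would be needed (tables a, b, c are hypothetical):

select a.val, b.val, c.val
from a
join b on (a.key = b.key1)
join c on (c.key = b.key1);  -- same join key -> one MapReduce job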

4. Joining a large table to a large table

Null-key filtering: a join sometimes times out because a handful of keys carry far too much data; all rows with the same key go to the same reducer, which then runs out of memory. Examine those keys carefully: in many cases they are abnormal data and should be filtered out in the SQL.

Null-key conversion: at other times a null key carries many rows that are perfectly valid and must appear in the join result. In that case, assign a random value to table a's null keys so the rows are spread evenly across the reducers. Both techniques are sketched below.
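Hedged sketches of both techniques, assuming two hypothetical large tables big_a and big_b joined on a string column id:

-- null-key filtering: the null keys are abnormal data, so drop them before the join
select a.*, b.*
from (select * from big_a where id is not null) a
join big_b b on a.id = b.id;

-- null-key conversion: the null-key rows are valid and must be kept, so replace
-- null with a random value that matches nothing, spreading those rows evenly
select a.*, b.*
from big_a a
left join big_b b
  on (case when a.id is null then concat('hive_null_', rand()) else a.id end) = b.id;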

Map-Side Aggregation

Enable aggregation on the map side (similar to a combiner in MapReduce) with: set hive.map.aggr=true;

Related parameters:

hive.groupby.mapaggr.checkinterval - how many rows a map-side GROUP BY processes per aggregation check (default 100000)

hive.map.aggr.hash.min.reduction - the minimum aggregation ratio (the first 100000 rows are pre-aggregated; if (rows after aggregation) / 100000 exceeds this value, default 0.5, map-side aggregation is abandoned)

hive.map.aggr.hash.percentmemory - the maximum share of memory the map-side aggregation may use

hive.map.aggr.hash.force.flush.memory.threshold - the maximum memory the map-side aggregation hash table may hold; exceeding it triggers a flush

hive.groupby.skewindata - whether to optimize GROUP BY data skew; default false
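A sketch of turning these knobs on before a GROUP BY; the values shown are the defaults described above, and the query reuses the student_dynamic_partition table from the earlier sessions:

set hive.map.aggr=true;                          -- pre-aggregate inside each map task
set hive.groupby.mapaggr.checkinterval=100000;   -- rows sampled before deciding whether to keep aggregating
set hive.map.aggr.hash.min.reduction=0.5;        -- abandon map-side aggregation if it shrinks data less than this
set hive.groupby.skewindata=true;                -- add an extra job that first spreads skewed groups randomly
select age, count(*) from student_dynamic_partition group by age;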

Merging Small Files

Reason: a large number of small files stresses the storage layer and puts pressure on HDFS, hurting efficiency

Merge-related settings:

1. Whether to merge map output files: hive.merge.mapfiles=true

2. Whether to merge reduce output files: hive.merge.mapredfiles=true

3. Size of the merged files: hive.merge.size.per.task=256*1000*1000
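A hedged session-level example of the three settings (256000000 is simply the product of the expression above):

set hive.merge.mapfiles=true;            -- merge small files left by map-only jobs
set hive.merge.mapredfiles=true;         -- merge small files left at the reduce side
set hive.merge.size.per.task=256000000;  -- target size, in bytes, of each merged file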

Distinct Counts

With small data volumes this hardly matters, but on large data sets a COUNT DISTINCT is executed by a single reduce task; that one reducer has to process far too much data, and the whole job struggles to finish. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by a COUNT.
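A hedged illustration of the rewrite, using the student table from the earlier sessions:

-- every distinct id funnels through one reducer:
select count(distinct id) from student;

-- equivalent rewrite: the GROUP BY spreads deduplication across many reducers,
-- and only the final count runs on a single reducer over already-small data
select count(*) from (select id from student group by id) t;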

Controlling the Number of Maps and Reduces in Hive

Parameters related to the number of maps:

1. mapred.max.split.size: the maximum size of one split, i.e. the most data a single map will process

2. mapred.min.split.size.per.node: the minimum split size on a single node

3. mapred.min.split.size.per.rack: the minimum split size on a single rack

Parameters related to the number of reduces:

1. mapred.reduce.tasks: force a fixed number of reduce tasks

2. hive.exec.reducers.bytes.per.reducer: the amount of data each reduce task processes

3. hive.exec.reducers.max: the maximum number of reducers per job
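A sketch of setting these for one session; every value below is illustrative, and byte-valued parameters are in bytes:

set mapred.max.split.size=256000000;           -- each map reads at most ~256 MB
set mapred.min.split.size.per.node=100000000;  -- merge node-local fragments smaller than this
set mapred.min.split.size.per.rack=100000000;  -- merge rack-local fragments smaller than this
set hive.exec.reducers.bytes.per.reducer=256000000;  -- aim for one reducer per ~256 MB
set hive.exec.reducers.max=32;                 -- hard cap on reducers per job
-- or force an exact reducer count (overrides the estimate):
set mapred.reduce.tasks=8;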

Hive - JVM Reuse

Use cases: too many small files, or too many tasks

How to set: set mapred.job.reuse.jvm.num.tasks=n; (n is the number of task slots)

Drawback: once enabled, the task slots hold onto their resources whether or not any task is running, and they are released only when every task of the job, i.e. the whole job, has finished.
