Hive Optimization

The core idea of Hive optimization: treat Hive SQL as a MapReduce program and optimize it as such.

The following SQL will not be converted to MapReduce for execution:

SELECT statements that read only the table's own columns

WHERE clauses that filter only on the table's own columns

EXPLAIN shows the execution plan: EXPLAIN [EXTENDED] query

hive> explain extended select * from student;
OK
Explain
STAGE DEPENDENCIES:
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        TableScan
          alias: student
          Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: NONE
          GatherStats: false
          Select Operator
            expressions: id (type: int), name (type: string), likes (type: array<string>), address (type: map<string,string>)
            outputColumnNames: _col0, _col1, _col2, _col3
            Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: NONE
            ListSink

Time taken: 0.231 seconds, Fetched: 18 row(s)
========================================================
hive> explain extended select count(*) from student;
OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: student
            Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: COMPLETE
            GatherStats: false
            Select Operator
              Statistics: Num rows: 1 Data size: 618 Basic stats: COMPLETE Column stats: COMPLETE
              Group By Operator
                aggregations: count()
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  null sort order:
                  sort order:
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  tag: -1
                  value expressions: _col0 (type: bigint)
                  auto parallelism: false
      Path -> Alias:
        hdfs://mycluster/user/hive/warehouse/student [student]
      Path -> Partition:
        hdfs://mycluster/user/hive/warehouse/student
          Partition
            base file name: student
            input format: org.apache.hadoop.mapred.TextInputFormat
            output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
            properties:
              bucket_count -1
              colelction.delim -
              column.name.delimiter ,
              columns id,name,likes,address
              columns.comments
              columns.types int:string:array<string>:map<string,string>
              field.delim ,
              file.inputformat org.apache.hadoop.mapred.TextInputFormat
              file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              location hdfs://mycluster/user/hive/warehouse/student
              mapkey.delim :
              name default.student
              numFiles 1
              numRows 0
              rawDataSize 0
              serialization.ddl struct student { i32 id, string name, list<string> likes, map<string,string> address}
              serialization.format ,
              serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              totalSize 618
              transient_lastDdlTime 1624695643
            serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              properties:
                bucket_count -1
                colelction.delim -
                column.name.delimiter ,
                columns id,name,likes,address
                columns.comments
                columns.types int:string:array<string>:map<string,string>
                field.delim ,
                file.inputformat org.apache.hadoop.mapred.TextInputFormat
                file.outputformat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                location hdfs://mycluster/user/hive/warehouse/student
                mapkey.delim :
                name default.student
                numFiles 1
                numRows 0
                rawDataSize 0
                serialization.ddl struct student { i32 id, string name, list<string> likes, map<string,string> address}
                serialization.format ,
                serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                totalSize 618
                transient_lastDdlTime 1624695643
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: default.student
            name: default.student
      Truncated Path -> Alias:
        /student [student]
      Needs Tagging: false
      Reduce Operator Tree:
        Group By Operator
          aggregations: count(VALUE._col0)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
          File Output Operator
            compressed: false
            GlobalTableId: 0
            directory: hdfs://mycluster/tmp/hive/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-26-39_045_671258483012844310-1/-mr-10001/.hive-staging_hive_2021-07-03_09-26-39_045_671258483012844310-1/-ext-10002
            NumFilesPerFileSink: 1
            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
            Stats Publishing Key Prefix: hdfs://mycluster/tmp/hive/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-26-39_045_671258483012844310-1/-mr-10001/.hive-staging_hive_2021-07-03_09-26-39_045_671258483012844310-1/-ext-10002/
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                properties:
                  columns _col0
                  columns.types bigint
                  escape.delim \
                  hive.serialization.extend.additional.nesting.levels true
                  serialization.escape.crlf true
                  serialization.format 1
                  serialization.lib org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            TotalFiles: 1
            GatherStats: false
            MultiFileSpray: false

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

Time taken: 0.183 seconds, Fetched: 121 row(s)

Hive Fetch Strategy

For certain queries, Hive can simply fetch the data and does not need to run a MapReduce computation.

set hive.fetch.task.conversion=none/more;

1. Hive's default fetch strategy is more; if the strategy is set to none, every query must run as a MapReduce job
hive> set hive.fetch.task.conversion;
hive.fetch.task.conversion=more
hive> set hive.fetch.task.conversion=none;
hive> select id,name from student;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703091722_537a3295-d844-49a4-93d7-9160cb501ba9
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1625270715754_0005, Tracking URL = http://node03:8088/proxy/application_1625270715754_0005/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0005
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2021-07-03 09:17:34,420 Stage-1 map = 0%,  reduce = 0%
2021-07-03 09:17:44,689 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.46 sec
MapReduce Total cumulative CPU time: 1 seconds 460 msec
Ended Job = job_1625270715754_0005
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1   Cumulative CPU: 1.46 sec   HDFS Read: 4829 HDFS Write: 369 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 460 msec
OK
id  name
1   小红1
2   小红2
3   小红3
4   小红4
5   小红5
6   小红6
7   小红7
8   小红8
9   小红9
10  小红10
Time taken: 23.477 seconds, Fetched: 10 row(s)

Hive execution modes: local mode and cluster mode (default)

1. Local mode:

Enable local mode: set hive.exec.mode.local.auto=true;

Note: hive.exec.mode.local.auto.inputbytes.max defaults to 128 MB and is the maximum input size eligible for local mode; if the input exceeds this value, the query still runs in cluster mode!

1. Check the Hive execution mode (cluster by default)
hive> set hive.exec.mode.local.auto;
hive.exec.mode.local.auto=false
hive>
2. Check the local-mode maximum input size (default 128 MB)
hive> set hive.exec.mode.local.auto.inputbytes.max;
hive.exec.mode.local.auto.inputbytes.max=134217728
hive>
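A minimal sketch of actually enabling local mode for a small query (the session above only inspects the defaults; the threshold value shown is the 128 MB default):

-- run small jobs in-process instead of submitting to the cluster
set hive.exec.mode.local.auto=true;
-- inputs larger than this many bytes still run in cluster mode
set hive.exec.mode.local.auto.inputbytes.max=134217728;
select id, name from student;  -- small input, so this should execute locally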

Parallel Execution

Enable parallel execution with: set hive.exec.parallel=true;

Note: hive.exec.parallel.thread.number is the maximum number of jobs allowed to run in parallel within one SQL computation

1. Parallel execution is off by default; the default number of parallel threads is 8
hive> set hive.exec.parallel;
hive.exec.parallel=false
hive> set hive.exec.parallel.thread.number;
hive.exec.parallel.thread.number=8
hive>
2. Running the query without parallel execution (the cartesian-product strict check is relaxed first)
hive> set hive.strict.checks.cartesian.product;
hive.strict.checks.cartesian.product=true
hive> set hive.strict.checks.cartesian.product=false;
hive> select t1.c1,t2.age from (select count(name) c1 from student) t1,(select count(*) age from student) t2;
Warning: Map Join MAPJOIN[21][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
Warning: Map Join MAPJOIN[22][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
Warning: Shuffle Join JOIN[14][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703094655_d4c2384a-17b5-48c7-b997-74bfda966179
Total jobs = 5
Launching Job 1 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1625270715754_0006, Tracking URL = http://node03:8088/proxy/application_1625270715754_0006/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-07-03 09:47:07,363 Stage-1 map = 0%,  reduce = 0%
2021-07-03 09:47:17,905 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.8 sec
2021-07-03 09:47:28,908 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.32 sec
MapReduce Total cumulative CPU time: 3 seconds 320 msec
Ended Job = job_1625270715754_0006
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1625270715754_0007, Tracking URL = http://node03:8088/proxy/application_1625270715754_0007/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0007
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2021-07-03 09:47:44,525 Stage-3 map = 0%,  reduce = 0%
2021-07-03 09:47:54,739 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 2.07 sec
2021-07-03 09:48:02,938 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 3.56 sec
MapReduce Total cumulative CPU time: 3 seconds 560 msec
Ended Job = job_1625270715754_0007
Stage-7 is selected by condition resolver.
Stage-8 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
2021-07-03 09:48:15 Starting to launch local task to process map join;  maximum memory = 518979584
2021-07-03 09:48:17 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-46-55_555_9027395853760823729-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2021-07-03 09:48:17 Uploaded 1 File to: file:/tmp/root/e7e5657c-f366-4391-a117-3838e2f530ba/hive_2021-07-03_09-46-55_555_9027395853760823729-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (278 bytes)
2021-07-03 09:48:17   End of local task; Time Taken: 1.544 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1625270715754_0008, Tracking URL = http://node03:8088/proxy/application_1625270715754_0008/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0008
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2021-07-03 09:48:28,131 Stage-4 map = 0%,  reduce = 0%
2021-07-03 09:48:35,279 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 1.36 sec
MapReduce Total cumulative CPU time: 1 seconds 360 msec
Ended Job = job_1625270715754_0008
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.32 sec   HDFS Read: 8497 HDFS Write: 114 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 3.56 sec   HDFS Read: 8383 HDFS Write: 114 SUCCESS
Stage-Stage-4: Map: 1   Cumulative CPU: 1.36 sec   HDFS Read: 4258 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 8 seconds 240 msec
OK
t1.c1   t2.age
10  10
Time taken: 100.836 seconds, Fetched: 1 row(s)
3. Enable parallel execution and rerun
set hive.exec.parallel=true;
hive> select t1.c1,t2.age from (select count(name) c1 from student) t1,(select count(*) age from student) t2;
Warning: Map Join MAPJOIN[21][bigTable=?] in task 'Stage-4:MAPRED' is a cross product
Warning: Map Join MAPJOIN[22][bigTable=?] in task 'Stage-5:MAPRED' is a cross product
Warning: Shuffle Join JOIN[14][tables = [$hdt$_0, $hdt$_1]] in Stage 'Stage-2:MAPRED' is a cross product
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703095117_dc872c31-34d2-4f84-a7ae-d082e8bfa55d
Total jobs = 5
Launching Job 1 out of 5
Launching Job 2 out of 5
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1625270715754_0010, Tracking URL = http://node03:8088/proxy/application_1625270715754_0010/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0010
Starting Job = job_1625270715754_0009, Tracking URL = http://node03:8088/proxy/application_1625270715754_0009/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0009
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2021-07-03 09:51:31,266 Stage-3 map = 0%,  reduce = 0%
2021-07-03 09:51:39,621 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.35 sec
2021-07-03 09:51:47,881 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 2.69 sec
MapReduce Total cumulative CPU time: 2 seconds 690 msec
Ended Job = job_1625270715754_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2021-07-03 09:52:02,620 Stage-1 map = 0%,  reduce = 0%
2021-07-03 09:52:10,052 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.47 sec
2021-07-03 09:52:18,494 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 2.91 sec
MapReduce Total cumulative CPU time: 2 seconds 910 msec
Ended Job = job_1625270715754_0010
Stage-7 is selected by condition resolver.
Stage-8 is filtered out by condition resolver.
Stage-2 is filtered out by condition resolver.
2021-07-03 09:52:31 Starting to launch local task to process map join;  maximum memory = 518979584
2021-07-03 09:52:33 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/root/bcba216b-18cc-4360-8df6-28741fc1277f/hive_2021-07-03_09-51-17_847_4787078123986205503-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2021-07-03 09:52:33 Uploaded 1 File to: file:/tmp/root/bcba216b-18cc-4360-8df6-28741fc1277f/hive_2021-07-03_09-51-17_847_4787078123986205503-1/-local-10006/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (278 bytes)
2021-07-03 09:52:33   End of local task; Time Taken: 1.538 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 4 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1625270715754_0011, Tracking URL = http://node03:8088/proxy/application_1625270715754_0011/
Kill Command = /opt/software/hadoop-2.10.1/bin/hadoop job  -kill job_1625270715754_0011
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2021-07-03 09:52:43,157 Stage-4 map = 0%,  reduce = 0%
2021-07-03 09:52:50,333 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 1.34 sec
MapReduce Total cumulative CPU time: 1 seconds 340 msec
Ended Job = job_1625270715754_0011
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 2.69 sec   HDFS Read: 8383 HDFS Write: 114 SUCCESS
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 2.91 sec   HDFS Read: 8497 HDFS Write: 114 SUCCESS
Stage-Stage-4: Map: 1   Cumulative CPU: 1.34 sec   HDFS Read: 4258 HDFS Write: 105 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 940 msec
OK
t1.c1   t2.age
10  10
Time taken: 94.155 seconds, Fetched: 1 row(s)

Strict Mode

Enable strict mode with: set hive.mapred.mode=strict; (the default is nonstrict)

With strict mode enabled, the following restrictions apply to queries:

1. Queries on a partitioned table must include a WHERE filter on the partition column;

2. An ORDER BY statement must include a LIMIT clause;

3. Cartesian-product queries are restricted.

1. Strict mode is not enabled by default
hive> set hive.mapred.mode;
hive.mapred.mode is undefined
hive> set hive.mapred.mode=strict;
hive>
2. Queries on a partitioned table must include a WHERE filter on the partition column
hive> select * from student_dynamic_partition;
FAILED: SemanticException Queries against partitioned tables without a partition filter are disabled for safety reasons. If you know what you are doing, please set hive.strict.checks.large.query to false and that hive.mapred.mode is not set to 'strict' to proceed. Note that if you may get errors or incorrect results if you make a mistake while using some of the unsafe features. No partition predicate for Alias "student_dynamic_partition" Table "student_dynamic_partition"
hive> select * from student_dynamic_partition where age is not null;
OK
student_dynamic_partition.id    student_dynamic_partition.name  student_dynamic_partition.likes student_dynamic_partition.address   student_dynamic_partition.age   student_dynamic_partition.gender
3   小兰  ["吃鸡","book","movie"] {"chongqing":"renminglu","shenzheng":"futian"}  21  female
8   肚皮  ["walking","book","movie"]    {"nanchang":"renminglu","guangzhou":"niwan"}    21  man
1   小红  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
7   蓝宝  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
9   狗蛋  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
2   明明  ["王者","book","movie"] {"modu":"renminglu","xizhang":"lasha"}  22  man
5   悟空  ["walking","book","movie"]    {"modu":"renminglu","shenzheng":"futian"}   22  man
6   和尚  ["王者","book","movie"] {"nanchang":"renminglu","shenzheng":"futian"}   26  female
4   花花  ["王者","book","movie"] {"modu":"renminglu","dongguang":"changan"}  28  female
10  赵二  ["王者","book","movie"] {"shanghai":"renminglu","shenzheng":"futian"}   28  female
Time taken: 1.583 seconds, Fetched: 10 row(s)
hive>
3. An ORDER BY statement must include a LIMIT clause
hive> select * from student_dynamic_partition where age is not null order by age desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703100542_32ded58c-12e2-4e8d-8754-ae7b75cd51a3
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
.........................................
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.75 sec   HDFS Read: 13632 HDFS Write: 554 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 750 msec
OK
student_dynamic_partition.id    student_dynamic_partition.name  student_dynamic_partition.likes student_dynamic_partition.address   student_dynamic_partition.age   student_dynamic_partition.gender
10  赵二  ["王者","book","movie"] {"shanghai":"renminglu","shenzheng":"futian"}   28  female
4   花花  ["王者","book","movie"] {"modu":"renminglu","dongguang":"changan"}  28  female
6   和尚  ["王者","book","movie"] {"nanchang":"renminglu","shenzheng":"futian"}   26  female
5   悟空  ["walking","book","movie"]    {"modu":"renminglu","shenzheng":"futian"}   22  man
9   狗蛋  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
Time taken: 32.284 seconds, Fetched: 5 row(s)

Hive Sorting

1. ORDER BY - a total ordering of the query result, handled by a single reducer. (Use with caution on large data sets; under strict mode it must be combined with LIMIT.) Not recommended.

2. SORT BY - sorts the data within each individual reducer.

3. DISTRIBUTE BY - partitions rows across reducers, and is often combined with SORT BY (each reducer's output is sorted on its own, but merging the partitions does not yield a global order, so DISTRIBUTE BY is added to control which rows each reducer receives).

4. CLUSTER BY - equivalent to DISTRIBUTE BY + SORT BY on the same column (CLUSTER BY cannot take ASC/DESC; use the distribute by column sort by column asc|desc form instead). Not recommended.

1. Using the distribute by column sort by column asc|desc form
hive>  select * from student_dynamic_partition where age is not null distribute by name sort by age desc limit 5;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = root_20210703102044_a44490a9-46f7-4575-bfca-037a4be915f5
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
..........................................
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 3.03 sec   HDFS Read: 12365 HDFS Write: 578 SUCCESS
Stage-Stage-2: Map: 1  Reduce: 1   Cumulative CPU: 3.18 sec   HDFS Read: 7954 HDFS Write: 554 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 210 msec
OK
student_dynamic_partition.id    student_dynamic_partition.name  student_dynamic_partition.likes student_dynamic_partition.address   student_dynamic_partition.age   student_dynamic_partition.gender
4   花花  ["王者","book","movie"] {"modu":"renminglu","dongguang":"changan"}  28  female
10  赵二  ["王者","book","movie"] {"shanghai":"renminglu","shenzheng":"futian"}   28  female
6   和尚  ["王者","book","movie"] {"nanchang":"renminglu","shenzheng":"futian"}   26  female
9   狗蛋  ["王者","book","movie"] {"modu":"renminglu","shenzheng":"futian"}   22  female
5   悟空  ["walking","book","movie"]    {"modu":"renminglu","shenzheng":"futian"}   22  man
Time taken: 62.222 seconds, Fetched: 5 row(s)

Hive Join (see the official documentation)

1. When computing a join, put the small (driving) table on the left side of the join.

2. Map join: the join completes on the map side. The small table (small in data size, not merely row count) is loaded into memory, and the large table is then streamed against it, so no join work remains for the reduce side. Two ways to use it:

A. SQL style: add a MapJoin hint to the statement

Syntax:
SELECT /*+ MAPJOIN(smallTable) */ smallTable.key, bigTable.value
FROM smallTable JOIN bigTable ON smallTable.key = bigTable.key;

B. Automatic map join:

Enable automatic map join with: set hive.auto.convert.join=true; (when true, Hive gathers statistics on the left table and, if it qualifies as small, loads it into memory and uses a map join)

Other related parameters (a configuration sketch follows this list):

hive.mapjoin.smalltable.filesize - the threshold for telling small tables from large ones; a table below this size is loaded into memory

hive.ignore.mapjoin.hint - default true; whether to ignore the MapJoin hint

hive.auto.convert.join.noconditionaltask - default true; when converting common joins into map joins, whether to merge multiple map joins into a single one

hive.auto.convert.join.noconditionaltask.size - the maximum combined table size when merging multiple map joins into one
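A hedged sketch of a session that leans on these parameters; the threshold values are illustrative, and the tables orders and dim_city are hypothetical stand-ins for a big and a small table:

set hive.auto.convert.join=true;                             -- let Hive choose map joins automatically
set hive.mapjoin.smalltable.filesize=25000000;               -- tables under ~25 MB count as "small" (illustrative)
set hive.auto.convert.join.noconditionaltask=true;           -- allow merging several map joins into one
set hive.auto.convert.join.noconditionaltask.size=10000000;  -- combined size limit for the merged small tables (illustrative)
-- dim_city should be hashed into memory and joined on the map side
select o.id, c.city_name
from orders o join dim_city c on o.city_id = c.id;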

3. Use the same join key wherever possible (joins sharing a key are compiled into a single MapReduce job), for example:
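Here both joins share the key b.key1, so Hive compiles them into one MapReduce job; if the second join used a different key, a second job would be needed (tables a, b, c are hypothetical):

select a.val, b.val, c.val
from a
join b on (a.key = b.key1)
join c on (c.key = b.key1);  -- same join key -> one MapReduce job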

4. Joining a large table to a large table

Null-key filtering: a join sometimes times out because a handful of keys carry far too much data; all rows with the same key go to the same reducer, which then runs out of memory. Examine those keys carefully: in many cases they are abnormal data and should be filtered out in the SQL.

Null-key conversion: at other times a null key carries many rows that are perfectly valid and must appear in the join result. In that case, assign a random value to table a's null keys so the rows are spread evenly across the reducers. Both techniques are sketched below.
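Hedged sketches of both techniques, assuming two hypothetical large tables big_a and big_b joined on a string column id:

-- null-key filtering: the null keys are abnormal data, so drop them before the join
select a.*, b.*
from (select * from big_a where id is not null) a
join big_b b on a.id = b.id;

-- null-key conversion: the null-key rows are valid and must be kept, so replace
-- null with a random value that matches nothing, spreading those rows evenly
select a.*, b.*
from big_a a
left join big_b b
  on (case when a.id is null then concat('hive_null_', rand()) else a.id end) = b.id;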

Map-Side Aggregation

Enable aggregation on the map side (similar to a combiner in MapReduce) with: set hive.map.aggr=true;

Related parameters:

hive.groupby.mapaggr.checkinterval - how many rows a map-side GROUP BY processes per aggregation check (default 100000)

hive.map.aggr.hash.min.reduction - the minimum aggregation ratio (the first 100000 rows are pre-aggregated; if (rows after aggregation) / 100000 exceeds this value, default 0.5, map-side aggregation is abandoned)

hive.map.aggr.hash.percentmemory - the maximum share of memory the map-side aggregation may use

hive.map.aggr.hash.force.flush.memory.threshold - the maximum memory the map-side aggregation hash table may hold; exceeding it triggers a flush

hive.groupby.skewindata - whether to optimize GROUP BY data skew; default false
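A sketch of turning these knobs on before a GROUP BY; the values shown are the defaults described above, and the query reuses the student_dynamic_partition table from the earlier sessions:

set hive.map.aggr=true;                          -- pre-aggregate inside each map task
set hive.groupby.mapaggr.checkinterval=100000;   -- rows sampled before deciding whether to keep aggregating
set hive.map.aggr.hash.min.reduction=0.5;        -- abandon map-side aggregation if it shrinks data less than this
set hive.groupby.skewindata=true;                -- add an extra job that first spreads skewed groups randomly
select age, count(*) from student_dynamic_partition group by age;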

Merging Small Files

Reason: a large number of small files stresses the storage layer and puts pressure on HDFS, hurting efficiency

Merge-related settings:

1. Whether to merge map output files: hive.merge.mapfiles=true

2. Whether to merge reduce output files: hive.merge.mapredfiles=true

3. Size of the merged files: hive.merge.size.per.task=256*1000*1000
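A hedged session-level example of the three settings (256000000 is simply the product of the expression above):

set hive.merge.mapfiles=true;            -- merge small files left by map-only jobs
set hive.merge.mapredfiles=true;         -- merge small files left at the reduce side
set hive.merge.size.per.task=256000000;  -- target size, in bytes, of each merged file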

Distinct Counts

With small data volumes this hardly matters, but on large data sets a COUNT DISTINCT is executed by a single reduce task; that one reducer has to process far too much data, and the whole job struggles to finish. COUNT DISTINCT is therefore generally replaced by a GROUP BY followed by a COUNT.
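A hedged illustration of the rewrite, using the student table from the earlier sessions:

-- every distinct id funnels through one reducer:
select count(distinct id) from student;

-- equivalent rewrite: the GROUP BY spreads deduplication across many reducers,
-- and only the final count runs on a single reducer over already-small data
select count(*) from (select id from student group by id) t;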

Controlling the Number of Maps and Reduces in Hive

Parameters related to the number of maps:

1. mapred.max.split.size: the maximum size of one split, i.e. the most data a single map will process

2. mapred.min.split.size.per.node: the minimum split size on a single node

3. mapred.min.split.size.per.rack: the minimum split size on a single rack

Parameters related to the number of reduces:

1. mapred.reduce.tasks: force a fixed number of reduce tasks

2. hive.exec.reducers.bytes.per.reducer: the amount of data each reduce task processes

3. hive.exec.reducers.max: the maximum number of reducers per job
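A sketch of setting these for one session; every value below is illustrative, and byte-valued parameters are in bytes:

set mapred.max.split.size=256000000;           -- each map reads at most ~256 MB
set mapred.min.split.size.per.node=100000000;  -- merge node-local fragments smaller than this
set mapred.min.split.size.per.rack=100000000;  -- merge rack-local fragments smaller than this
set hive.exec.reducers.bytes.per.reducer=256000000;  -- aim for one reducer per ~256 MB
set hive.exec.reducers.max=32;                 -- hard cap on reducers per job
-- or force an exact reducer count (overrides the estimate):
set mapred.reduce.tasks=8;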

Hive - JVM Reuse

Use cases: too many small files, or too many tasks

How to set: set mapred.job.reuse.jvm.num.tasks=n; (n is the number of task slots)

Drawback: once enabled, the task slots hold onto their resources whether or not any task is running, and they are released only when every task of the job, i.e. the whole job, has finished.
