Contents

1. The MapReduce process in diagrams
2. Map: splitting the input files
3. The ring buffer
- 3.1 How it works
- 3.2 Production tuning
  - 3.2.1 mapreduce.task.io.sort.mb (default: 100 MB)
  - 3.2.2 mapreduce.map.sort.spill.percent (default: 0.80)
4. Partition and sort before data is spilled to disk
- 4.1 How it works
- 4.2 Production tuning
5. Spill to disk
- 5.1 How it works
- 5.2 Production tuning parameters
  - 5.2.1 mapreduce.task.io.sort.factor (default: 10)
  - 5.2.2 mapreduce.map.combine.minspills (default: 3)
  - 5.2.3 Snappy compression for map output in production
6. Shuffle
- 6.1 The Reducer source comments show Shuffle --> Sort --> SecondarySort
- 6.2 Copy -- reduce fetches the map-side output
- 6.3 MergeSort -- the key part: three kinds of merge
- 6.4 Production tuning parameters
  - 6.4.1 mapreduce.job.reduce.slowstart.completedmaps (default: 0.05)
  - 6.4.2 mapreduce.reduce.shuffle.parallelcopies (default: 5)
  - 6.4.3 mapreduce.reduce.shuffle.read.timeout (default: 180000 ms)
  - 6.4.4 mapreduce.reduce.shuffle.input.buffer.percent (default: 0.70)
  - 6.4.5 mapreduce.reduce.shuffle.merge.percent (default: 0.66)
7. Reduce
- 7.1 About Reduce
- 7.2 Production tuning parameters
  - 7.2.1 mapreduce.reduce.memory.mb (default: 1024 MB)
  - 7.2.2 mapred.child.java.opts (default: -Xmx200m)
  - 7.2.3 Output compression on the reduce side: Snappy or Lzo

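1. The MapReduce process in diagrams

Two good diagrams come first and the detailed explanation follows. Looking at the diagrams again after reading the explanation gives a much deeper understanding.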

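2. Map: splitting the input files

First, the files on HDFS are divided into input splits, and the number of splits determines the number of map tasks. What exactly decides how many maps there are? See my other blog post:

   https://blog.csdn.net/lihuazaizheli/article/details/107580462

// Configuration parameters; for a detailed source-code walkthrough, see the blog post above
mapreduce.input.fileinputformat.split.minsize // minimum split size for a map, default 0 bytes
mapreduce.input.fileinputformat.split.maxsize // maximum split size for a map, default unlimited
dfs.blocksize  // HDFS block size, default 128 MB (dfs.block.size is the deprecated name)
Formula: splitSize = Math.max(minSize, Math.min(maxSize, blockSize));

The diagram above shows two maps, whose output goes into the ring buffer.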
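To make the split-size formula concrete, here is a minimal sketch in plain Java. The class and method names are illustrative and not part of Hadoop; only the formula itself comes from the text above.

```java
// Illustrative sketch of the split-size formula; not Hadoop's own class.
final class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    // splitSize = max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 128 * MB;
        // Defaults (min = 0, max unbounded): one split per block.
        System.out.println(computeSplitSize(0L, Long.MAX_VALUE, block) / MB);       // 128
        // maxSize below the block size: smaller splits, so more maps.
        System.out.println(computeSplitSize(0L, 64 * MB, block) / MB);              // 64
        // minSize above the block size: larger splits, so fewer maps.
        System.out.println(computeSplitSize(256 * MB, Long.MAX_VALUE, block) / MB); // 256
    }
}
```

In other words, lowering maxsize is the way to get more maps and raising minsize is the way to get fewer.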

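3. The ring buffer

3.1 How it works

Map output is first serialized and written into an in-memory buffer (100 MB by default). When the data in the buffer reaches the threshold (by default 100 MB * 0.8 = 80 MB), a background thread starts the spill. A spill writes the buffered data to disk.

# Understanding the ring buffer
https://blog.csdn.net/qq_35468937/article/details/80669834
# Understanding the map-side spill
https://www.cnblogs.com/yesecangqiong/p/6283140.html
# The MapReduce process in detail and its performance optimization
https://www.jianshu.com/p/9e4d01b74600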

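3.2 Production tuning

3.2.1 mapreduce.task.io.sort.mb (default: 100 MB)

Official description: The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks.

Adjust this to the hardware, memory size in particular. A larger value means fewer spills to disk, which cuts disk I/O and speeds up map processing; when memory is plentiful, this usually improves performance noticeably. When changing this parameter, also check the JVM heap size of the map task and increase the heap if necessary, because the buffer is allocated from that heap.

<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>300</value>
  <description>Size of the shuffle ring buffer in MB; default 100</description>
</property>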

3.2.2 mapreduce.map.sort.spill.percent (default: 0.80)

Official description: The soft limit in the serialization buffer. Once reached, a thread will begin to spill the contents to disk in the background. Note that collection will not block if this threshold is exceeded while a spill is already in progress, so spills may be larger than this threshold when it is set to less than .5

By default a spill to disk starts once 80% of the buffer is in use. Raising this value reduces the number of spills.

<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.88</value>
  <description>Spill threshold of the ring buffer; default 0.80</description>
</property>
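The effect of these two settings can be estimated with simple arithmetic. The sketch below is a simplification under stated assumptions: it ignores the per-record metadata that also occupies the buffer, so the spill count is only a rough lower bound, and the 1000 MB map output is a made-up figure. The class and method names are illustrative.

```java
// Rough arithmetic for the ring-buffer spill threshold; names are illustrative.
final class SpillThresholdSketch {
    // Buffer size (MB) times the spill fraction, rounded to whole MB.
    static long spillThresholdMb(int sortMb, double spillPercent) {
        return Math.round(sortMb * spillPercent);
    }

    // Lower-bound estimate: serialized map output divided by the threshold, rounded up.
    static long estimatedSpills(long mapOutputMb, int sortMb, double spillPercent) {
        long threshold = spillThresholdMb(sortMb, spillPercent);
        return (mapOutputMb + threshold - 1) / threshold;
    }

    public static void main(String[] args) {
        System.out.println(spillThresholdMb(100, 0.80));      // 80  (defaults)
        System.out.println(spillThresholdMb(300, 0.88));      // 264 (tuned values above)
        // Hypothetical map task producing 1000 MB of serialized output.
        System.out.println(estimatedSpills(1000, 100, 0.80)); // 13
        System.out.println(estimatedSpills(1000, 300, 0.88)); // 4
    }
}
```

Under these assumptions the tuned values cut the number of spill files for that map task from about 13 to about 4.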

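4. Partition and sort before data is spilled to disk

4.1 How it works

A custom partitioner and a custom comparator make the principle clear.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/**
 * Custom partitioner: extending Partitioner is all it takes.
 * Traffic is the job's own map output key type.
 */
public class PhonePartitioner extends Partitioner<Traffic, Text> {
    @Override
    public int getPartition(Traffic traffic, Text text, int numPartitions) {
        // Route records by phone-number prefix; this needs at least 3 reduce tasks.
        String phone = text.toString();
        if (phone.startsWith("13")) {
            return 0;
        } else if (phone.startsWith("15")) {
            return 1;
        } else {
            return 2;
        }
    }
}

Custom sorting is done by implementing the WritableComparable interface or by extending the WritableComparator class.

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * Groups records by id in ascending order.
 * Order is the job's own key type and must implement WritableComparable.
 */
public class OrderGroupingComparator extends WritableComparator {
    // The constructor must call the parent constructor so that key instances are created.
    public OrderGroupingComparator() {
        super(Order.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        Order order1 = (Order) a;
        Order order2 = (Order) b;
        int result;
        if (order1.getId() > order2.getId()) {
            result = 1;
        } else if (order1.getId() < order2.getId()) {
            result = -1;
        } else {
            result = 0;
        }
        return result;
    }
}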

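4.2 Production tuning

In the diagram above, the purple, green and orange data spilled to disk are three partitions, each sorted in ascending natural order by default. Custom partitioning and sorting need no tuning at this stage; it is enough to understand that the default partitioner assigns partitions by hashing the key.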
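The hashing rule is short enough to show in full. The sketch below uses the same formula as Hadoop's default HashPartitioner, but as a standalone class with no Hadoop dependency; the class name and the sample phone numbers are illustrative.

```java
// Standalone sketch of hash partitioning; the formula matches Hadoop's HashPartitioner.
final class HashPartitionSketch {
    static int partitionFor(Object key, int numReduceTasks) {
        // Clear the sign bit so a negative hashCode cannot produce a negative partition.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Made-up keys; equal keys always land in the same partition.
        for (String phone : new String[] {"13800000000", "15900000000", "18600000000"}) {
            System.out.println(phone + " -> partition " + partitionFor(phone, 3));
        }
    }
}
```

Because the partition depends only on the key's hash, a few very frequent keys send most records to the same reducer, which is where data skew comes from.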

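5. Spill to disk

5.1 How it works

As described earlier, spilling the ring buffer to disk is controlled by two parameters. This section covers the merge.

While a map task runs, it keeps producing spill files. Before the map task finishes, these spill files are merged into one large file that is partitioned and sorted. This process is the merge. In the diagram, the data in the blue, purple and orange partitions is combined into one large file.

One important optimization in the merge is local aggregation with a Combiner.

A spill goes to disk, and writing data to disk naturally brings compression to mind. When the data volume is large, the map output should be compressed. To enable compression, set mapreduce.map.output.compress to true and choose the algorithm with mapreduce.map.output.compress.codec.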

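5.2 Production tuning parameters

5.2.1 mapreduce.task.io.sort.factor (default: 10)

Official description: The number of streams to merge at once while sorting files. This determines the number of open file handles.

This parameter is the maximum number of spill files that can be merged at the same time. With 100 spill files, the merge cannot finish in a single pass; raising mapreduce.task.io.sort.factor (default: 10) reduces the number of merge passes and therefore the disk activity.

Evaluation: in production this parameter usually does not need adjusting.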
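A simplified model shows why 100 spill files need more than one pass at the default factor. This sketch assumes each pass merges groups of up to `factor` files into one file each; Hadoop's real merger chooses pass sizes differently to minimize disk writes, so treat the numbers as an illustration only. The class and method names are illustrative.

```java
// Simplified model of multi-pass merging; not Hadoop's actual merge scheduling.
final class MergeRoundsSketch {
    static int mergeRounds(int spillFiles, int factor) {
        if (factor < 2) {
            throw new IllegalArgumentException("factor must be at least 2");
        }
        int rounds = 0;
        int files = spillFiles;
        while (files > 1) {
            // Each round merges groups of up to `factor` files into one file each.
            files = (files + factor - 1) / factor;
            rounds++;
        }
        return rounds;
    }

    public static void main(String[] args) {
        System.out.println(mergeRounds(100, 10));  // 2: 100 -> 10 -> 1
        System.out.println(mergeRounds(100, 100)); // 1: everything merged at once
        System.out.println(mergeRounds(8, 10));    // 1: already within the factor
    }
}
```

The trade-off is that a larger factor keeps more files open at once, which is why the official description mentions open file handles.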

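5.2.2 mapreduce.map.combine.minspills (default: 3)

The Combiner runs in the same JVM as the map task. Other blogs describe it as controlled by min.num.spills.for.combine, default 3: when a map task has produced at least three spill files, the combiner runs again while those spill files are merged, which reduces the data finally written to disk.

Evaluation: this tuning parameter comes from other bloggers. It can no longer be found under that name in the official Hadoop 2.x documentation; Hadoop's deprecated-properties table maps it to mapreduce.map.combine.minspills, the name used in this heading. The only related parameter I found in the configuration reference is the following:

mapreduce.task.combine.progress.records (default: 10000)
Official description: The number of records to process during combine output collection before sending a progress notification.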

5.2.3 Snappy compression for map output in production

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
  <description>Compress map output</description>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>Compression codec class. Options: org.apache.hadoop.io.compress.DefaultCodec (the default), org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec, com.hadoop.compression.lzo.LzoCodec, com.hadoop.compression.lzo.LzopCodec, org.apache.hadoop.io.compress.Lz4Codec, org.apache.hadoop.io.compress.SnappyCodec</description>
</property>

Evaluation: common compression formats in big data include Snappy, Lzo, Gzip and Bzip2. Snappy is chosen here because it compresses and decompresses quickly with a reasonable compression ratio (note that it is not splittable). For a detailed comparison, see the post below.

Hive data compression and storage format selection
https://blog.csdn.net/wjl7813/article/details/79285542

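6. Shuffle

Shuffle sends data to different reduces according to key, which generates disk and network I/O. If the keys are unevenly distributed, data skew results.

The Reducer source code below also explains shuffle:
The Reducer copies the sorted output from each Mapper using HTTP across the network.
The key word in this explanation is "copies". What the copy does is explained in detail below.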

6.1 The Reducer source comments show Shuffle --> Sort --> SecondarySort

The phases are Shuffle --> Sort --> SecondarySort. SecondarySort makes a secondary sort possible. Take a Hive SQL query with a two-level sort: select name,age,sex from student order by age asc,sex desc; when it runs on MapReduce, it is translated into an MR job that performs a secondary sort.

/**
 * ......
 * <p><code>Reducer</code> has 3 primary phases:</p>
 * <ol>
 *   <li>
 *
 *   <h4 id="Shuffle">Shuffle</h4>
 *
 *   <p>The <code>Reducer</code> copies the sorted output from each
 *   {@link Mapper} using HTTP across the network.</p>
 *   </li>
 *
 *   <li>
 *   <h4 id="Sort">Sort</h4>
 *
 *   <p>The framework merge sorts <code>Reducer</code> inputs by
 *   <code>key</code>s
 *   (since different <code>Mapper</code>s may have output the same key).</p>
 *
 *   <p>The shuffle and sort phases occur simultaneously i.e. while outputs are
 *   being fetched they are merged.</p>
 *
 *   <h5 id="SecondarySort">SecondarySort</h5>
 *
 *   <p>To achieve a secondary sort on the values returned by the value
 *   iterator, the application should extend the key with the secondary
 *   key and define a grouping comparator. The keys will be sorted using the
 *   entire key, but will be grouped using the grouping comparator to decide
 *   which keys and values are sent in the same call to reduce.The grouping
 *   comparator is specified via
 *   {@link Job#setGroupingComparatorClass(Class)}. The sort order is
 *   controlled by
 *   {@link Job#setSortComparatorClass(Class)}.</p>
 * ......
 */
@Checkpointable
@InterfaceAudience.Public
@InterfaceStability.Stable
public class Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT> {
......
}

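6.2 Copy -- reduce fetches the map-side output

1. First, the reduce task downloads its data from each map task over HTTP, which involves network transfer. To guard against a reduce task failing and having to be rerun, the map output is deleted only after the whole job has finished. The parameter mapreduce.job.reduce.slowstart.completedmaps controls what fraction of the maps must complete before reduces start.

2. Second, by default each Reducer has only 5 parallel download threads fetching data from the maps. When the Reducer's memory and network are both good, mapreduce.reduce.shuffle.parallelcopies can be increased. The large diagram shows 3 threads downloading map-side data.

3. When a reduce download thread hits an error while fetching map data, it waits for a configured time, gives up, and later tries to download from somewhere else. If the cluster network itself is the bottleneck, increasing this wait time prevents reduce download threads from being wrongly judged as failed. Timeout parameters of this kind, here mapreduce.reduce.shuffle.read.timeout, can generally be increased to keep MapReduce running stably.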

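6.3 MergeSort -- the key part: three kinds of merge

The merge here resembles the map-side merge, except that the buffer holds data copied from different maps.

1. Copied data first goes into an in-memory buffer and is spilled to disk only once memory use reaches a certain level. This buffer is sized more flexibly than the map-side one: it is a fraction of the JVM heap, set by mapreduce.reduce.shuffle.input.buffer.percent (default 0.70).

2. The threshold that starts the memory-to-disk merge is set by mapreduce.reduce.shuffle.merge.percent (default 0.66); in other words, the spill threshold is 0.66.

A fraction of the reduce task's maximum heap (usually set with -Xmx in mapreduce.reduce.java.opts, for example -Xmx1024m) is used to cache the copied data. By default the reduce uses 70% of its heap for this. With mapreduce.reduce.shuffle.input.buffer.percent at 0.7 and a maximum heap of 1 GB, roughly 700 MB is available for caching downloaded data. As on the map side, this 700 MB is not flushed only when it is completely full. Once usage reaches a limit (a percentage), flushing to disk begins, preceded by a sort-merge. That limit is set by mapreduce.reduce.shuffle.merge.percent (default 0.66). Like the map side, this is a spill; if a Combiner is configured it runs here too, and many spill files are produced on disk. This kind of merge keeps running until no more map output is arriving, after which the disk-to-disk merge produces the final file.

3. [Key point] The merge takes three forms

3.1 Memory to memory (memToMemMerger)

Hadoop defines a MemToMem merge, which merges map outputs held in memory and writes the result back to memory. It is off by default and can be turned on with mapreduce.reduce.merge.memtomem.enabled (default: false). It is triggered when the number of map outputs reaches mapreduce.reduce.merge.memtomem.threshold.

3.2 In-memory merge (inMemoryMerger)

When the data in the buffer reaches the configured threshold, it is merged in memory and written to the local disk. The threshold can be configured in two ways:

(1) As a memory fraction: mapreduce.reduce.shuffle.merge.percent (default: 0.66)
Official description: The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.
As mentioned earlier, part of the reduce JVM heap holds the input coming from map tasks; this parameter sets the fraction of that part at which merging starts. If 500 MB is used to hold map output and mapreduce.reduce.shuffle.merge.percent is 0.66, a merge and write to disk is triggered when the data in memory reaches 330 MB.

(2) As a number of map outputs: mapreduce.reduce.merge.inmem.threshold (default: 1000)
Official description: The threshold, in terms of the number of files for the in-memory merge process. When we accumulate threshold number of files we initiate the in-memory merge and spill to disk. A value of 0 or less than 0 indicates we want to DON'T have any threshold and instead depend only on the ramfs's memory consumption to trigger the merge.

During the merge, the files being merged are sorted as a whole. If the job has a Combiner configured, the combine function runs and reduces the amount of data written to disk.

3.3 On-disk merge (onDiskMerger)

(1) Disk merge during the copy: as copied data keeps being written to disk, a background thread merges these files into larger sorted files. If the map output was compressed, it has to be decompressed in memory before it can be merged. This merge only reduces the work left for the final merge, that is, part of the merging starts while map output is still being copied. The merge again sorts the merged data as a whole.

(2) Final merge on disk: the results of the merges above are combined in a final merge. At the end (after all map outputs have been copied to the reduce), the map outputs to be merged may come from files that were merged and written to disk, or from the memory buffer: the map outputs written to memory last may not have reached the threshold that triggers a merge, so they are still in memory.

mapreduce.task.io.sort.factor (default: 10) is also the merge factor on the reduce side. Each round does not necessarily merge the same number of files. The guiding principle is to minimize the amount of data written to disk over the whole merge. To achieve this, the final round should merge as much data as possible, because the output of the last round feeds the reduce directly and does not have to be written to disk and read back. So the final round merges the maximum number of files, which is the merge factor configured by mapreduce.task.io.sort.factor (default: 10).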
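The reduce-side memory figures in this section follow from three multiplications. The sketch below reproduces them in plain Java, rounded to whole MB; the class and method names are illustrative, and mapreduce.reduce.shuffle.memory.limit.percent (default 0.25) is the per-map-output limit that appears in section 6.4.4.

```java
// Arithmetic for the reduce-side shuffle buffer; names are illustrative.
final class ShuffleMemorySketch {
    // Heap share reserved for copied map output (input.buffer.percent).
    static long shuffleBufferMb(long maxHeapMb, double inputBufferPercent) {
        return Math.round(maxHeapMb * inputBufferPercent);
    }

    // Buffer usage at which the in-memory merge starts (merge.percent).
    static long mergeTriggerMb(long bufferMb, double mergePercent) {
        return Math.round(bufferMb * mergePercent);
    }

    // Largest single map output kept in memory (memory.limit.percent).
    static long singleOutputLimitMb(long bufferMb, double memoryLimitPercent) {
        return Math.round(bufferMb * memoryLimitPercent);
    }

    public static void main(String[] args) {
        long buffer = shuffleBufferMb(1024, 0.70);
        System.out.println(buffer);                            // 717, the "about 700 MB" above
        System.out.println(mergeTriggerMb(buffer, 0.66));      // 473
        System.out.println(singleOutputLimitMb(buffer, 0.25)); // 179
        System.out.println(mergeTriggerMb(500, 0.66));         // 330, the example in 3.2
    }
}
```

A map output larger than the single-output limit bypasses memory and is written straight to disk, so raising the buffer fraction alone does not keep large outputs in memory.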

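6.4 Production tuning parameters

6.4.1 mapreduce.job.reduce.slowstart.completedmaps (default: 0.05)

Official description: Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.

What value is appropriate?

(1) If mapreduce.job.reduce.slowstart.completedmaps is set too low, reduces request resources too early and waste them. When resources are tight, reduces that were requested early hold resources the maps need in order to run, while the reduces themselves cannot move to the next phase until all maps have finished. The two sides wait on each other in a deadlock-like situation, and in the end the AppMaster kills reduces to release resources for the maps.

(2) If it is set too high, for example 1, resources for reduces are requested only after all maps have completed. Map and reduce then run serially, and the job cannot use resources in parallel. If there are many maps, it is generally advisable to request reduce resources earlier.

<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.7</value>
  <description>Fraction of maps that must complete before reduce resources are requested and reduces start; default 0.05</description>
</property>

Evaluation: in production, 0.7 or 0.8 is usually a good setting.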
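To see what the fraction means for a concrete job, the sketch below converts it into a number of completed maps. Rounding up is an assumption made for this sketch, and the 250-map job is a made-up example; the class and method names are illustrative.

```java
// Converts the slowstart fraction into a count of completed maps; names are illustrative.
final class SlowstartSketch {
    // Assumption: the fraction of total maps is rounded up to a whole task.
    static int mapsBeforeReduces(int totalMaps, double slowstart) {
        return (int) Math.ceil(totalMaps * slowstart);
    }

    public static void main(String[] args) {
        int totalMaps = 250; // hypothetical job size
        System.out.println(mapsBeforeReduces(totalMaps, 0.05)); // 13: reduces requested very early
        System.out.println(mapsBeforeReduces(totalMaps, 0.7));  // 175: most maps are done first
        System.out.println(mapsBeforeReduces(totalMaps, 1.0));  // 250: fully serial
    }
}
```

With the default, reduce containers would be requested after only 13 of 250 maps and then sit idle holding resources, which is the waste described in point (1).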

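6.4.2 mapreduce.reduce.shuffle.parallelcopies (default: 5)

Official description: The default number of parallel transfers run by reduce during the copy(shuffle) phase.

By default each Reducer has only 5 parallel download threads fetching data from the maps. Even if 100 or more maps of the job complete within a period of time, a reduce can still download from at most 5 maps at once. Increase this parameter when:

(1) the job has many maps that complete quickly, so that each reduce can fetch its share of the data faster;
(2) the reducer's memory and network are both good.

Evaluation: usually not adjusted in production.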

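6.4.3 mapreduce.reduce.shuffle.read.timeout (default: 180000 ms)

Official description: Expert: The maximum amount of time (in milli seconds) reduce task waits for map output data to be available for reading after obtaining connection.

While a reduce download thread is fetching the data of a map, the download can fail because the machine holding that map's intermediate output has failed, the intermediate file has been lost, the network has dropped briefly, and so on. The download thread therefore does not wait indefinitely. If the download still has not succeeded after a certain time, the thread gives up this attempt and later tries to download from somewhere else (the map may have been rerun in the meantime). This parameter is the maximum time a reduce download thread waits.

Evaluation: timeouts of this kind can generally be increased. If no reduce download errors occur, leaving it unchanged is also fine.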

6.4.4 mapreduce.reduce.shuffle.input.buffer.percent (default: 0.70)

(1) Official description: The percentage of memory to be allocated from the maximum heap size to storing map outputs during the shuffle.
This means the shuffle can hold at most 0.7 * (maximum heap of the reduce task) of data in the reduce's memory, that is, 70% of the JVM heap.

(2) Evaluation: in production this value can be increased, but it depends on the situation.

<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.81</value>
  <description>Fraction of the reduce heap used to hold map output during the copy phase; default 0.70</description>
</property>
<property>
  <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
  <value>0.25</value>
  <description>Maximum fraction of the shuffle buffer that a single map output may use; default 0.25</description>
</property>

When using Phoenix BulkLoad, setting mapreduce.reduce.shuffle.input.buffer.percent too high produces the following error:

Error: org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#3
Caused by: java.lang.OutOfMemoryError: Java heap space

(3) In this case, lower mapreduce.reduce.shuffle.input.buffer.percent so that more data is spilled to disk. Setting the following parameters in mapred-site.xml lets the Phoenix bulkLoad run without errors:

<property>
  <name>mapreduce.reduce.shuffle.input.buffer.percent</name>
  <value>0.6</value>
  <description>Default 0.7; the shuffle uses 0.6 of the heap. Fraction of heap used to hold map output during the copy phase</description>
</property>
<property>
  <name>mapreduce.reduce.shuffle.memory.limit.percent</name>
  <value>0.15</value>
  <description>Default 0.25; memory limit for a single shuffle fetch, set to 0.15, i.e. shuffle memory * 0.15. Map outputs below this size go to memory, larger ones go to disk</description>
</property>
<property>
  <name>mapreduce.reduce.shuffle.merge.percent</name>
  <value>0.9</value>
  <description>Default 0.66; the merge starts when shuffled data reaches shuffle memory * 0.9. Buffer usage threshold at which spilling starts</description>
</property>

# phoenix bulkLoad (bulk-load CSV data into Phoenix)
HADOOP_CLASSPATH=/phoenix/lib-aux/*:/hbase/hbase-1.4.8/lib/hbase-protocol-1.4.8.jar:/hbase/hbase-1.4.8/conf \
/hadoop/hadoop-2.9.1/bin/hadoop jar /phoenix/lib-aux/phoenix-4.14.1-HBase-1.4-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool  \
-conf /datax/bulkLoadConfig/mapred-site.xml \
--zookeeper  zookeeper01,zookeeper02,zookeeper03:2181:/hbase/db \
--schema ODS \
--table XXX \
--input /tmp/XXX \
--output  /tmp/XXX

# References
Causes of and fixes for out-of-memory errors in the MapReduce shuffle phase
https://blog.csdn.net/houzhizhen/article/details/84773884

6.4.5 mapreduce.reduce.shuffle.merge.percent (default: 0.66)

Official description: The usage threshold at which an in-memory merge will be initiated, expressed as a percentage of the total memory allocated to storing in-memory map outputs, as defined by mapreduce.reduce.shuffle.input.buffer.percent.

This means that once the shuffle data held in the reduce's memory reaches this fraction of the shuffle buffer, flushing to disk begins, preceded by a sort-merge.

Evaluation: usually not adjusted in production. It needs adjusting in some specific cases, such as the Phoenix bulkLoad above.

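7. Reduce

7.1 About Reduce

The number of reduces determines the number of final output files. Once a reduce has downloaded all the data for its partition from every map, the actual reduce computation begins.

(1) mapreduce.reduce.input.buffer.percent (default 0.0)
The percentage of memory- relative to the maximum heap size- to retain map outputs during the reduce. When the shuffle is concluded, any remaining map outputs in memory must consume less than this threshold before the reduce can begin.

As this parameter shows, by default the reduce reads all of its input from disk. If the value is greater than 0, some of the data is kept in memory and fed to the reduce from there. When the reduce logic itself uses little memory, part of the heap can be set aside to cache data, which speeds up the computation. So if memory is plentiful, it is worth setting this parameter so that the reduce reads directly from the cache, which feels a bit like Spark Cache.

(2) In this phase the framework calls reduce(WritableComparable, Iterator, OutputCollector, Reporter) once for each <key, (list of values)> pair in the grouped input. The output of the reduce task is usually written to the file system by calling OutputCollector.collect(WritableComparable, Writable).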

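7.2 Production tuning parameters

7.2.1 mapreduce.reduce.memory.mb (default: 1024 MB)

<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
  <description>Memory requested for each reduce, in MB (3072)</description>
</property>
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>1024</value>
  <description>Memory requested for each map, in MB (3072)</description>
</property>

Evaluation: the container size for a single map or reduce defaults to 1024 MB; in production, increase it as the workload requires.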

7.2.2 mapred.child.java.opts (default: -Xmx200m)

These are the JVM options for the processes that run map and reduce tasks. The official description:

Java opts for the task processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc Usage of -Djava.library.path can cause programs to no longer function if hadoop native libraries are used. These values should instead be set as part of LD_LIBRARY_PATH in the map / reduce JVM env using the mapreduce.map.env and mapreduce.reduce.env config settings.

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1200m</value>
  <description>JVM heap for map and reduce tasks</description>
</property>

Evaluation: in production this can be raised to 5-10 times the default. Keep -Xmx below the container size set by mapreduce.map.memory.mb and mapreduce.reduce.memory.mb, otherwise YARN may kill the container for exceeding its memory limit.

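7.2.3 Output compression on the reduce side: Snappy or Lzo

<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
  <description>Compress the final output</description>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>Compression codec class</description>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
  <description>Compression type</description>
</property>

Evaluation:
(1) With Hive, parameters can control the output file size so that each output file is about 128 MB. In that case Snappy is fine.
(2) If the output file size cannot be controlled and Snappy is used, a later job that processes this data (say a 2 GB file) can start only one map for it, because Snappy files cannot be split. In that case use a splittable format such as Lzo (Lzo files become splittable once they are indexed).

# References
Hive data compression and storage format selection
https://blog.csdn.net/wjl7813/article/details/79285542
The MapReduce process in detail and its performance optimization
https://blog.csdn.net/aijiudu/article/details/72353510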
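The cost of a non-splittable output file is easy to quantify. The sketch below counts the map tasks a downstream job would get for one file; it is a simplification that ignores the small slack FileInputFormat allows on the last split, and the class and method names are illustrative.

```java
// Counts downstream map tasks for one input file; names are illustrative.
final class CompressedSplitSketch {
    static long mapTasks(long fileSizeMb, long splitSizeMb, boolean splittable) {
        if (!splittable) {
            return 1; // the whole file must be read by a single map
        }
        return (fileSizeMb + splitSizeMb - 1) / splitSizeMb; // rounded up
    }

    public static void main(String[] args) {
        // The 2 GB file from the evaluation above, with 128 MB splits.
        System.out.println(mapTasks(2048, 128, false)); // 1  (for example, a raw Snappy file)
        System.out.println(mapTasks(2048, 128, true));  // 16 (a splittable format)
        // Files kept near 128 MB need one map either way, so Snappy costs nothing here.
        System.out.println(mapTasks(128, 128, false));  // 1
    }
}
```

This is why controlling the output file size, as in case (1), removes the main drawback of Snappy.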
