MapReduce Study Notes: Combiner, Partitioner, and JobHistory

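I. Combiner

In the MapReduce programming model, the Combiner is an important optional component that sits between the Mapper and the Reducer. Its main purpose is to relieve MR performance bottlenecks.

A combiner is an optimization: because network bandwidth is limited, the amount of data transferred between the map and reduce phases should be kept as small as possible. The combiner runs on the map side, merges the key-value pairs that share a key, and computes over them with the same logic as the reducer, so it can be seen as a special Reducer (a local reduce). This only gives correct results when the operation is commutative and associative, such as a sum or a count; an average, for example, cannot be combined this way.
A combiner runs only if the developer sets one in the program, by calling job.setCombinerClass(myCombine.class) with a custom combiner class.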

In wordcount, MyReducer is used directly as the combiner:

// Set the map-side combiner
job.setCombinerClass(MyReducer.class);
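To make the effect concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that simulates what a summing combiner does to the output of two map tasks. The class name CombinerSketch, its helper methods, and the sample words are hypothetical, and this is not how Hadoop implements the combiner internally. It only illustrates why fewer records need to be shuffled while the final counts stay the same.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of a map-side combine; all names are hypothetical.
class CombinerSketch {

    // Map phase for word count: emit (word, 1) for every word in the split.
    static List<Map.Entry<String, Long>> map(String split) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1L));
        }
        return out;
    }

    // Shared combine/reduce logic: sum the values of each key.
    static Map<String, Long> sumByKey(List<Map.Entry<String, Long>> records) {
        Map<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> record : records) {
            sums.merge(record.getKey(), record.getValue(), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> task1 = map("hello world hello");
        List<Map.Entry<String, Long>> task2 = map("hello hadoop");

        // Without a combiner, every raw record is sent to the reducer.
        List<Map.Entry<String, Long>> raw = new ArrayList<>(task1);
        raw.addAll(task2);

        // With a combiner, each task pre-aggregates its own output first.
        List<Map.Entry<String, Long>> combined = new ArrayList<>(sumByKey(task1).entrySet());
        combined.addAll(sumByKey(task2).entrySet());

        System.out.println("shuffled without combiner: " + raw.size());
        System.out.println("shuffled with combiner:    " + combined.size());
        System.out.println("same final result: " + sumByKey(raw).equals(sumByKey(combined)));
    }
}
```

Because the same sumByKey logic serves as both the combine step and the reduce step, this mirrors the wordcount case where the reducer class is reused as the combiner.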

Reference: https://www.tuicool.com/articles/qAzUjav

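II. Partitioner

The Partitioner is another important MR component. Its main functions are:

1) The Partitioner decides which ReduceTask processes each record output by a MapTask.

2) Default implementation: the hash of the key being distributed, modulo the number of ReduceTasks.

reducer = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, which gives the target reducer for the current key.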
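The default formula can be tried out in plain Java. The sketch below mirrors the expression outside of Hadoop; the class name HashPartitionSketch and the sample keys are illustrative assumptions. It also shows why the & Integer.MAX_VALUE mask matters: Java's % operator keeps the sign of the dividend, so a negative hashCode() would otherwise produce a negative partition number.

```java
// Plain-Java sketch of hash partitioning; this is not Hadoop's own class.
class HashPartitionSketch {

    // Same expression as the default rule: clear the sign bit, then take the remainder.
    static int getPartition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        String[] keys = {"xiaomi", "huawei", "iphonex", "iphone7"};
        for (String key : keys) {
            System.out.println(key + " -> reducer " + getPartition(key, numReduceTasks));
        }

        // Integer.hashCode() is the int value itself, so -7 has a negative hash.
        Integer negative = -7;
        System.out.println("without mask: " + (negative.hashCode() % numReduceTasks)); // -3
        System.out.println("with mask:    " + getPartition(negative, numReduceTasks)); // 1
    }
}
```

With hash partitioning, which brand lands on which reducer depends on the hash values, so related keys are not guaranteed to get a reducer of their own. A custom Partitioner gives control over that placement.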

Example:

File contents:

xiaomi 200
huawei 500
xiaomi 300
huawei 700
iphonex 100
iphonex 30
iphone7 60
The job below partitions the file contents above by phone brand and distributes the records to four reducers for aggregation:

package rdb.com.hadoop01.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * @author rdb
 */
public class PartitionerApp {

    /**
     * Mapper: reads the input file.
     * @author rdb
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        @Override
        protected void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, Text, LongWritable>.Context context)
                throws IOException, InterruptedException {
            // Receive one line of input
            String line = value.toString();
            // Split on spaces
            String[] words = line.split(" ");
            // Write the map result (brand, amount) through the context
            context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));
        }
    }

    /**
     * Reducer: merges and totals the values.
     * @author rdb
     */
    public static class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Reducer<Text, LongWritable, Text, LongWritable>.Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable value : values) {
                // Add up the values for this key
                sum += value.get();
            }
            // Write the reduce result through the context
            context.write(key, new LongWritable(sum));
        }
    }

    /**
     * Custom partitioner.
     * @author rdb
     */
    public static class MyPartitioner extends Partitioner<Text, LongWritable> {

        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            if (key.toString().equals("xiaomi")) {
                return 0;
            }
            if (key.toString().equals("huawei")) {
                return 1;
            }
            if (key.toString().equals("iphonex")) {
                return 2;
            }
            return 3;
        }
    }

    /**
     * Custom driver: wraps all the information about the MapReduce job.
     * @param args input path and output path
     * @throws Exception
     */
    public static void main(String[] args) throws Exception {
        // Create the configuration
        Configuration configuration = new Configuration();

        // Remove the output directory if it already exists
        Path out = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if (fileSystem.exists(out)) {
            fileSystem.delete(out, true);
            System.out.println("output exists, but it has been deleted");
        }

        // Create the job
        Job job = Job.getInstance(configuration, "WordCount");

        // Set the class that carries the job
        job.setJarByClass(PartitionerApp.class);

        // Set the input path of the job
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Set the map-related parameters
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Set the reduce-related parameters
        job.setReducerClass(MyReduce.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Set the combiner class; its logic is the same as the reducer's
        //job.setCombinerClass(MyReduce.class);

        // Set the job's partitioner
        job.setPartitionerClass(MyPartitioner.class);
        // Set 4 reducers, one per partition
        job.setNumReduceTasks(4);

        // Set the output path of the job
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

After packaging, run:

hadoop jar ~/lib/hadoop01-0.0.1-SNAPSHOT.jar rdb.com.hadoop01.mapreduce.PartitionerApp hdfs://hadoop01:8020/partitioner.txt hdfs://hadoop01:8020/output/partitioner

Result:

-rw-r--r--   1 hadoop supergroup         11 2018-05-09 06:35 /output/partitioner/part-r-00000
-rw-r--r--   1 hadoop supergroup         12 2018-05-09 06:35 /output/partitioner/part-r-00001
-rw-r--r--   1 hadoop supergroup         12 2018-05-09 06:35 /output/partitioner/part-r-00002
-rw-r--r--   1 hadoop supergroup         11 2018-05-09 06:35 /output/partitioner/part-r-00003
[hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00000
18/05/09 06:36:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
xiaomi  500
[hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00001
18/05/09 06:36:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
huawei  1200
[hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00002
18/05/09 06:36:49 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
iphonex 130
[hadoop@hadoop01 lib]$ hadoop fs -text /output/partitioner/part-r-00003
18/05/09 06:36:57 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
iphone7 60
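
The routing and totals above can be cross-checked without a cluster. The following plain-Java sketch reuses the branching from MyPartitioner and the sample file contents. The class PartitionSimulation and its method names are hypothetical, and it does not use the Hadoop API, so treat it as a rough local model of the job rather than a substitute for running it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Local simulation of MyPartitioner plus a summing reduce; no Hadoop needed.
class PartitionSimulation {

    // Same branching as MyPartitioner.getPartition in the job above.
    static int getPartition(String key) {
        if (key.equals("xiaomi")) {
            return 0;
        }
        if (key.equals("huawei")) {
            return 1;
        }
        if (key.equals("iphonex")) {
            return 2;
        }
        return 3;
    }

    // Route each "brand amount" line to its partition and sum per brand.
    static List<Map<String, Long>> run(String[] lines, int numReduceTasks) {
        List<Map<String, Long>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<>());
        }
        for (String line : lines) {
            String[] words = line.trim().split("\\s+");
            long amount = Long.parseLong(words[1]);
            partitions.get(getPartition(words[0])).merge(words[0], amount, Long::sum);
        }
        return partitions;
    }

    public static void main(String[] args) {
        String[] lines = {
            "xiaomi 200", "huawei 500", "xiaomi 300", "huawei 700",
            "iphonex 100", "iphonex 30", "iphone7 60"
        };
        List<Map<String, Long>> result = run(lines, 4);
        for (int i = 0; i < result.size(); i++) {
            System.out.println("part-r-0000" + i + " " + result.get(i));
        }
    }
}
```

Hard-coded partition numbers like these only work when the job runs with enough reducers to cover the highest number returned, which is why the driver sets four reduce tasks.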

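III. JobHistory

JobHistory records the logs of finished MapReduce jobs. The log data is stored in HDFS directories. The service is not enabled by default and has to be configured.

1) Configure hadoop-2.6.0-cdh5.7.0/etc/hadoop/mapred-site.xml: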

<property>
    <name>mapreduce.jobhistory.address</name>
    <value>hadoop01:10020</value>
    <description>IPC address (host:port) of the MR JobHistory Server</description>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>hadoop01:19888</value>
    <description>Web address for viewing the records of MapReduce jobs that have finished; the JobHistory Server must be running</description>
</property>
<property>
    <name>mapreduce.jobhistory.done-dir</name>
    <value>/history/done</value>
    <description>Directory where the MR JobHistory Server keeps the history files it manages; default: ${yarn.app.mapreduce.am.staging-dir}/history/done</description>
</property>
<property>
    <name>mapreduce.jobhistory.intermediate-done-dir</name>
    <value>/history/done_intermediate</value>
    <description>Directory where MapReduce jobs write their history files; default: ${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate</description>
</property>

2) After configuring, restart YARN, then start the JobHistory service: hadoop-2.6.0-cdh5.7.0/sbin/mr-jobhistory-daemon.sh start historyserver

[hadoop@hadoop01 sbin]$ jps
24321 JobHistoryServer
24353 Jps
23957 NodeManager
7880 DataNode
8060 SecondaryNameNode
23854 ResourceManager
7791 NameNode
[hadoop@hadoop01 sbin]$

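3) Open http://192.168.44.183:19888/ in a browser.

Run a MapReduce program in the background: hadoop jar ~/lib/hadoop01-0.0.1-SNAPSHOT.jar rdb.com.hadoop01.mapreduce.WordCountApp hdfs://hadoop01:8020/hello.txt hdfs://hadoop01:8020/output/wc

Refresh the browser and the record for the job that just ran appears in the list.

Click logs on the entry for the MR job to view its detailed logs.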
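
Besides the web pages, the History Server also serves job records over a REST interface. The sketch below assumes the REST path /ws/v1/history/mapreduce/jobs and the web address configured in mapreduce.jobhistory.webapp.address, and fetches the job list as JSON with the JDK's HttpURLConnection. The class name JobHistoryClient and its helper methods are hypothetical. The request is only sent when an address is passed on the command line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Minimal client sketch for the History Server job list; names are hypothetical.
class JobHistoryClient {

    // Build the job-list URL for a web address such as "hadoop01:19888".
    static String jobsUrl(String webappAddress) {
        return "http://" + webappAddress + "/ws/v1/history/mapreduce/jobs";
    }

    // Send a GET request and return the response body as a string.
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            // No address given: show the URL that would be requested and stop.
            System.out.println("usage: java JobHistoryClient <host:port>");
            System.out.println("example URL: " + jobsUrl("hadoop01:19888"));
            return;
        }
        System.out.println(fetch(jobsUrl(args[0])));
    }
}
```

Running it with hadoop01:19888 as the argument should return the same finished jobs that the web page lists, provided the JobHistory service is up.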

Issues encountered:

Reposted from: https://www.cnblogs.com/jnba/p/10670828.html
