Hadoop Series: Reporter, Partitioner, JobConf, and JobClient

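Reporter is the facility a MapReduce application uses to report progress, set application-level status messages, update Counters, or simply signal that it is still alive.

For example: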

    // caseSensitive, patternsToSkip, word, one, numRecords, inputFile, and the
    // Counters enum are assumed to be defined in the enclosing Mapper class.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
        String line = caseSensitive ? value.toString() : value.toString().toLowerCase();

        for (String pattern : patternsToSkip) {
            line = line.replaceAll(pattern, "");
        }

        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
            // Update an application-defined counter for every word emitted.
            reporter.incrCounter(Counters.INPUT_WORDS, 1);
        }

        // Every 100 records, set a status message describing the progress so far.
        if ((++numRecords % 100) == 0) {
            reporter.setStatus("Finished processing " + numRecords + " records "
                    + "from the input file: " + inputFile);
        }
    }

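Partitioner usage: after the Mapper output is sorted, it is partitioned among the Reducers. The total number of partitions equals the number of reduce tasks for the job. By implementing a custom Partitioner, users control which keys go to which Reducer: the Partitioner divides the keys of the map output into groups according to some algorithm, and each group is handed to one reduce task.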
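
As a rough illustration of the partitioning idea above, the following standalone Java sketch mimics a hash-based scheme: it masks the sign bit of the key's hash code and takes the result modulo the number of reduce tasks. It deliberately avoids Hadoop's classes so that it runs on its own. The class name PartitionSketch and the methods partitionFor and group are hypothetical names chosen for this example; in a real job the same arithmetic would live in the getPartition method of a Partitioner implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {
    // Hypothetical helper: maps a key to a partition index in [0, numPartitions).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    public static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Groups keys by partition index; each group would go to one reduce task.
    public static Map<Integer, List<String>> group(List<String> keys, int numPartitions) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (String key : keys) {
            int partition = partitionFor(key, numPartitions);
            groups.computeIfAbsent(partition, p -> new ArrayList<>()).add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hadoop", "map", "reduce", "hadoop", "job");
        // With 3 partitions, identical keys always land in the same group.
        System.out.println(group(keys, 3));
    }
}
```

Because the partition depends only on the key, every occurrence of the same key reaches the same reduce task, which is what allows a Reducer to see all values for a key together.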

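JobConf represents the configuration of a MapReduce job.

JobConf is the primary interface through which a user describes to the Hadoop framework how a MapReduce job should be executed. The framework tries to execute the job faithfully as described by the JobConf.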

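Typically, JobConf specifies the concrete implementations of the Mapper, Combiner (if any), Partitioner, Reducer, InputFormat, and OutputFormat. The set of input files is specified with setInputPaths(JobConf, Path...)/addInputPath(JobConf, Path) or setInputPaths(JobConf, String)/addInputPaths(JobConf, String), and the output location with setOutputPath(JobConf, Path). As the JobClient example in this post shows, these are static methods on FileInputFormat and FileOutputFormat that take the JobConf as their first argument.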

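JobConf can optionally set a number of advanced options for the job, for example: the Comparator to use; files to be put in the DistributedCache; whether and how intermediate results or job outputs are compressed; debugging with user-provided scripts (setMapDebugScript(String)/setReduceDebugScript(String)); whether speculative execution of tasks is allowed (setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean)); the maximum number of attempts per task (setMaxMapAttempts(int)/setMaxReduceAttempts(int)); the percentage of task failures the job can tolerate (setMaxMapTaskFailuresPercent(int)/setMaxReduceTaskFailuresPercent(int)); and so on.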

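Users can also use set(String, String)/get(String, String) to set or read arbitrary parameters that their application needs. For large amounts of read-only data, however, use the DistributedCache instead.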

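JobClient is the primary interface through which a user's job interacts with the JobTracker.

JobClient provides facilities to submit jobs, track their progress, access the logs of component tasks, and get status information about the MapReduce cluster.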

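The job submission process involves:

Checking the input and output specifications of the job.
Computing the InputSplit values for the job.
Setting up the requisite accounting information for the job's DistributedCache, if necessary.
Copying the job's jar and configuration to the MapReduce system directory on the FileSystem.
Submitting the job to the JobTracker and monitoring its status.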
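
To make the InputSplit step above a little more concrete, here is a simplified standalone sketch of how a file-based input format might carve one file into splits. It is modeled on the arithmetic commonly described for the old-API FileInputFormat (split size is the smaller of the goal size and the block size, but never below a minimum), and it is an assumption that your Hadoop version and InputFormat behave this way. The class SplitSketch, the method splitsFor, and the 1.1 slop factor are illustrative choices for this example, not values taken from this post.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Assumed tolerance: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // The smaller of goalSize and blockSize, but never below minSize.
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    // Returns {offset, length} pairs that together cover a file of fileLength bytes.
    public static List<long[]> splitsFor(long fileLength, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {fileLength - remaining, splitSize});
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // Whatever is left becomes the final split.
            splits.add(new long[] {fileLength - remaining, remaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        long fileLength = 250, blockSize = 100, minSize = 1;
        int requestedSplits = 2; // hint for the desired number of splits
        long goalSize = fileLength / requestedSplits;
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        for (long[] split : splitsFor(fileLength, splitSize)) {
            System.out.println("offset=" + split[0] + " length=" + split[1]);
        }
    }
}
```

Each resulting split would then be processed by one map task, so the split size indirectly determines how many map tasks the job runs.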

A JobClient usage example:

    // The enclosing class CompleteWordCount is assumed to extend Configured
    // (which provides getConf()) and to define the nested Map and Reduce classes.
    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), CompleteWordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // Separate the optional "-skip <patternsFile>" flag from the positional arguments.
        List<String> otherArgs = new ArrayList<String>();
        for (int i = 0; i < args.length; ++i) {
            if ("-skip".equals(args[i])) {
                DistributedCache.addCacheFile(new Path(args[++i]).toUri(), conf);
                conf.setBoolean("wordcount.skip.patterns", true);
            } else {
                otherArgs.add(args[i]);
            }
        }

        FileInputFormat.setInputPaths(conf, new Path(otherArgs.get(0)));
        FileOutputFormat.setOutputPath(conf, new Path(otherArgs.get(1)));

        // Submit the job and block until it completes.
        JobClient.runJob(conf);
        return 0;
    }
