MapReduce processing flow: WordCount source walkthrough and run procedure

Reference documentation:
http://hadoop.apache.org/docs/r2.6.0/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

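MapReduce processing flow:

Input: a series of key-value pairs <k1,v1>
map: converts <k1,v1> into <k2,v2>
reduce: converts <k2,v2> (grouped by k2) into <k3,v3>
Output: a series of key-value pairs <k3,v3>

Flow breakdown: read the file -> splitting (cut the input into splits) -> mapping (compute per split) -> shuffling (sort and group by key) -> aggregation of the results (reduce) -> final result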
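The phases above can be sketched in plain Java, without any Hadoop classes. This is only an illustrative single-JVM model: the class name MapReduceFlowSketch and its methods are made up for this example, and the offset arithmetic assumes ASCII text with one-byte line endings.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative single-JVM model of the MapReduce phases; not Hadoop code.
final class MapReduceFlowSketch {

    // map: <k1,v1> = <line offset, line text>  ->  list of <k2,v2> = <word, 1>
    static List<Map.Entry<String, Long>> map(long offset, String line) {
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (String word : line.split(" ")) {
            out.add(new AbstractMap.SimpleEntry<>(word, 1L));
        }
        return out;
    }

    // shuffle: group every v2 under its k2, with keys kept in sorted order
    static Map<String, List<Long>> shuffle(List<Map.Entry<String, Long>> pairs) {
        Map<String, List<Long>> grouped = new TreeMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce: <k2, [v2, ...]>  ->  <k3,v3> = <word, total count>
    static long reduce(String word, List<Long> ones) {
        long sum = 0;
        for (long one : ones) {
            sum += one;
        }
        return sum;
    }

    // Runs the three phases in order over the given lines.
    static Map<String, Long> run(List<String> lines) {
        List<Map.Entry<String, Long>> mapped = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            mapped.addAll(map(offset, line));
            offset += line.length() + 1; // +1 for the line ending (assumption: ASCII, "\n")
        }
        Map<String, Long> result = new TreeMap<>();
        for (Map.Entry<String, List<Long>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("a b a", "b c"))); // prints {a=2, b=2, c=1}
    }
}
```

In a real job the map calls run in parallel on different splits and the shuffle moves data across the network; the sketch only shows how the key-value types change from one phase to the next.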

WordCount source walkthrough:

1. The WordCount code is as follows:

package com.kgc; // matches the class name com.kgc.WordCount used in the run command in step 7

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Custom mapper class
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        // map: splits the work; this is the Mapping step of the programming model
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            LongWritable one = new LongWritable(1);

            // Each call receives one line of input
            String line = value.toString();

            // Split the line on the chosen delimiter
            String[] words = line.split(" ");

            // Emit <word, 1> for every word through the context
            for (String word : words) {
                context.write(new Text(word), one);
            }
        }
    }

    // Custom reducer class (the map output is the reduce input)
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            // Start the count at 0
            long sum = 0;
            for (LongWritable value : values) {
                sum += value.get();
            }

            // Emit the final count for this word
            context.write(key, new LongWritable(sum));
        }
    }

    // Job driver
    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        // Create the Configuration object
        Configuration cfg = new Configuration();

        // Create the job
        Job job = Job.getInstance(cfg, "wordcount");

        // Set the class used to locate the job jar
        job.setJarByClass(WordCount.class);

        // Map-side settings
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Reduce-side settings
        /*
         * A closer look at Combiners (roughly a reduce that runs on the map side).
         * Each map can produce a large amount of output. The combiner merges that
         * output on the map side first, which reduces the amount of data sent to
         * the reducers. At its simplest a combiner merges values per key locally,
         * so it behaves like a local reduce. Without a combiner, all aggregation
         * is done by the reducers, which is comparatively slow. With a combiner,
         * maps that finish early aggregate locally, which speeds the job up.
         * Note: the combiner's output is the reducer's input, and a combiner must
         * never change the final result. Use one only when the reduce input
         * key/value types are identical to its output key/value types and the
         * final result is unaffected, for example sums or maximums.
         */
        job.setCombinerClass(MyReducer.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Check the input and output directories
        FileSystem fs = FileSystem.get(cfg);
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        if (!fs.exists(input)) {
            System.out.println("The directory to process does not exist!");
            System.exit(1);
        }

        // The job fails if the output path already exists, so delete it up front.
        // (deleteOnExit would only remove the path when the FileSystem is closed,
        // which is after the job has already run.)
        if (fs.exists(output)) {
            fs.delete(output, true);
        }

        // Set the job's input path
        FileInputFormat.addInputPath(job, input);

        // Set the job's output path
        FileOutputFormat.setOutputPath(job, output);

        // Run the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
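The rule in the combiner comment (a combiner must not change the final result) can be checked with a small plain-Java sketch. It is not Hadoop code: CombinerSketch and its methods are hypothetical names, and the two lists stand in for the values that two map tasks emit for the same key. Summing survives a per-task pre-aggregation; a naive average of per-task averages does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of the combiner rule; no Hadoop classes involved.
final class CombinerSketch {

    static long sum(List<Long> values) {
        long total = 0;
        for (long v : values) {
            total += v;
        }
        return total;
    }

    static double mean(List<Long> values) {
        return (double) sum(values) / values.size();
    }

    static List<Long> concat(List<Long> a, List<Long> b) {
        List<Long> all = new ArrayList<>(a);
        all.addAll(b);
        return all;
    }

    // Reducer sees every value directly (no combiner).
    static long sumDirect(List<Long> mapA, List<Long> mapB) {
        return sum(concat(mapA, mapB));
    }

    // Each map task pre-sums its own values, then the reducer sums the partial sums.
    static long sumWithCombiner(List<Long> mapA, List<Long> mapB) {
        return sum(Arrays.asList(sum(mapA), sum(mapB)));
    }

    static double meanDirect(List<Long> mapA, List<Long> mapB) {
        return mean(concat(mapA, mapB));
    }

    // Wrong on purpose: averaging per-task averages ignores how many values each task had.
    static double meanWithNaiveCombiner(List<Long> mapA, List<Long> mapB) {
        return (mean(mapA) + mean(mapB)) / 2;
    }

    public static void main(String[] args) {
        List<Long> mapA = Arrays.asList(1L, 2L, 3L);
        List<Long> mapB = Arrays.asList(10L);
        System.out.println(sumDirect(mapA, mapB) + " vs " + sumWithCombiner(mapA, mapB));        // 16 vs 16
        System.out.println(meanDirect(mapA, mapB) + " vs " + meanWithNaiveCombiner(mapA, mapB)); // 4.0 vs 6.0
    }
}
```

This is why MyReducer can double as the combiner for word counting: it only sums, and partial sums add up to the same total.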

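2. After editing the WordCount code, set the project's main class via File > Project Structure.

3. Run package to build the jar, take the jar from the target directory, and upload it to Ubuntu.
4. Start the Ubuntu virtual machine, start DFS, and check the running processes with jps: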

start-dfs.sh
jps

Once DFS has started successfully, its web UI is reachable at IP:50070.

5. Start YARN:

start-yarn.sh

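Once YARN has started successfully, its web UI is reachable at IP:8088.

6. Upload the prepared file hello.txt (mock data) to /opt on the virtual machine first, then put it into /input on HDFS and view it: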

hdfs dfs -mkdir -p /input
hdfs dfs -put /opt/hello.txt /input
hdfs dfs -text /input/hello.txt
# The displayed content is as follows:
deer bear river
car car river
Deer car bear
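As a rough cross-check of what the job should produce for this sample, the sketch below applies the same `split(" ")` tokenization as MyMapper in plain Java. TokenizeSketch and both method names are hypothetical, and countNormalized is not part of the original job: it is one possible variant that lowercases words and splits on runs of whitespace. A TreeMap is used because, for ASCII words, its ordering matches the byte-wise key ordering of Hadoop's Text.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the mapper's tokenization; not Hadoop code.
final class TokenizeSketch {

    // Same split as MyMapper: exactly one space is the delimiter, case is preserved.
    static Map<String, Long> countSingleSpace(String... lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Hypothetical variant: trim, lowercase, and split on any run of whitespace.
    static Map<String, Long> countNormalized(String... lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines instead of counting an empty word
            }
            for (String word : trimmed.toLowerCase(Locale.ROOT).split("\\s+")) {
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] sample = {"deer bear river", "car car river", "Deer car bear"};
        System.out.println(countSingleSpace(sample)); // {Deer=1, bear=2, car=3, deer=1, river=2}
        System.out.println(countNormalized(sample));  // {bear=2, car=3, deer=2, river=2}
    }
}
```

Two consequences of the single-space split are worth knowing: "Deer" and "deer" are counted as different words, and a line with two consecutive spaces yields an empty string that is counted as a word of its own.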

7. Place the packaged jar in the temp directory under app and run it:

hadoop jar test-hdfs-1.0-SNAPSHOT.jar com.kgc.WordCount /input /output

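8. When the job finishes, the 50070 web page shows a new output folder containing two files (_SUCCESS and part-r-00000). part-r-00000 holds the word counts for hello.txt. To view it:

hdfs dfs -cat /output/part-r-00000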

While the job is running, its status is visible at http://192.168.1.14:8088.