A Java Programmer's Road to Big Data (2): Creating Your First Hadoop Program

Environment

Ubuntu 16.04 + Hadoop 2.7.4 + IntelliJ IDEA 2017.2 + JDK 1.8

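Creating the program

Create a new project

Create a new project in IntelliJ IDEA, then enter a project name.
Any name will do; I used "firstHadoop". Then click Finish.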

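Create the package structure and the Java file

Because this project is very simple, I put WordCount.java directly in the src directory. A larger project should still have a sensible package structure.

The WordCount code is as follows (it is also among the examples that ship with Hadoop):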

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable longWritable, Text text,
                        OutputCollector<Text, IntWritable> outputCollector,
                        Reporter reporter) throws IOException {
            String line = text.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                outputCollector.collect(word, one);
            }
        }
    }

    // Reducer: sums all the counts collected for one word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text text, Iterator<IntWritable> iterator,
                           OutputCollector<Text, IntWritable> outputCollector,
                           Reporter reporter) throws IOException {
            int sum = 0;
            while (iterator.hasNext()) {
                sum += iterator.next().get();
            }
            outputCollector.collect(text, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordCount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        // args[0] is the input path, args[1] is the output path.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
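
As an aside, the data flow of the job above can be tried without Hadoop at all. The following is only a minimal sketch using plain JDK collections: the class LocalWordCount and its methods are hypothetical names introduced for illustration, and the grouping step is a simple in-memory stand-in for what the framework does between the map and reduce phases.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-JDK illustration of the map -> group by key -> reduce flow.
// LocalWordCount is a hypothetical name; nothing here uses Hadoop.
class LocalWordCount {

    // "Map" step: emit one (word, 1) pair per whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return pairs;
    }

    // Grouping and "reduce" step: sum the values of all pairs sharing a word.
    // A TreeMap keeps the words sorted, so the printed order is stable.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    // Runs the map step on every line, then reduces all emitted pairs.
    static Map<String, Integer> count(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        return reduce(pairs);
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, world=1}
        System.out.println(count(Arrays.asList("hello hadoop", "hello world")));
    }
}
```

The real job differs in that the pairs are serialized as Text and IntWritable and may be spread across several machines, but the per-word arithmetic is the same.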

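At this point the IDE flags many errors in WordCount.java, because the Hadoop classes cannot be resolved yet. The next step is to import the jars Hadoop needs.

Import the jars Hadoop depends on

Right-click the project name and choose Open Module Settings, or press Ctrl+Shift+Alt+S, to open Project Structure. Under Modules, open the Dependencies tab. If no JDK is listed, add one first.
1. Add the jars Hadoop depends on
Click the + button on the right (in the 2016 version of IDEA it is at the bottom) and choose JARs or directories.
Add each of the directories under share/hadoop in your downloaded Hadoop distribution, one by one. On my machine this is /usr/local/hadoop/share/hadoop; adjust it to wherever your Hadoop is installed.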

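2. Add Artifacts
Create a new empty JAR, give it a name, and then add Module Output.

Select the project we created and click OK. The jars Hadoop depends on are now in place and the errors in the code should disappear. If they do not, build the project manually.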

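Run configuration

Next, set up Run/Debug Configurations.
1. Click the + button to create a new Application configuration and give it a name.

For the main class, select the class that contains the main function; here that is simply WordCount.
2. Configure Program arguments

The two arguments are the input location and the output location. The hdfs://localhost:9000 prefix depends on your Hadoop configuration: it is the value of the fs.defaultFS property in core-site.xml, so use whatever value you configured. The rest of each argument is the absolute path of the file in HDFS.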
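
To make the two halves of such an argument concrete, here is a small sketch that splits one with java.net.URI. The class HdfsUriParts, its method names, and the user name alice in the sample value are assumptions made for illustration; only the hdfs://localhost:9000 prefix comes from the setup described above.

```java
import java.net.URI;

// Hypothetical helper that separates a program argument into the
// file system part (which should match fs.defaultFS) and the HDFS path.
class HdfsUriParts {

    // Scheme plus host and port, e.g. hdfs://localhost:9000.
    static String fileSystemPart(String location) {
        URI uri = URI.create(location);
        return uri.getScheme() + "://" + uri.getAuthority();
    }

    // The absolute path inside HDFS, e.g. /user/alice/input.
    static String pathPart(String location) {
        return URI.create(location).getPath();
    }

    public static void main(String[] args) {
        // Example value only; "alice" is a made-up user name.
        String input = "hdfs://localhost:9000/user/alice/input";
        System.out.println(fileSystemPart(input)); // prints hdfs://localhost:9000
        System.out.println(pathPart(input));       // prints /user/alice/input
    }
}
```

If the first printed value does not match your fs.defaultFS setting, that mismatch is a likely reason for the job failing to find its input.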

Put the file into HDFS

If Hadoop is not running, start it first. Then put the local file into HDFS. Pay attention to the file path. If you run

hadoop fs -put xxx.txt input

then the file's absolute path is /user/<your username>/input.

If you instead run

hadoop fs -put xxx.txt /input

then the file's absolute path is /input.
The same rule applies to the output path.
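
The rule described above can be written down in a few lines. This is only a sketch of the rule as stated in this post, not Hadoop's own implementation; the class HdfsPathRule and the user name alice are hypothetical.

```java
// Hypothetical illustration of how a path given to "hadoop fs" maps to
// an absolute HDFS path, following the rule described in the text.
class HdfsPathRule {

    // Paths starting with "/" are already absolute; any other path is
    // taken relative to the user's home directory, /user/<userName>.
    static String resolve(String userName, String path) {
        if (path.startsWith("/")) {
            return path;
        }
        return "/user/" + userName + "/" + path;
    }

    public static void main(String[] args) {
        System.out.println(resolve("alice", "input"));  // prints /user/alice/input
        System.out.println(resolve("alice", "/input")); // prints /input
    }
}
```

Whichever form you use for the put command, the path in Program arguments has to be the resolved absolute one.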

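Run the program and check the result

Right-click in the editor and run the program. When it finishes, get the results from the output path to your local machine to inspect them.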

hadoop fs -get /output ./output
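
Once the output is local, it can be read back into Java. The sketch below assumes each result line has the form word, a tab, then the count, which is the default layout written by TextOutputFormat. The class WordCountOutputParser is a hypothetical name, and the sample lines are made up rather than taken from a real run.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for WordCount result lines of the form "word<TAB>count".
class WordCountOutputParser {

    // Turns result lines into a word -> count map, keeping the file order.
    // Lines that are empty or lack a tab separator are skipped.
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue;
            }
            String word = line.substring(0, tab);
            int count = Integer.parseInt(line.substring(tab + 1).trim());
            counts.put(word, count);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the contents of a result file.
        List<String> sample = Arrays.asList("hadoop\t1", "hello\t2", "world\t1");
        System.out.println(parse(sample)); // prints {hadoop=1, hello=2, world=1}
    }
}
```

To use it on real output, the lines could be loaded with java.nio.file.Files.readAllLines from a result file inside ./output and passed to parse.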

With that, my first Hadoop program is complete.
