大数据培训课程WordCount案例实操

WordCount案例实操

1．需求

在给定的文本文件中统计输出每一个单词出现的总次数

（1）输入数据

（2）期望输出数据

atguigu 2

banzhang 1

cls 2

hadoop 1

jiao 1

ss 2

xue 1

2．需求分析

按照MapReduce编程规范，分别编写Mapper，Reducer，Driver，如图4-2所示。

图4-2 需求分析

3．环境准备

（1）创建maven工程

（2）在pom.xml文件中添加如下依赖

<groupId>junit</groupId>

<artifactId>junit</artifactId>

<version>RELEASE</version>

</dependency>

<groupId>org.apache.logging.log4j</groupId>

</dependency>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-common</artifactId>

</dependency>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-client</artifactId>

</dependency>

<groupId>org.apache.hadoop</groupId>

<artifactId>hadoop-hdfs</artifactId>

</dependency>

</dependencies>

（2）在项目的src/main/resources目录下，新建一个文件，命名为“log4j.properties”，在文件中填入。

log4j.rootLogger=INFO, stdout log4j.appender.stdout=org.apache.log4j.ConsoleAppender log4j.appender.stdout.layout=org.apache.log4j.PatternLayout log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] – %m%n log4j.appender.logfile=org.apache.log4j.FileAppender log4j.appender.logfile.File=target/spring.log log4j.appender.logfile.layout=org.apache.log4j.PatternLayout log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] – %m%n

4．编写程序

（1）编写Mapper类

package com.atguigu.mapreduce; import java.io.IOException; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Mapper; public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable>{ Text k = new Text(); IntWritable v = new IntWritable(1); @Override protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { // 1 获取一行 String line = value.toString(); // 2 切割 String[] words = line.split(” “); // 3 输出 for (String word : words) { k.set(word); context.write(k, v); } } }

（2）编写Reducer类

package com.atguigu.mapreduce.wordcount; import java.io.IOException; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Reducer; public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable>{ int sum; IntWritable v = new IntWritable(); @Override protected void reduce(Text key, Iterable<IntWritable> values,Context context) throws IOException, InterruptedException { // 1 累加求和 sum = 0; for (IntWritable count : values) { sum += count.get(); } // 2 输出 v.set(sum); context.write(key,v); } }

（3）编写Driver驱动类

package com.atguigu.mapreduce.wordcount; import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; public class WordcountDriver { public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException { // 1 获取配置信息以及封装任务 Configuration configuration = new Configuration(); Job job = Job.getInstance(configuration); // 2 设置jar加载路径 job.setJarByClass(WordcountDriver.class); // 3 设置map和reduce类 job.setMapperClass(WordcountMapper.class); job.setReducerClass(WordcountReducer.class); // 4 设置map输出 job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(IntWritable.class); // 5 设置最终输出kv类型 job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); // 6 设置输入和输出路径 FileInputFormat.setInputPaths(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); // 7 提交 boolean result = job.waitForCompletion(true); System.exit(result ? 0 : 1); } }

5．本地测试

（1）如果电脑系统是win7的就将win7的hadoop jar包解压到非中文路径，并在Windows环境上配置HADOOP_HOME环境变量。如果是电脑win10操作系统，就解压win10的hadoop jar包，并配置HADOOP_HOME环境变量。

注意：win8电脑和win10家庭版操作系统可能有问题，需要重新编译源码或者更改操作系统。

（2）在Eclipse/Idea上运行程序

6．集群上测试

（0）用maven打jar包，需要添加的打包插件依赖

注意：标记红颜色的部分需要替换为自己工程主类

<build>

<artifactId>maven-compiler-plugin</artifactId>

</configuration>

</plugin>

<artifactId>maven-assembly-plugin </artifactId>

<descriptorRef>jar-with-dependencies</descriptorRef>

</descriptorRefs>

<mainClass>com.atguigu.mr.WordcountDriver</mainClass>

</manifest>

</archive>

</configuration>

<id>make-assembly</id>

<phase>package</phase>

<goals>

<goal>single</goal>

</goals>

</execution>

</executions>

</plugin>

</plugins>

</build>

注意：如果工程上显示红叉。在项目上右键->maven->update project即可。

（1）将程序打成jar包，然后拷贝到Hadoop集群中

步骤详情：右键->Run as->maven install。等待编译完成就会在项目的target文件夹中生成jar包。如果看不到。在项目上右键-》Refresh，即可看到。修改不带依赖的jar包名称为wc.jar，并拷贝该jar包到Hadoop集群。

（2）启动Hadoop集群

（3）执行WordCount程序

[atguigu@hadoop102 software]$ hadoop jar wc.jar

com.atguigu.mr.WordcountDriver /user/atguigu/input /user/atguigu/output 注意：com.atguigu.mr.WordcountDriver要和自己工程的全类名一致。

大数据培训课程WordCount案例实操相关推荐

大数据培训课程数据清洗案例实操-简单解析版
数据清洗(ETL) 在运行核心业务MapReduce程序之前,往往要先对数据进行清洗,清理掉不符合用户要求的数据.清理的过程往往只需要运行Mapper程序,不需要运行Reduce程序.大数据培训数据 ...
大数据之Spark案例实操完整使用(第六章)
大数据之Spark案例实操完整使用一.案例一 1.准备数据 2.需求 1:Top10 热门品类 3.需求说明方案一. 实现方案二实现方案三二 .需求实现 1.需求 2:Top10 热门品类中每 ...
MapReduce入门（一）—— MapReduce概述 + WordCount案例实操
MapReduce入门(一)-- MapReduce概述文章目录 MapReduce入门(一)-- MapReduce概述 1.1 MapReduce 定义 1.2 MapReduce 优缺点 1. ...
2. WordCount案例实操
文章目录 WordCount案例实操 1. 官方WordCount源码 2. 常用数据序列化类型 3. MapReduce编程规范 3.1 Mapper阶段 3.2 Reducer阶段 3.3 Dri ...
大数据框架介绍与实操
文章目录一. 大数据开源框架汇总简介 1.1 hadoop 1.2 hdfs 1.3 yarn 1.4 mapreduce 1.5 spark 1.6 hbase 1.7 zookeeper 1.8 ...
尚硅谷大数据技术Spark教程-笔记09【SparkStreaming（概念、入门、DStream入门、案例实操、总结）】
尚硅谷大数据技术-教程-学习路线-笔记汇总表[课程资料下载] 视频地址:尚硅谷大数据Spark教程从入门到精通_哔哩哔哩_bilibili 尚硅谷大数据技术Spark教程-笔记01[SparkCore ...
collect()案例和count()案例_大数据培训课程
collect()案例作用:在驱动程序中,以数组的形式返回数据集的所有元素. 需求:创建一个RDD,并将RDD内容收集到Driver端打印 (1)创建一个RDD scala> val rdd ...
countByKey()案例和foreach(func)案例_大数据培训课程
12 countByKey()案例作用:针对(K,V)类型的RDD,返回一个(K,Int)的map,表示每一个key对应的元素个数. 需求:创建一个PairRDD,统计每种key的个数 (1)创建一 ...
first()案例和take(n)案例_大数据培训课程
first()案例作用:返回RDD中的第一个元素需求:创建一个RDD,返回该RDD中的第一个元素 (1)创建一个RDD scala> val rdd = sc.parallelize(1 t ...

大数据培训课程WordCount案例实操

WordCount案例实操

大数据培训课程WordCount案例实操相关推荐

最新文章

热门文章