Before reading this post, please go through my previous post, "How MapReduce Algorithm Works", to get some idea of the MapReduce algorithm. That post explains, in theory, how MapReduce performs word counting.

And if you are not familiar with basic HDFS commands, please go through my post "Hadoop HDFS Basic Developer Commands" to get some basic knowledge of how to execute HDFS commands in the Cloudera environment.

In this post, we are going to develop the same word-counting program using the Hadoop 2 MapReduce API and test it in the Cloudera environment.

MapReduce WordCounting Example

We need to write the following three programs to develop and test the MapReduce WordCount example:

  1. Mapper Program
  2. Reducer Program
  3. Client Program

NOTE:
To develop MapReduce programs, there are two versions of the MapReduce (MR) API:

  1. One from Hadoop 1.x (the old MapReduce API)
  2. Another from Hadoop 2.x (the new MapReduce API)

In Hadoop 2.x, the old MapReduce API is deprecated, so we are going to concentrate on the new MapReduce API to develop this WordCount example.

The Cloudera environment already provides an Eclipse IDE set up with the Hadoop 2.x API, so it is very easy to develop and test MapReduce programs with this setup.

To develop the WordCount MapReduce application, use the following steps:

  • Open the default Eclipse IDE provided by the Cloudera environment.
  • We can use an already created project or create a new Java project.
  • For simplicity, I'm going to use the existing "training" Java project. All required Hadoop 2.x jars have already been added to this project's classpath, so it is ready to use.
  • Create the WordCount Mapper program.
  • Create the WordCount Reducer program.
  • Create the WordCount Client program to test this application.

Let's start developing these three programs in the next sections.

Mapper Program

Create a "WordCountMapper" Java class that extends the Mapper class, as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // key: the byte offset of the line in the input file; value: the line itself
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit the whole line as the key, paired with a count of 1
        String w = value.toString();
        context.write(new Text(w), new IntWritable(1));
    }
}

Code Explanation:

  • Our WordCountMapper class extends the Hadoop 2 MapReduce API class "Mapper".
  • The Mapper class is defined with generic types as Mapper<LongWritable, Text, Text, IntWritable>.
  • Here, in <LongWritable, Text, Text, IntWritable>:

  1. The first two, <LongWritable, Text>, are the input data types of our WordCount Mapper program.
  2. For example, we give it a file (a huge amount of data, in any format). The Mapper reads this file line by line and pairs each line with a unique number (the line's byte offset in the file):

    <Unique_Long_Number, Line_Read_From_Input_File>

    In the Hadoop MapReduce API, this corresponds to <LongWritable, Text>.

  3. The last two, <Text, IntWritable>, are the output data types of our WordCount Mapper program.
  4. For example, our WordCount Mapper program gives output as shown below:

    <Unique_Word_From_Input_File, Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  • We have implemented the Mapper's map() method and provided our mapping logic in it; a word-splitting variant is sketched after this list.
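
Note that map() above emits the entire line as the key, so this particular job counts how many times each distinct line occurs. If you want counts of individual words instead, a common variant tokenizes each line first. The following drop-in replacement for the map() method in WordCountMapper is a sketch of that variant, not part of the original article's code:

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on whitespace and emit <word, 1> for every token
        java.util.StringTokenizer tokenizer = new java.util.StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            context.write(new Text(tokenizer.nextToken()), new IntWritable(1));
        }
    }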

Reducer Program

Create a "WordCountReducer" Java class that extends the Reducer class, as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // key: a unique word; values: all counts emitted for that word by the mappers
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts for this word and emit the total
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

Code Explanation:

  • Our WordCountReducer class extends the Hadoop 2 MapReduce API class "Reducer".
  • The Reducer class is defined with generic types as Reducer<Text, IntWritable, Text, IntWritable>.
  • Here, in <Text, IntWritable, Text, IntWritable>:

  1. The first two, <Text, IntWritable>, are the input data types of our WordCount Reducer program.
  2. For example, our Mapper program produces <Text, IntWritable> output, which becomes the input of the Reducer program:

    <Unique_Word_From_Input_File, Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  3. The last two, <Text, IntWritable>, are the output data types of our WordCount Reducer program.
  4. For example, our WordCount Reducer program gives output as shown below:

    <Unique_Word_From_Input_File, Total_Word_Count>

    In the Hadoop MapReduce API, this corresponds to <Text, IntWritable>.

  • We have implemented the Reducer's reduce() method and provided our reduce logic in it; a standalone walk-through of that logic follows.
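
To make the reduce step concrete, here is a minimal standalone sketch (plain Java, no Hadoop required) of what happens for one key once the shuffle phase has grouped the mapper outputs; the word "hadoop" and its three counts are made up for illustration:

import java.util.Arrays;
import java.util.List;

public class ReduceStepDemo {
    public static void main(String[] args) {
        // The framework groups all counts emitted for one key, e.g.
        // key = "hadoop", values = [1, 1, 1] if it was emitted three times.
        List<Integer> values = Arrays.asList(1, 1, 1);
        int sum = 0;
        for (int val : values) {
            sum += val; // same summing loop as WordCountReducer.reduce()
        }
        System.out.println("hadoop\t" + sum); // prints: "hadoop\t3"
    }
}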

Client Program

Create a "WordCountClient" Java class with a main() method, as shown below:

package com.journaldev.hadoop.mrv1.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountClient {

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountClient.class);

        // Output key/value types of the job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Wire up our Mapper and Reducer classes
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Input/output formats: read and write plain text
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Input and output paths come from the command line
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean status = job.waitForCompletion(true);
        if (status) {
            System.exit(0);
        } else {
            System.exit(1);
        }
    }
}

Code Explanation:

  • The Hadoop 2 MapReduce API provides the "Job" class in the "org.apache.hadoop.mapreduce" package.
  • The Job class is used to create jobs (Map/Reduce jobs) that perform our word-counting task.
  • The client program uses the Job object's setter methods to configure all the MapReduce components: the Mapper, the Reducer, the input data types, the output data types, and so on.
  • These jobs perform our word-counting mapping and reducing tasks.

NOTE:

  • As we discussed in my previous post, the MapReduce algorithm uses three functions: a Map function, a Combine function, and a Reduce function.
  • Looking at these three programs, you can see that we have developed only two of them: Map and Reduce. So what about the Combine function?
  • It means we have used the default Combine behavior available in the Hadoop 2 MapReduce API; a sketch of wiring in a custom combiner follows this note.
  • We will discuss how to develop a Combine function in a coming post.
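
For reference, plugging in a combiner is a one-line change in the client program. Because summing counts is associative, a word-count job can reuse the Reducer class as the combiner; this snippet is a sketch of that option, not code from this article:

    // In WordCountClient.main(), after job.setReducerClass(...):
    // pre-aggregate counts on the map side with the same summing logic
    job.setCombinerClass(WordCountReducer.class);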

Now we have developed all the required components (programs). It's time to test them.

Test MapReduce WordCounting Example

Our WordCount project's final structure contains the three classes developed above, WordCountMapper, WordCountReducer, and WordCountClient, in the com.journaldev.hadoop.mrv1.wordcount package.

Please use the following steps to test our MapReduce application.

  • Create our WordCount application JAR file using the Eclipse IDE.
  • Execute the following "hadoop" command to run our WordCount application.

Syntax:

hadoop jar <our-Jar-file-path> <Client-program> <Input-Path> <Output-Path>

Let us assume that we have already created the "/ram/mrv1" folder structure in the Hadoop HDFS file system and placed the input file there; the job creates the "/ram/mrv1/output" folder itself and fails if that folder already exists. If you have not done that, please go through my previous post, "Hadoop HDFS Basic Developer Commands", to create it.
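
If the input file is not in HDFS yet, commands along the following lines would stage it; the local path of the CSV file is an assumption for illustration:

hadoop fs -mkdir -p /ram/mrv1
hadoop fs -put /home/cloudera/NASDAQ_daily_prices_C.csv /ram/mrv1/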

Example:

hadoop jar /home/cloudera/JDWordCountMapReduceApp.jar com.journaldev.hadoop.mrv1.wordcount.WordCountClient /ram/mrv1/NASDAQ_daily_prices_C.csv /ram/mrv1/output

NOTE:
The command above is long; if it appears wrapped across multiple lines for readability, please type it as a single line when you run it.

The command prints a job log to the console. By going through this log, we can observe how the Map and Reduce tasks work to solve our word-counting problem.

  • Execute the following "hadoop" command to view the output directory content:
  • hadoop fs -ls /ram/mrv1/output/

    It lists the content of the "/ram/mrv1/output/" directory.
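
A successful run typically leaves an empty _SUCCESS marker file plus one part-r-NNNNN file per reducer, so the listing looks roughly like this (owner, size, and timestamp fields are illustrative):

Found 2 items
-rw-r--r--   1 cloudera supergroup        0 ... /ram/mrv1/output/_SUCCESS
-rw-r--r--   1 cloudera supergroup   123456 ... /ram/mrv1/output/part-r-00000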

  • Execute the following "hadoop" command to view our WordCount application output:
  • hadoop fs -cat /ram/mrv1/output/part-r-00000

    This command displays the WordCount application output. As my output file is too big, I am not able to show my file output here.
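
For reference, TextOutputFormat writes one key-value pair per line, separated by a tab, so each line of part-r-00000 has the shape below. With the Mapper shown earlier, each key is a full input line; these keys and counts are invented for illustration:

first_unique_line	1
another_unique_line	3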

NOTE:
Here we have used some Hadoop HDFS commands to run and test our WordCount application. If you are not familiar with HDFS commands, please go through my "Hadoop HDFS Basic Developer Commands" post.

That's all about the Hadoop 2.x MapReduce WordCounting example. We will develop some more useful MapReduce programs in coming posts.

Please drop me a comment if you like my post or have any issues/suggestions.

Translated from: https://www.journaldev.com/8921/hadoop2-mapreduce-wordcounting-example
