Getting Started with Hadoop (23): A MapReduce Program for Finding the Most Frequent Word

1. Introduction

Count the words in a set of files, find the word that occurs most often, and write it to a file on HDFS.

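2. Example

(1) Example description
Given three files, each containing several words separated by whitespace, find the word that occurs most often across all of them.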

Sample input:
1) file1:

MapReduce is simple

2) file2:

MapReduce is powerful is simple

3) file3:

Hello MapReduce bye MapReduce

Expected output:

MapReduce      4

(2) Problem analysis
To find the most frequent word, only two pieces of information matter: the word and its frequency.
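Before moving to MapReduce, the whole computation can be sketched in plain Java on the sample input. This is only an illustration under the assumption that everything fits in memory; the class name LocalMaxWord and its method are hypothetical and are not part of the job described here.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch (no Hadoop): count every word, then pick the largest count.
class LocalMaxWord {

    // Returns the most frequent word with its count, or null if there are no words.
    static Map.Entry<String, Integer> mostFrequent(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Split on whitespace, as the job's mapper does.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                counts.merge(tokenizer.nextToken(), 1, Integer::sum);
            }
        }
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (best == null || entry.getValue() > best.getValue()) {
                best = entry;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The three sample files, one line each.
        List<String> lines = Arrays.asList(
                "MapReduce is simple",
                "MapReduce is powerful is simple",
                "Hello MapReduce bye MapReduce");
        Map.Entry<String, Integer> best = mostFrequent(lines);
        System.out.println(best.getKey() + "\t" + best.getValue());
    }
}
```

On the sample input, "MapReduce" occurs 4 times and the runner-up "is" occurs 3 times, so the result matches the expected output.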

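(3) Implementation steps

1) Map step

First, the default TextInputFormat class processes the input files, yielding each line's byte offset as the key and its content as the value. The map method then parses each <key, value> pair, splits the line into words on whitespace, and emits <word, 1> for every word. That captures the two pieces of information the job needs: the word and its frequency.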
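The Map step can be sketched without Hadoop as a function from one line to a list of (word, 1) pairs. The class MapStepSketch and the method mapLine are hypothetical names used only for illustration; the real mapper writes to a Hadoop Context instead of returning a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Plain-Java sketch of the Map step: one input line in, (word, 1) pairs out.
class MapStepSketch {

    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // StringTokenizer splits on whitespace by default; a blank line yields no tokens.
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The single line of file2 from the sample input.
        System.out.println(mapLine("MapReduce is powerful is simple"));
    }
}
```

Note that "is" is emitted twice for this line, each time with the value 1; adding the values up is left to the later steps.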

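2) Combine step
After the map method runs, the Combine step sums the values that share a key within one map task's output. This gives a partial count for each word, which is passed on as input to the Reduce step.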
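As a rough illustration of the Combine step, the sketch below sums the 1s emitted for one map task's words. CombineStepSketch and combine are hypothetical names, and a TreeMap stands in for Hadoop's sorted, grouped keys. Hadoop may run a combiner zero, one, or several times, so the result is only a partial count and the reducer must still add the values up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the Combine step: local summation within one map task.
class CombineStepSketch {

    // Each element of emittedWords stands for one (word, 1) pair from the mapper.
    static Map<String, Integer> combine(List<String> emittedWords) {
        Map<String, Integer> partial = new TreeMap<>();
        for (String word : emittedWords) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        // Words emitted by the mapper for file2 in the sample input.
        System.out.println(combine(Arrays.asList("MapReduce", "is", "powerful", "is", "simple")));
    }
}
```

For file2 this collapses five pairs into four, with "is" carrying the partial count 2, so less data crosses the network to the reducer.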

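3) Reduce step
After the two steps above, the Reduce step sums the values for each key to get that word's total frequency, keeps track of the word with the largest total, and writes only that word once all keys have been processed. This relies on the job running a single reduce task, which is Hadoop's default, so that one reducer sees every word.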
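The running-maximum logic of the Reduce step can be sketched in plain Java as follows. ReduceStepSketch is a hypothetical stand-in for the reducer: reduce is called once per key, and cleanup is called once after the last key. The grouped input assumes each sample file was handled by its own map task and combined once.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of the reducer's running-maximum logic (no Hadoop types).
class ReduceStepSketch {
    private String word = null; // best word seen so far
    private int count = 0;      // total frequency of that word

    // Called once per key with all partial counts for that key.
    void reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        // Strict "<" means that on a tie the earlier key is kept.
        if (word == null || count < sum) {
            word = key;
            count = sum;
        }
    }

    // Called once after the last key; returns the single output line, or null if no key was seen.
    String cleanup() {
        return word == null ? null : word + "\t" + count;
    }

    public static void main(String[] args) {
        // Partial counts per word, grouped and sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("Hello", Arrays.asList(1));
        grouped.put("MapReduce", Arrays.asList(1, 1, 2));
        grouped.put("bye", Arrays.asList(1));
        grouped.put("is", Arrays.asList(1, 2));
        grouped.put("powerful", Arrays.asList(1));
        grouped.put("simple", Arrays.asList(1, 1));

        ReduceStepSketch reducer = new ReduceStepSketch();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            reducer.reduce(entry.getKey(), entry.getValue());
        }
        System.out.println(reducer.cleanup());
    }
}
```

Because nothing is written until cleanup, the output contains exactly one line no matter how many distinct words the input holds.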

(4) Code implementation

package com.mk.mapreduce;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.util.StringTokenizer;

public class MaxWord {

    // Emits <word, 1> for every whitespace-separated word in a line.
    public static class MaxWordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private final Text newKey = new Text();
        private final IntWritable newValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            if (StringUtils.isBlank(value.toString())) {
                System.out.println("blank line");
                return;
            }
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                String word = tokenizer.nextToken();
                newKey.set(word);
                context.write(newKey, newValue);
            }
        }
    }

    // Sums the counts of each word locally to reduce the data sent to the reducer.
    public static class MaxWordCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable newValue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable v : values) {
                count += v.get();
            }
            newValue.set(count);
            context.write(key, newValue);
        }
    }

    // Tracks the word with the largest total count and writes it once in cleanup().
    public static class MaxWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private String word = null;
        private int count = 0;

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int c = 0;
            for (IntWritable v : values) {
                c += v.get();
            }
            // Strict "<": on a tie, the first word seen is kept.
            if (word == null || count < c) {
                word = key.toString();
                count = c;
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            if (word != null) {
                context.write(new Text(word), new IntWritable(count));
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        String uri = "hdfs://192.168.150.128:9000";
        String input = "/maxWord/input";
        String output = "/maxWord/output";

        Configuration conf = new Configuration();
        // Needed when submitting from Windows to a cluster running on Linux.
        if (System.getProperty("os.name").toLowerCase().contains("win")) {
            conf.set("mapreduce.app-submission.cross-platform", "true");
        }

        // Delete any previous output directory, otherwise the job refuses to start.
        FileSystem fileSystem = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(output);
        fileSystem.delete(path, true);

        // Job.getInstance replaces the deprecated Job constructor.
        Job job = Job.getInstance(conf, "MaxWord");
        job.setJar("./out/artifacts/hadoop_test_jar/hadoop-test.jar");
        job.setJarByClass(MaxWord.class);
        job.setMapperClass(MaxWordMapper.class);
        job.setCombinerClass(MaxWordCombiner.class);
        job.setReducerClass(MaxWordReducer.class);
        // The global maximum needs a single reducer; 1 is the default, stated here explicitly.
        job.setNumReduceTasks(1);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPaths(job, uri + input);
        FileOutputFormat.setOutputPath(job, new Path(uri + output));

        boolean ret = job.waitForCompletion(true);
        System.out.println(job.getJobName() + "-----" + ret);
    }
}
