Big Data Technology: MapReduce Word Count

Note: this write-up follows Lin Ziyu's tutorial. For details, see:

MapReduce Programming Practice (Hadoop 3.1.3), Xiamen University Database Lab blog

I. Objectives

1. Understand the processing logic of the MapReduce module in Hadoop.

2. Become familiar with MapReduce programming.

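II. Tasks

1. Create a folder named input, create three text files with the specified file names inside it, and write the specified content to each file.

2. Start Hadoop in pseudo-distributed or fully distributed mode and upload the input folder to HDFS.

3. Write a MapReduce program that counts how many times each word occurs. Save the result to the output folder on HDFS, then retrieve it.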

III. Environment

1. Operating system: Ubuntu 20.04 (64-bit)

2. Virtualization software: VMware Workstation 15.5

3. JDK: jdk-8u202-linux-x64.tar.gz

4. Hadoop version: 3.3.1

IV. Procedure

1. Creating the project and writing the code in Eclipse

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Let Hadoop consume its generic options; what remains are the input and output paths.
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length < 2) {
            System.err.println("Usage: wordcount <in> [<in>...] <out>");
            System.exit(2);
        }

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // The reducer doubles as a combiner to pre-sum counts on the map side.
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Every argument except the last is an input path; the last is the output path.
        for (int i = 0; i < otherArgs.length - 1; ++i) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    // Emits (word, 1) for every whitespace-separated token in an input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Sums all counts received for a word and emits (word, total).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}
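To see what the WordCount job above computes without starting a cluster, the three phases can be imitated in plain Java. This is only a minimal sketch under stated assumptions: the class name LocalWordCount, its method names, and the two sample lines are hypothetical and not part of the original experiment, and the in-memory grouping merely stands in for Hadoop's shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical local stand-in for the Hadoop job; no Hadoop classes are used.
public class LocalWordCount {

    // Map phase: emit one (word, 1) pair per whitespace-separated token,
    // using the same StringTokenizer splitting as TokenizerMapper.
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add(new AbstractMap.SimpleEntry<>(itr.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle phase (simulated): group emitted values by key, keys in sorted order.
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values of one key, as IntSumReducer does.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run map, shuffle and reduce over all input lines.
    public static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            counts.put(e.getKey(), reduce(e.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample input invented for illustration.
        List<String> lines = Arrays.asList("hello world", "hello hadoop");
        System.out.println(wordCount(lines)); // prints {hadoop=1, hello=2, world=1}
    }
}
```

Because StringTokenizer splits on whitespace only, punctuation stays attached to a token, so "hello," and "hello" would be counted as two different words both here and in the Hadoop job.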

2. Compilation

3. Running the program

4. Results

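V. Completion Status and Discussion

1. What was completed

All of the operations described above were carried out.

2. Problems and discussion

At first the job ran too slowly to produce a result. On inspection, the cause was that too few resources had been allocated in the machine's hardware configuration; after the allocation was increased, the job ran successfully.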

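VI. Lessons Learned

1. When a MapReduce program is run from Eclipse, it reads the Advanced parameters of the Hadoop-Eclipse-Plugin as its Hadoop runtime parameters. If these are left unmodified, the defaults are effectively standalone (non-distributed) settings, so the program reads local directories instead of HDFS directories and reports that the input path does not exist. Copying the Hadoop configuration files into the project's src directory overrides these parameters so that the program runs correctly.

2. Before running WordCount.jar again, the output directory in HDFS must be deleted first; otherwise the job fails with an error.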
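On HDFS the cleanup is normally done with the hdfs command line (hdfs dfs -rm -r output) or with Hadoop's FileSystem API. In standalone mode, however, the output directory lives on the local file system, and there the same "delete before rerun" step can be sketched with java.nio alone. The following is an illustrative sketch, not the tutorial's method: the class name CleanOutputDir and the method deleteIfPresent are hypothetical, and the code touches only local paths, never HDFS.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: removes a LOCAL output directory before a rerun.
public class CleanOutputDir {

    // Recursively delete dir if it exists; returns true if it was removed.
    public static boolean deleteIfPresent(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory that mimics a previous run's output.
        Path output = Files.createTempDirectory("output");
        Files.write(output.resolve("part-r-00000"), "hello\t2\n".getBytes());

        System.out.println("deleted: " + deleteIfPresent(output));
        System.out.println("still exists: " + Files.exists(output));
    }
}
```

The demo works on a temporary directory so that running it cannot remove real results.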
