Three Small MapReduce Cases

First, a quick review of the map phase in the WordCount example.

And a quick review of the reduce phase in the WordCount example.
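The original screenshots are not reproduced here; as a reminder, the following is a minimal sketch of what the WordCount Mapper and Reducer look like (the class names and the whitespace splitting are illustrative, not taken from the original screenshots):

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

class WordCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // split the line on spaces and emit (word, 1) for every word
        for (String word : value.toString().split(" ")) {
            context.write(new Text(word), new LongWritable(1L));
        }
    }
}

class WordCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
        // add up the 1s emitted for this word
        long sum = 0L;
        for (LongWritable value : values) {
            sum += value.get();
        }
        context.write(key, new LongWritable(sum));
    }
}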

1. IK Analyzer (counting occurrences of specified words in Romance of the Three Kingdoms)

Step 1: Find the IK analyzer dependency and add it to the project.

Step 2: Create a sub-module in the hadoop project and add the required dependencies.

Step 3: Run a quick test of the analyzer; the results are as follows:
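The test output screenshot is omitted; a quick standalone test of the analyzer could look like the sketch below (the class name and the sample sentence are illustrative, not from the original post):

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import java.io.IOException;
import java.io.StringReader;

public class IkTest {
    public static void main(String[] args) throws IOException {
        // tokenize a sample sentence with the IK analyzer in smart mode
        StringReader sr = new StringReader("滚滚长江东逝水,浪花淘尽英雄");
        IKSegmenter ikSegmenter = new IKSegmenter(sr, true);
        Lexeme lexeme;
        while ((lexeme = ikSegmenter.next()) != null) {
            // print each token the analyzer produces
            System.out.println(lexeme.getLexemeText());
        }
    }
}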

Step 4: Use a MapReduce program to count how many times "三国演义" appears in sgyy.txt stored in HDFS.

Romance of the Three Kingdoms is the first chapter-based novel in the history of Chinese literature, the founding work of the historical-romance genre, and the first full-length novel written by a literatus; in the Ming and Qing dynasties it was even praised as the "First Book of Genius". Romance of the Three Kingdoms (full title Sanguozhi Tongsu Yanyi, also known as Sanguozhi Yanyi) is a long chapter-based historical novel by the late-Yuan, early-Ming novelist Luo Guanzhong, who reworked Chen Shou's Records of the Three Kingdoms, Pei Songzhi's annotations, and folk legends of the Three Kingdoms into literary form. Together with Journey to the West, Water Margin, and Dream of the Red Chamber it is counted among the Four Great Classical Novels of China. After the book was completed, several editions circulated, such as the Jiajing Renwu edition; in the late Ming and early Qing, Mao Zonggang reorganized the chapters, polished the wording, and replaced the verse, and his edition became the most refined and most widely circulated of them.

...and so on.

First: configure the dependencies. The first attempt failed, because the build was not configured to package the IKAnalyzer dependency into the jar. Below is the pom.xml configured to bundle the dependencies together with the code:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>java-hadoop</artifactId>
        <groupId>org.example</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>hadoop-sanguodemo</artifactId>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.janeluo/ikanalyzer -->
        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>3.3.0</version>
                <configuration>
                    <descriptorRefs>
                        <!-- descriptor that produces the jar bundled with all dependencies -->
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <!-- bind the assembly to the package phase so that mvn package builds it;
                     without this binding you would have to run the assembly goal manually -->
                <executions>
                    <execution>
                        <id>make-assemble</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Next: write the main program. The driver code in main is boilerplate that looks much the same in every job; the Mapper and Reducer it wires in are the key code.

package com.shujia;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;
import java.io.IOException;
import java.io.StringReader;
/**
 * @author WangTao
 * @date 2022/5/27 21:04
 */
class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // wrap the line of text in a Reader for the analyzer
        StringReader sr = new StringReader(value.toString());
        // tokenize the line with the IK analyzer (smart mode)
        IKSegmenter ikSegmenter = new IKSegmenter(sr, true);
        Lexeme lexeme = null;
        while ((lexeme = ikSegmenter.next()) != null) {
            String word = lexeme.getLexemeText();
            if ("三国演义".equals(word) || "曹操".equals(word)) {
                context.write(new Text(word), new LongWritable(1L));
            }
        }
    }
}
class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // add up the 1s emitted for this word
        Long sum = 0L;
        for (LongWritable value : values) {
            long l = value.get();
            sum += l;
        }
        context.write(key, new LongWritable(sum));
    }
}
/* Count how many times 三国演义 appears in sgyy.txt */
public class SanGuoCount {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setNumReduceTasks(2);
        job.setJobName("count occurrences of 三国演义 and 曹操 in the novel");
        job.setJarByClass(SanGuoCount.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Then package the code, upload the jar to the Linux machine, and run it from the command line:

hadoop jar <uploaded jar> <fully-qualified class containing the main method> <HDFS input path> <HDFS output path>
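For example (the HDFS paths below are illustrative, not taken from the original post; the jar name follows from the pom.xml above):

hadoop jar hadoop-sanguodemo-1.0-SNAPSHOT-jar-with-dependencies.jar com.shujia.SanGuoCount /data/sgyy.txt /output/sanguo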

The run output is as follows:

2. MapReduce Case

A friend-recommendation system

Fixed-category recommendation

Mozart ----> piano ----> Beethoven ----> the Fate Symphony

Data volume

QQ friend recommendation --->

each QQ account has about 200 friends

500 million QQ accounts

Approach:

Process the data line by line.

Give identical recommendation pairs the same key, so that the reduce phase can handle them together.

Data:

tom hello hadoop cat
world hello hadoop hive
cat tom hive
mr hive hello
hive cat hadoop world hello mr
hadoop tom hive world hello
hello tom world hive mr

Analysis:

In the map phase we enumerate, for every line, both the direct relations (the first user on the line paired with each of that user's friends) and the indirect relations (every pair formed from the remaining users on the line, who share that user as a common friend). The map output looks like tom:hadoop 1 for a direct relation and hello:hadoop -1 for an indirect one: 1 marks a direct friendship, -1 marks a potential (indirect) one. In the reduce phase, pairs that already contain a direct relation are filtered out of the recommendation (in the implementation below they are written with the marker 0, while purely indirect pairs, the recommendation candidates, are written with -1). A worked example follows.
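For instance, for the first data line tom hello hadoop cat, the mapper below emits the following key/value pairs (keys normalized by orderFriend so that the same pair always produces the same key):

tom:hello      1    (direct)
tom:hadoop     1    (direct)
tom:cat        1    (direct)
hello:hadoop  -1    (indirect)
hello:cat     -1    (indirect)
hadoop:cat    -1    (indirect)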

Implementation

Build configuration (pom.xml):

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>java-hadoop</artifactId>
        <groupId>org.example</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>anli-tuijian</artifactId>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
        </dependency>
    </dependencies>
</project>

Main program code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private Text mkey = new Text();
    private LongWritable mvalue = new LongWritable();

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // input line: tom hello hadoop cat
        /*
         * tom:hello 1
         * tom:hadoop 1
         * tom:cat 1
         *
         * hello:hadoop -1
         * hello:cat -1
         * hadoop:cat -1
         */
        // input line: hadoop tom hive world hello
        /*
         * hadoop:tom 1
         * hadoop:hive 1
         * hadoop:world 1
         * hello:hadoop 1
         */
        // convert the value to a String and split it into user names
        String line = value.toString();
        String[] strings = line.split(" ");
        for (int i = 1; i < strings.length; i++) {
            // direct friends: the first user paired with each friend on the line
            mkey.set(orderFriend(strings[0], strings[i]));
            mvalue.set(1L);
            context.write(mkey, mvalue);
            // indirect friends: every pair of friends that share this user
            for (int j = i + 1; j < strings.length; j++) {
                mkey.set(orderFriend(strings[i], strings[j]));
                mvalue.set(-1L);
                context.write(mkey, mvalue);
            }
        }
    }

    // order the two names consistently so the same pair always produces the same key
    public static String orderFriend(String f1, String f2) {
        int i = f1.compareTo(f2);
        if (i > 0) {
            return f1 + ":" + f2;
        } else {
            return f2 + ":" + f1;
        }
    }
}
class MyReduce extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // e.g. hadoop:hello 1
        //      hadoop:hello -1
        int flag = 0;
        Long sum = 0L;
        for (LongWritable value : values) {
            if (value.get() == 1) {
                // the pair contains at least one direct friendship
                flag = 1;
            }
            // running sum of the markers (not used in the output below)
            sum += value.get();
        }
        if (flag == 0) {
            // purely indirect pair: a recommendation candidate
            context.write(key, new LongWritable(-1L));
        } else {
            // already direct friends: not recommended
            context.write(key, new LongWritable(0L));
        }
    }
}

/**
 * @author WangTao
 * @date 2022/5/28 10:07
 */
/*
 * Work out how many direct-relation and indirect-relation pairs exist in the given data.
 * The pair is used as the key; a direct relation carries the value 1, an indirect relation the value -1.
 */
public class TuiJianDemo {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(TuiJianDemo.class);
        job.setNumReduceTasks(1);
        job.setJobName("friend recommendation job");
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReduce.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

The results are as follows:

3. PM2.5 Averages

The table is shown below; the files ending in .csv are comma-separated.
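The original table screenshot is not reproduced here. Judging from the mapper below, each row carries a date, a second field the code does not use (presumably the hour), the metric name, and then one reading per city, with city ids numbered from 1001. A made-up illustrative row could look like:

20220527,0,PM2.5,20,35,41,28,...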

Add the dependencies:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>java-hadoop</artifactId>
        <groupId>org.example</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>
    <artifactId>airPM25avg</artifactId>
    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
        </dependency>
    </dependencies>
</project>

The code is as follows:

package com.shujia.airPM25;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;
//20220527:1001 20
//20220527:1001 30
//20220527:1001 40
//...
class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // convert the line to a String and split the CSV row on commas
        String line = value.toString();
        String[] strings = line.split(",");
        // keep only the rows whose metric column is PM2.5
        if (strings.length >= 4 && "PM2.5".equals(strings[2])) {
            // columns 3 onward hold one reading per city; city ids start at 1001
            for (int i = 3, j = 1001; i < strings.length; i++, j++) {
                // clean the row: some hours have no recorded PM2.5 value, so treat blanks as 0
                if ("".equals(strings[i]) || strings[i] == null || " ".equals(strings[i])) {
                    strings[i] = "0";
                }
                // key: date plus city id (城市编号 = city id), value: the PM2.5 reading
                context.write(new Text("date:" + strings[0] + "-城市编号:" + j),
                        new LongWritable(Long.parseLong(strings[i])));
            }
        }
    }
}
class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
        // sum the hourly readings for this (date, city) key
        Long sum = 0L;
        for (LongWritable value : values) {
            long l = value.get();
            sum += l;
        }
        // divide by 24 to get the city's average PM2.5 for that day
        long avg = sum / 24;
        context.write(key, new LongWritable(avg));
    }
}
/**
 * @author WangTao
 * @date 2022/5/28 14:11
 */
public class AirPm25avg {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(AirPm25avg.class);
        job.setNumReduceTasks(2);
        job.setJobName("daily PM2.5 average for each city");
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}

Note: the args[] paths may point not only to a single file but also to a directory, as in the example below.
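A hypothetical invocation (the jar name follows from the pom.xml above; the paths are illustrative only), using a directory of daily .csv files as the input:

hadoop jar airPM25avg-1.0-SNAPSHOT.jar com.shujia.airPM25.AirPm25avg /data/air/ /output/pm25avg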

Results:
