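Analyzing Data with Hadoop

Write the program against the MapReduce programming model, test it locally, then deploy it to a cluster.

Two phases:

Both phases take key-value pairs as input and produce key-value pairs as output. For the map input here, the key is the offset of the beginning of the line from the beginning of the file.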

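Map phase: data preparation
- Drop corrupt data by filtering out records that are missing, suspect, or erroneous.
- Extract the year and the air temperature and emit them as output.
- The output of the map function is processed by the MapReduce framework before being sent to the reduce function.
Reduce phase: algorithm design
- Find the maximum temperature for each year.
- The framework sorts and groups the map output by key, so the input is: key = year, value = all temperature readings for that year.
- Output: (year, maximum temperature for that year)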
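
The two phases above can be imitated in plain Java to make the data flow concrete. This is only a sketch under stated assumptions: it uses no Hadoop classes, the class and method names (MaxTemperatureFlow, shuffle, reduce, run) are hypothetical, and the (year, temperature) pairs are illustrative sample values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of the map -> shuffle -> reduce flow; no Hadoop classes involved.
class MaxTemperatureFlow {

  // Stand-in for the framework's shuffle step: groups the map output
  // (year, temperature) pairs by year, with the keys in sorted order.
  static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapOutput) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (Map.Entry<String, Integer> pair : mapOutput) {
      grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
    }
    return grouped;
  }

  // Stand-in for the reduce function: picks the maximum temperature of one year.
  static int reduce(List<Integer> temperatures) {
    int max = Integer.MIN_VALUE;
    for (int t : temperatures) {
      max = Math.max(max, t);
    }
    return max;
  }

  // Runs shuffle and reduce over the map output and returns (year, max temperature).
  static Map<String, Integer> run(List<Map.Entry<String, Integer>> mapOutput) {
    Map<String, Integer> result = new TreeMap<>();
    for (Map.Entry<String, List<Integer>> group : shuffle(mapOutput).entrySet()) {
      result.put(group.getKey(), reduce(group.getValue()));
    }
    return result;
  }

  public static void main(String[] args) {
    // Illustrative map output: one (year, temperature) pair per valid record.
    List<Map.Entry<String, Integer>> mapOutput = List.of(
        Map.entry("1950", 0),
        Map.entry("1950", 22),
        Map.entry("1950", -11),
        Map.entry("1949", 111),
        Map.entry("1949", 78));
    System.out.println(run(mapOutput)); // prints {1949=111, 1950=22}
  }
}
```

In a real job the grouping and sorting are done by the MapReduce framework between the two phases; only the map and reduce functions are written by hand.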

Java MapReduce

The map function is implemented by the Mapper class

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // We write an output record only if the temperature is present
    // and the quality code indicates the temperature reading is OK.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The Mapper class is a generic type, with four formal type parameters that specify the input
key, input value, output key, and output value types of the map function.

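Input key: a long integer offset (see the figure above), LongWritable

Input value: a line of text, Text

Output key: the year, Text

Output value: the air temperature, IntWritable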

Hadoop provides its own set of basic types, all of which live in the org.apache.hadoop.io package:

LongWritable --> Java Long

Text --> Java String

IntWritable --> Java Integer

The map() method is passed a key and a value. The Context instance is used to write the output.
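
The parsing done inside map() can be tried without a Hadoop installation. The sketch below copies the column offsets and the quality check from the mapper above into a plain static method. RecordParser and syntheticLine are hypothetical names introduced for illustration, and the lines it builds are synthetic: they only fill the columns that the parser reads and are not real weather records.

```java
// Plain-Java sketch of the parsing done in map(); offsets are copied from the mapper.
class RecordParser {

  static final int MISSING = 9999;

  // Returns "year<TAB>temperature" for a valid record, or null when the reading
  // is missing or its quality code is not one of 0, 1, 4, 5, 9.
  static String parse(String line) {
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') {
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      return year + "\t" + airTemperature;
    }
    return null;
  }

  // Hypothetical helper: builds a synthetic 93-character line in which only the
  // year, the signed temperature, and the quality code columns are meaningful.
  static String syntheticLine(String year, String signedTemp, String quality) {
    StringBuilder sb = new StringBuilder("0".repeat(93));
    sb.replace(15, 19, year);       // 4-character year
    sb.replace(87, 92, signedTemp); // sign plus 4 digits, for example "+0111"
    sb.replace(92, 93, quality);    // 1-character quality code
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(parse(syntheticLine("1949", "+0111", "1"))); // year 1949, temperature 111
    System.out.println(parse(syntheticLine("1950", "-0011", "1"))); // year 1950, temperature -11
    System.out.println(parse(syntheticLine("1950", "+9999", "1"))); // null: missing reading
    System.out.println(parse(syntheticLine("1950", "+0022", "2"))); // null: rejected quality code
  }
}
```

Returning null stands in for "emit nothing"; in the real mapper a rejected record simply results in no call to context.write().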

The reduce function

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

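The Reducer class also takes four formal type parameters. Its input types must match the output types of the map function, here Text and IntWritable.

Code to run the job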

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      // Both the input path and the output path must be given.
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    // In Hadoop 2 and later this constructor is deprecated in favor of Job.getInstance().
    Job job = new Job();
    // Hadoop uses the class passed in here to locate the JAR file that contains it.
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");

    // Input path: a single file or a directory. Call addInputPath()
    // more than once to read from several paths.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // Output path: the directory must not exist before the job runs,
    // which protects against accidental data loss.
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);   // map class
    job.setReducerClass(MaxTemperatureReducer.class); // reduce class

    job.setOutputKeyClass(Text.class);          // output key type
    job.setOutputValueClass(IntWritable.class); // output value type

    // The boolean result indicates whether the job succeeded or failed.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

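A Job object forms the specification of the job and gives you control over how the whole job is run.

A test run

A small dataset is enough for a test run, which is done on a single machine in standalone mode.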

% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop MaxTemperature input/ncdc/sample.txt output
# These commands must be run from the directory that contains the example code.
attempt_local26392882_0001_m_000000_0
# ID of the map task attempt
attempt_local26392882_0001_r_000000_0
# ID of the reduce task attempt
# The last section of the output, titled “Counters,” shows the statistics that Hadoop
# generates for each job it runs. These are very useful for checking whether the amount of
# data processed is what you expected. For example, we can follow the number of records
# that went through the system: five map input records produced five map output records
# (since the mapper emitted one output record for each valid input record), then five
# reduce input records in two groups (one for each unique key) produced two reduce output
# records.
% cat output/part-r-00000
1949 111
1950 22
# The final output
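
The record counts described in the Counters comment above can be reproduced with a small plain-Java tally. This is a sketch under stated assumptions, not a way to read Hadoop's counters: RecordCounts and tally are hypothetical names, the labels are plain strings chosen for this example, the sample pairs are illustrative, and every input record is assumed to be valid.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java tally of the record counts discussed above; it does not read Hadoop counters.
class RecordCounts {

  // Assumes every input record is valid, so the mapper emits exactly one
  // (year, temperature) pair per input record.
  static Map<String, Integer> tally(List<Map.Entry<String, Integer>> records) {
    // Reduce side: one group per distinct year, keeping the maximum of each group.
    Map<String, Integer> maxByYear = new TreeMap<>();
    for (Map.Entry<String, Integer> r : records) {
      maxByYear.merge(r.getKey(), r.getValue(), Integer::max);
    }

    Map<String, Integer> counts = new LinkedHashMap<>();
    counts.put("map input records", records.size());
    counts.put("map output records", records.size());
    counts.put("reduce input records", records.size());
    counts.put("reduce input groups", maxByYear.size());
    counts.put("reduce output records", maxByYear.size());
    return counts;
  }

  public static void main(String[] args) {
    // Illustrative sample: five valid records spread over two years.
    List<Map.Entry<String, Integer>> sample = List.of(
        Map.entry("1950", 0),
        Map.entry("1950", 22),
        Map.entry("1950", -11),
        Map.entry("1949", 111),
        Map.entry("1949", 78));
    tally(sample).forEach((name, n) -> System.out.println(name + " = " + n));
  }
}
```

With five valid records over two years, the tally gives five records at each of the first three stages and two at the last two, which matches the counts quoted in the comment.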

Hadoop: The Definitive Guide, 3rd Edition (Chinese edition)