[Hadoop: The Definitive Guide, 4th Edition] Chapter 4: Hadoop I/O [Notes + Code]

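4.1 Data Integrity

The usual way to detect corrupted data is to compute a checksum when the data first enters the system, and again after it has been transmitted. If the newly computed checksum does not exactly match the original, the data is considered corrupt.

Note that the checksum itself may be the corrupted part while the data is intact, but this is unlikely because the checksum is much smaller than the data.

A commonly used error-detecting code is CRC-32 (cyclic redundancy check), which computes a 32-bit integer checksum for input of any size.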
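To make the checksum idea concrete, here is a minimal sketch that uses the JDK's java.util.zip.CRC32 rather than any Hadoop class. The class name Crc32Demo and the sample input are chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Minimal sketch: compute a CRC-32 checksum and use it to detect corruption.
// Crc32Demo is an illustrative name, not part of Hadoop.
final class Crc32Demo {

    // Returns the CRC-32 of the given bytes as an unsigned 32-bit value held in a long.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "123456789".getBytes(StandardCharsets.US_ASCII);
        long stored = checksum(original);
        System.out.printf("checksum = %08x%n", stored); // prints "checksum = cbf43926"

        // Simulate a single bit flipped in transit.
        byte[] received = original.clone();
        received[0] ^= 0x01;
        System.out.println("corrupt = " + (checksum(received) != stored)); // prints "corrupt = true"
    }
}
```

The checksum is only 4 bytes regardless of the input size, which is why damage to the checksum itself is far less likely than damage to the data.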

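4.1.1 Data Integrity in HDFS

A client writing data sends it to a pipeline of datanodes, and the last datanode in the pipeline verifies the checksum. If it detects an error, the client receives a ChecksumException, which is a subclass of IOException.

When a client reads data from a datanode, it verifies checksums as well, comparing them with the ones stored at the datanode. Each datanode keeps a persistent log of checksum verifications, so it knows the last time each of its blocks was verified. When a client successfully verifies a block, it tells the datanode, which updates its log. Keeping statistics like these is valuable for detecting bad disks.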
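Verification on read can be sketched as comparing freshly computed checksums against stored ones. The sketch below keeps one CRC-32 per fixed-size chunk, so a mismatch also locates the damage. ChunkedChecksum and the 512-byte CHUNK_SIZE are assumptions made for this illustration and are not HDFS APIs.

```java
import java.util.zip.CRC32;

// Sketch of verify-on-read with one CRC-32 per fixed-size chunk.
// ChunkedChecksum and CHUNK_SIZE are illustrative, not HDFS APIs.
final class ChunkedChecksum {
    static final int CHUNK_SIZE = 512; // assumed chunk size for this sketch

    // Computes one checksum per chunk; the last chunk may be shorter.
    static long[] compute(byte[] data) {
        int chunks = (data.length + CHUNK_SIZE - 1) / CHUNK_SIZE;
        long[] sums = new long[chunks];
        for (int i = 0; i < chunks; i++) {
            int offset = i * CHUNK_SIZE;
            int length = Math.min(CHUNK_SIZE, data.length - offset);
            CRC32 crc = new CRC32();
            crc.update(data, offset, length);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    // Returns the index of the first chunk that fails verification, or -1 if all chunks match.
    static int firstCorruptChunk(byte[] data, long[] stored) {
        long[] actual = compute(data);
        int n = Math.min(actual.length, stored.length);
        for (int i = 0; i < n; i++) {
            if (actual[i] != stored[i]) {
                return i;
            }
        }
        // A differing chunk count means data was truncated or extended.
        return actual.length == stored.length ? -1 : n;
    }

    public static void main(String[] args) {
        byte[] block = new byte[1300];      // spans 3 chunks of up to 512 bytes
        long[] stored = compute(block);     // checksums recorded at write time
        System.out.println(firstCorruptChunk(block, stored)); // prints -1

        block[600] ^= 0x10;                 // damage a byte inside the second chunk
        System.out.println(firstCorruptChunk(block, stored)); // prints 1
    }
}
```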

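In addition to verifying blocks on client reads, each datanode runs a DataBlockScanner in a background thread that periodically verifies all the blocks stored on the datanode. This guards against corruption caused by "bit rot" in the physical storage media.

Because HDFS stores replicas of blocks, it can repair a corrupted block by copying one of the good replicas.

It works as follows:

If a client detects an error when reading a block, it reports the bad block and the datanode it was trying to read from to the namenode before throwing a ChecksumException. The namenode marks that block replica as corrupt, so it no longer directs clients to it or tries to copy it to another datanode. It then schedules a new replica to be copied from one of the good replicas, and the corrupt replica is deleted.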
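The repair step can be modeled in a few lines. This is a toy, in-memory sketch under strong assumptions: replicas are plain byte arrays in a list, and a single expected CRC-32 stands in for the stored checksums. ReplicaHealer is a hypothetical name, and real HDFS coordinates this through the namenode and datanodes rather than a single method.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32;

// Toy model of block healing: replicas are byte arrays held in a list.
// ReplicaHealer is hypothetical; it only illustrates "replace bad copies with a good one".
final class ReplicaHealer {

    static long crc(byte[] data) {
        CRC32 c = new CRC32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    // Replaces every replica that fails verification with a copy of a good one.
    // Returns the number of replicas repaired, or -1 if no good replica exists.
    static int heal(List<byte[]> replicas, long expectedCrc) {
        byte[] good = null;
        for (byte[] replica : replicas) {
            if (crc(replica) == expectedCrc) {
                good = replica;
                break;
            }
        }
        if (good == null) {
            return -1; // nothing to copy from
        }
        int repaired = 0;
        for (int i = 0; i < replicas.size(); i++) {
            if (crc(replicas.get(i)) != expectedCrc) {
                replicas.set(i, good.clone()); // drop the corrupt copy, store a fresh one
                repaired++;
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        byte[] block = {1, 2, 3, 4, 5};
        long expected = crc(block);
        List<byte[]> replicas = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            replicas.add(block.clone());
        }
        replicas.get(1)[2] ^= 0x40; // corrupt the second replica
        System.out.println("repaired = " + heal(replicas, expected)); // prints "repaired = 1"
    }
}
```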

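Sometimes, however, you do not want a corrupt file to be deleted automatically, for example because you want to inspect it and see whether it can be salvaged.

To disable checksum verification, pass false to the setVerifyChecksum() method on FileSystem before using open() to read the file. From the shell, the same effect is achieved with the -ignoreCrc option of the -get command or its equivalent, -copyToLocal.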

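4.2 Compression

Faster compression and decompression usually come at the expense of smaller space savings (a space/time trade-off).

gzip and ZIP are general-purpose compressors that sit in the middle of the space/time trade-off. Compression tools typically expose the trade-off as a level option, where -1 means optimize for speed and -9 means optimize for space. The following command creates a compressed file using the fastest compression method:

% gzip -1 file

LZO is optimized for speed, at the cost of somewhat less effective compression.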
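The space/time trade-off can be observed with the JDK's java.util.zip.Deflater, which exposes speed and size levels as Deflater.BEST_SPEED (1) and Deflater.BEST_COMPRESSION (9). The sketch below is illustrative only: DeflateLevels and its sample data are made-up names, and the sizes it prints depend on the input.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: compress the same input at two DEFLATE levels and compare sizes.
final class DeflateLevels {

    // Repetitive sample text (illustrative data only).
    static byte[] sample() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 400; i++) {
            sb.append("One, two, buckle my shoe\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    break; // truncated input: stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid DEFLATE data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = sample();
        System.out.println("original:         " + input.length + " bytes");
        System.out.println("BEST_SPEED:       " + compress(input, Deflater.BEST_SPEED).length + " bytes");
        System.out.println("BEST_COMPRESSION: " + compress(input, Deflater.BEST_COMPRESSION).length + " bytes");
    }
}
```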

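4.2.1 Codecs

A program to compress data read from standard input and write it to standard output

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.util.ReflectionUtils;

public class StreamCompressor {

  public static void main(String[] args) throws Exception {
    String codecClassname = args[0];
    Class<?> codecClass = Class.forName(codecClassname);
    Configuration conf = new Configuration();
    CompressionCodec codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);

    // Wrap System.out so that everything written to it is compressed
    CompressionOutputStream out = codec.createOutputStream(System.out);
    IOUtils.copyBytes(System.in, out, 4096, false);
    // Finish writing compressed data without closing the underlying stream
    out.finish();
  }
}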

The following uses the StreamCompressor program above with GzipCodec to compress the string "Text", then decompresses it from standard input using gunzip.

% echo "Text" | hadoop StreamCompressor org.apache.hadoop.io.compress.GzipCodec \
| gunzip -
Text
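The same round trip can be reproduced without a Hadoop installation by using the JDK's own gzip streams. This is a sketch for experimentation, not the Hadoop codec API: GzipRoundTrip is an illustrative name, and java.util.zip.GZIPOutputStream stands in for GzipCodec.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch: compress and decompress a string in memory with the JDK gzip streams.
final class GzipRoundTrip {

    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Closing the GZIPOutputStream writes the gzip trailer.
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] compressed = gzip("Text".getBytes(StandardCharsets.UTF_8));
        String restored = new String(gunzip(compressed), StandardCharsets.UTF_8);
        System.out.println(restored); // prints "Text"
    }
}
```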

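4.2.3 Using Compression in MapReduce

A program to decompress a compressed file using a codec inferred from the file's extension

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class FileDecompressor {

  public static void main(String[] args) throws Exception {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);

    Path inputPath = new Path(uri);
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    // Look up the codec from the file extension, e.g. .gz
    CompressionCodec codec = factory.getCodec(inputPath);
    if (codec == null) {
      System.err.println("No codec found for " + uri);
      System.exit(1);
    }

    // The output name is the input name with the codec's extension removed
    String outputUri =
        CompressionCodecFactory.removeSuffix(uri, codec.getDefaultExtension());

    InputStream in = null;
    OutputStream out = null;
    try {
      in = codec.createInputStream(fs.open(inputPath));
      out = fs.create(new Path(outputUri));
      IOUtils.copyBytes(in, out, conf);
    } finally {
      IOUtils.closeStream(in);
      IOUtils.closeStream(out);
    }
  }
}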

% hadoop FileDecompressor file.gz
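The lookup by extension can be pictured as a small table from filename suffix to codec. The sketch below is not how CompressionCodecFactory is implemented. ExtensionCodecLookup, its table, and the codec labels are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: choose a codec by filename extension and derive the output name.
// ExtensionCodecLookup and its table are illustrative, not Hadoop classes.
final class ExtensionCodecLookup {

    // extension -> codec label (assumed mapping for this sketch)
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "gzip");
        CODECS.put(".bz2", "bzip2");
        CODECS.put(".deflate", "DEFLATE");
    }

    // Returns the extension registered for this filename, or null if none matches.
    static String extensionFor(String filename) {
        for (String extension : CODECS.keySet()) {
            if (filename.endsWith(extension)) {
                return extension;
            }
        }
        return null;
    }

    // Returns the codec label for this filename, or null if no codec is registered.
    static String codecFor(String filename) {
        String extension = extensionFor(filename);
        return extension == null ? null : CODECS.get(extension);
    }

    // Strips the suffix if present; otherwise returns the name unchanged.
    static String removeSuffix(String filename, String suffix) {
        if (filename.endsWith(suffix)) {
            return filename.substring(0, filename.length() - suffix.length());
        }
        return filename;
    }

    public static void main(String[] args) {
        String name = "file.gz";
        System.out.println(codecFor(name));                         // prints "gzip"
        System.out.println(removeSuffix(name, extensionFor(name))); // prints "file"
        System.out.println(codecFor("file.txt"));                   // prints "null"
    }
}
```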

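An application that runs the maximum temperature job and produces compressed output

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCompression {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperatureWithCompression <input path> " +
          "<output path>");
      System.exit(-1);
    }

    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Compress the job output with gzip
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Producing the compressed output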

usage

% hadoop MaxTemperatureWithCompression input/ncdc/sample.txt.gz output

result

% gunzip -c output/part-r-00000.gz
1949 111
1950 22

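4.3 Serialization

Desirable properties

Compact
A compact format makes the best use of network bandwidth, which is the scarcest resource in a data center.

Fast
Interprocess communication forms the backbone of a distributed system, so the overhead of serialization and deserialization must be kept as small as possible.

Extensible
Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers. For example, it should be possible to add a new argument to a method call and have the new servers accept messages in the old format (without the new argument) from old clients.

Interoperable
For some systems, it is desirable to support clients written in a different language from the server, so the format needs to be designed with this in mind.

Hadoop uses its own serialization format, Writables, which is compact and fast, but not so easy to extend or to use from languages other than Java. Writables are central to Hadoop: MapReduce programs use them to serialize keys and values.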
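The write/readFields pattern that Writables follow can be tried with the JDK's DataOutput and DataInput alone. In this sketch IntBox is a hypothetical stand-in for a Writable holding an int. It does not implement Hadoop's Writable interface, and it only shows how compact the resulting bytes are.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// IntBox is a hypothetical class that mimics the write/readFields pattern.
final class IntBox {
    private int value;

    IntBox() {}

    IntBox(int value) {
        this.value = value;
    }

    int get() {
        return value;
    }

    // Serialize the state to a binary stream.
    void write(DataOutput out) throws IOException {
        out.writeInt(value);
    }

    // Overwrite the state from a binary stream.
    void readFields(DataInput in) throws IOException {
        value = in.readInt();
    }

    static byte[] serialize(IntBox box) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            box.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static IntBox deserialize(byte[] data) {
        IntBox box = new IntBox();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            box.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return box;
    }

    public static void main(String[] args) {
        byte[] bytes = serialize(new IntBox(163));
        System.out.println(bytes.length);            // prints 4: an int takes four bytes
        System.out.println(deserialize(bytes).get()); // prints 163
    }
}
```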

A custom Writable that stores a pair of Text objects

import java.io.*;
import org.apache.hadoop.io.*;

public class TextPair implements WritableComparable<TextPair> {

  private Text first;
  private Text second;

  public TextPair() {
    set(new Text(), new Text());
  }

  public TextPair(String first, String second) {
    set(new Text(first), new Text(second));
  }

  public TextPair(Text first, Text second) {
    set(first, second);
  }

  public void set(Text first, Text second) {
    this.first = first;
    this.second = second;
  }

  public Text getFirst() {
    return first;
  }

  public Text getSecond() {
    return second;
  }

  // Serialize each field in turn by delegating to Text
  @Override
  public void write(DataOutput out) throws IOException {
    first.write(out);
    second.write(out);
  }

  // Deserialize the fields in the same order they were written
  @Override
  public void readFields(DataInput in) throws IOException {
    first.readFields(in);
    second.readFields(in);
  }

  @Override
  public int hashCode() {
    return first.hashCode() * 163 + second.hashCode();
  }

  @Override
  public boolean equals(Object o) {
    if (o instanceof TextPair) {
      TextPair tp = (TextPair) o;
      return first.equals(tp.first) && second.equals(tp.second);
    }
    return false;
  }

  @Override
  public String toString() {
    return first + "\t" + second;
  }

  // Order by the first field, then by the second
  @Override
  public int compareTo(TextPair tp) {
    int cmp = first.compareTo(tp.first);
    if (cmp != 0) {
      return cmp;
    }
    return second.compareTo(tp.second);
  }
}
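To experiment with the same design without Hadoop on the classpath, the pair can be approximated with plain strings. StringPair below is a hypothetical analogue of TextPair: it uses writeUTF/readUTF in place of Text, so its byte layout is not the one TextPair produces, but the field-by-field serialization and the two-level comparison have the same shape.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// StringPair is a hypothetical, Hadoop-free analogue of TextPair.
final class StringPair implements Comparable<StringPair> {
    private String first = "";
    private String second = "";

    StringPair() {}

    StringPair(String first, String second) {
        this.first = first;
        this.second = second;
    }

    String getFirst() { return first; }

    String getSecond() { return second; }

    // Fields are written in a fixed order...
    void write(DataOutput out) throws IOException {
        out.writeUTF(first);
        out.writeUTF(second);
    }

    // ...and must be read back in the same order.
    void readFields(DataInput in) throws IOException {
        first = in.readUTF();
        second = in.readUTF();
    }

    // Order by the first field, then by the second.
    @Override
    public int compareTo(StringPair other) {
        int cmp = first.compareTo(other.first);
        return cmp != 0 ? cmp : second.compareTo(other.second);
    }

    // Serializes the pair and reads it back into a new instance.
    static StringPair roundTrip(StringPair pair) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            pair.write(new DataOutputStream(bytes));
            StringPair copy = new StringPair();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringPair copy = roundTrip(new StringPair("a", "b"));
        System.out.println(copy.getFirst() + "\t" + copy.getSecond());      // prints "a	b" (tab-separated)
        System.out.println(copy.compareTo(new StringPair("a", "c")) < 0);   // prints "true"
    }
}
```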

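4.4 File-Based Data Structures

For some applications, you need a specialized data structure to hold your data. For MapReduce-based processing, putting each blob of binary data into its own file does not scale, so Hadoop developed a number of higher-level containers for these situations.

Writing a SequenceFile

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    // The key and value objects are reused for every record
    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path,
          key.getClass(), value.getClass());

      for (int i = 0; i < 100; i++) {
        key.set(100 - i);
        value.set(DATA[i % DATA.length]);
        // getLength() is the current position in the file
        System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}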

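Reading a SequenceFile

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {

  public static void main(String[] args) throws IOException {
    String uri = args[0];
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, path, conf);
      // Discover the key and value types from the file header
      Writable key = (Writable)
          ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable)
          ReflectionUtils.newInstance(reader.getValueClass(), conf);
      long position = reader.getPosition();
      while (reader.next(key, value)) {
        // Mark records that are preceded by a sync point with an asterisk
        String syncSeen = reader.syncSeen() ? "*" : "";
        System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
        position = reader.getPosition(); // beginning of next record
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}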
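The idea of packing many key-value records into one container, and tracking the byte offset of each record as the demos above print it, can be sketched with plain JDK streams. This is not the SequenceFile format: it has no header, no sync markers, and no compression. RecordFileDemo, its Record class, and the layout (4-byte key followed by a writeUTF string) are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of a flat container of (int key, String value) records.
// The layout is made up for illustration and is NOT the SequenceFile format.
final class RecordFileDemo {

    static final class Record {
        final int key;
        final String value;

        Record(int key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    // Writes records back to back and records the offset at which each one starts.
    static byte[] write(List<Record> records, List<Long> offsets) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (Record r : records) {
                offsets.add((long) out.size()); // bytes written so far
                out.writeInt(r.key);
                out.writeUTF(r.value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads records sequentially until the data is exhausted.
    static List<Record> read(byte[] data) {
        List<Record> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            while (in.available() > 0) {
                int key = in.readInt();
                String value = in.readUTF();
                records.add(new Record(key, value));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        List<Record> records = Arrays.asList(
            new Record(100, "One, two, buckle my shoe"),
            new Record(99, "Three, four, shut the door"));
        List<Long> offsets = new ArrayList<>();
        byte[] file = write(records, offsets);

        List<Record> readBack = read(file);
        for (int i = 0; i < readBack.size(); i++) {
            Record r = readBack.get(i);
            System.out.printf("[%s]\t%s\t%s%n", offsets.get(i), r.key, r.value);
        }
    }
}
```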