An Analysis of the Java LineRecordReader Class

Fields

private long start;
private long pos;
private long end;
private SplitLineReader in;
private FSDataInputStream fileIn;
private Seekable filePosition;
private int maxLineLength;
private LongWritable key;
private Text value;
private boolean isCompressedInput;
private Decompressor decompressor;
private byte[] recordDelimiterBytes;

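Constructors

There are two constructors, one without parameters and one with a parameter. The only difference is whether a record delimiter is supplied.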

public LineRecordReader() {
}

public LineRecordReader(byte[] recordDelimiter) {
  this.recordDelimiterBytes = recordDelimiter;
}

Methods

Parameters

public void initialize(InputSplit genericSplit,
                       TaskAttemptContext context) throws IOException {
  FileSplit split = (FileSplit) genericSplit;
  Configuration job = context.getConfiguration();
  this.maxLineLength = job.getInt(MAX_LINE_LENGTH, Integer.MAX_VALUE);
  start = split.getStart();
  end = start + split.getLength();
  final Path file = split.getPath();

  // open the file and seek to the start of the split
  final FileSystem fs = file.getFileSystem(job);
  fileIn = fs.open(file);

  CompressionCodec codec = new CompressionCodecFactory(job).getCodec(file);
  if (null!=codec) {
    isCompressedInput = true;
    decompressor = CodecPool.getDecompressor(codec);
    if (codec instanceof SplittableCompressionCodec) {
      final SplitCompressionInputStream cIn =
        ((SplittableCompressionCodec)codec).createInputStream(
          fileIn, decompressor, start, end,
          SplittableCompressionCodec.READ_MODE.BYBLOCK);
      in = new CompressedSplitLineReader(cIn, job,
          this.recordDelimiterBytes);
      start = cIn.getAdjustedStart();
      end = cIn.getAdjustedEnd();
      filePosition = cIn;
    } else {
      in = new SplitLineReader(codec.createInputStream(fileIn,
          decompressor), job, this.recordDelimiterBytes);
      filePosition = fileIn;
    }
  } else {
    fileIn.seek(start);
    in = new UncompressedSplitLineReader(
        fileIn, job, this.recordDelimiterBytes, split.getLength());
    filePosition = fileIn;
  }
  // If this is not the first split, we always throw away first record
  // because we always (except the last split) read one extra line in
  // next() method.
  if (start != 0) {
    start += in.readLine(new Text(), 0, maxBytesToConsume(start));
  }
  this.pos = start;
}
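The rule at the end of initialize (discard the first record unless the split starts at byte 0) works together with the "read one extra line" loop in nextKeyValue, so that each line ends up in exactly one split. Below is a minimal sketch of that idea using only the standard library. It assumes LF-terminated lines held in an in-memory byte array, and the names SplitBoundarySketch, endOfLine and readSplit are hypothetical, not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stdlib-only model of how split boundaries map to lines. */
final class SplitBoundarySketch {

    /** Index just past the next '\n' at or after from, or data.length at EOF. */
    static int endOfLine(byte[] data, int from) {
        int i = from;
        while (i < data.length && data[i] != '\n') {
            i++;
        }
        return Math.min(i + 1, data.length);
    }

    /** Lines that a reader for the split [start, start + length) would emit. */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: throw away the first (possibly partial) record,
            // because the previous split reads one line past its own end.
            pos = endOfLine(data, start);
        }
        List<String> lines = new ArrayList<>();
        // A line that starts at pos <= end still belongs to this split.
        while (pos <= end && pos < data.length) {
            int next = endOfLine(data, pos);
            int textEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }
}
```

With the 16-byte input "hello\nworld\nfoo\n" cut into two 8-byte splits, the first split emits hello and world (the second line starts before the boundary and is read to completion), while the second split skips the tail of world and emits only foo.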

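The main thing to look at is where the positions come from. Both are taken from fields of the InputSplit:
start = split.getStart();
end = start + split.getLength();
The two getters used here are defined as follows: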

/** The position of the first byte in the file to process. */
public long getStart() { return start; }

/** The number of bytes in the file to process. */
@Override
public long getLength() { return length; }

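How is the key obtained? The first key is the start position, but what about the next key, and the value?
If key is null, a new LongWritable is created. A fresh LongWritable holds 0, but that value is overwritten immediately by key.set(pos).
If value is null, it is initialized to an empty Text.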

public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  key.set(pos);
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;
  // We always read one extra line, which lies outside the upper
  // split limit i.e. (end - 1)
  while (getFilePosition() <= end || in.needAdditionalRecordAfterSplit()) {
    if (pos == 0) {
      newSize = skipUtfByteOrderMark();
    } else {
      newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
      pos += newSize;
    }

    if ((newSize == 0) || (newSize < maxLineLength)) {
      break;
    }

    // line too long. try again
    LOG.info("Skipped line of size " + newSize + " at pos " +
             (pos - newSize));
  }
  if (newSize == 0) {
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}

So how do we tell whether another key/value pair exists, or whether the end has been reached?
Look at pos. It advances by newSize, and newSize in turn comes from the readLine method.

newSize = in.readLine(value, maxLineLength, maxBytesToConsume(pos));
pos += newSize;

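What does readLine do? Its return value is a number: the count of bytes read for the line. In the words of its Javadoc:
the number of bytes read including the (longest) newline found.

It reads one line from the InputStream into the given Text.

Parameters:
str: the object that stores the line (without the newline).
maxLineLength: the maximum number of bytes to store into str; the rest of the line is silently discarded.
maxBytesToConsume: the maximum number of bytes to consume in this call. This is only a hint, because a line that crosses this threshold is allowed to; it can overshoot by as much as one buffer length.
Returns: the number of bytes read, including the (longest) newline found.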
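The difference between what is stored and what is consumed is easy to miss, so here is a small sketch of that part of the contract. It is not the Hadoop implementation: it assumes LF-only terminators and ASCII data in a byte array, ignores maxBytesToConsume, and the name MaxLineLengthSketch is hypothetical.

```java
/** Hypothetical sketch: stored bytes are capped, consumed bytes are not. */
final class MaxLineLengthSketch {

    /**
     * Reads one LF-terminated line starting at from.
     * At most maxLineLength bytes are kept in out; the rest of the line is dropped.
     * Returns the number of bytes consumed, including the '\n' if one was found.
     */
    static int readLine(byte[] data, int from, int maxLineLength, StringBuilder out) {
        out.setLength(0);
        int i = from;
        while (i < data.length) {
            byte b = data[i++];
            if (b == '\n') {
                break;
            }
            if (out.length() < maxLineLength) {
                out.append((char) b); // ASCII assumption
            }
        }
        return i - from;
    }
}
```

Reading "abcdef\nxy\n" from offset 0 with maxLineLength = 3 stores abc but returns 7, so the caller's position still moves to the start of the next line.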

To sum up: if newSize == 0, key and value are set to null and the method returns false, meaning there is no next key/value pair.
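The bookkeeping described so far (key = current byte offset, pos += newSize, stop when newSize == 0) can be sketched without Hadoop. The class below is a simplified model under several assumptions: a single split covering the whole input, LF terminators, ASCII data, no maxLineLength check and no BOM handling. OffsetKeySketch and readAll are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical model of the key/value bookkeeping in nextKeyValue. */
final class OffsetKeySketch {
    private final byte[] data;
    private long pos = 0;
    private Long key;
    private String value;

    OffsetKeySketch(byte[] data) {
        this.data = data;
    }

    /** Stores the line text (without '\n') in out and returns bytes consumed. */
    private int readLine(StringBuilder out) {
        int i = (int) pos;
        int consumed = 0;
        while (i < data.length) {
            byte b = data[i++];
            consumed++;
            if (b == '\n') {
                break;
            }
            out.append((char) b); // ASCII assumption
        }
        return consumed;
    }

    boolean nextKeyValue() {
        key = pos; // the key is the byte offset where the line starts
        StringBuilder line = new StringBuilder();
        int newSize = readLine(line);
        pos += newSize;
        if (newSize == 0) { // nothing consumed: end of input
            key = null;
            value = null;
            return false;
        }
        value = line.toString();
        return true;
    }

    Long getCurrentKey() {
        return key;
    }

    String getCurrentValue() {
        return value;
    }

    /** Convenience helper: collects every (offset, line) pair in order. */
    static Map<Long, String> readAll(byte[] data) {
        OffsetKeySketch reader = new OffsetKeySketch(data);
        Map<Long, String> pairs = new LinkedHashMap<>();
        while (reader.nextKeyValue()) {
            pairs.put(reader.getCurrentKey(), reader.getCurrentValue());
        }
        return pairs;
    }
}
```

For the input "a\nbb\nccc\n" this yields the pairs (0, a), (2, bb), (5, ccc). An empty line still consumes its one newline byte, so newSize == 0 only happens at end of input.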

In other words, the reason each read returns exactly one line is that the readLine method takes care of it.

public int readLine(Text str, int maxLineLength,
                    int maxBytesToConsume) throws IOException {
  if (this.recordDelimiterBytes != null) {
    return readCustomLine(str, maxLineLength, maxBytesToConsume);
  } else {
    return readDefaultLine(str, maxLineLength, maxBytesToConsume);
  }
}

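If no delimiter was passed in, readDefaultLine is called. The default delimiters (CR, LF, or CRLF) are handled there.
We are reading data from in, but the head of the stream may already be buffered in buffer, so there are several cases:
* 1. No newline characters are in the buffer, so we need to copy everything and read another buffer from the stream.
* 2. An unambiguously terminated line is in the buffer, so we just copy it to str.
* 3. An ambiguously terminated line is in the buffer, i.e. the buffer ends in CR. In this case we copy everything up to the CR into str, but we also need to see what follows the CR: if it is LF, we consume the LF as well, so that the next call to readLine starts after it.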

/**
 * Read a line terminated by one of CR, LF, or CRLF.
 */
private int readDefaultLine(Text str, int maxLineLength, int maxBytesToConsume)
    throws IOException {
  /* We're reading data from in, but the head of the stream may be
   * already buffered in buffer, so we have several cases:
   * 1. No newline characters are in the buffer, so we need to copy
   *    everything and read another buffer from the stream.
   * 2. An unambiguously terminated line is in buffer, so we just
   *    copy to str.
   * 3. Ambiguously terminated line is in buffer, i.e. buffer ends
   *    in CR.  In this case we copy everything up to CR to str, but
   *    we also need to see what follows CR: if it's LF, then we
   *    need consume LF as well, so next call to readLine will read
   *    from after that.
   * We use a flag prevCharCR to signal if previous character was CR
   * and, if it happens to be at the end of the buffer, delay
   * consuming it until we have a chance to look at the char that
   * follows.
   */
  str.clear();
  int txtLength = 0; //tracks str.getLength(), as an optimization
  int newlineLength = 0; //length of terminating newline
  boolean prevCharCR = false; //true of prev char was CR
  long bytesConsumed = 0;
  do {
    int startPosn = bufferPosn; //starting from where we left off the last time
    if (bufferPosn >= bufferLength) {
      startPosn = bufferPosn = 0;
      if (prevCharCR) {
        ++bytesConsumed; //account for CR from previous read
      }
      bufferLength = fillBuffer(in, buffer, prevCharCR);
      if (bufferLength <= 0) {
        break; // EOF
      }
    }
    for (; bufferPosn < bufferLength; ++bufferPosn) { //search for newline
      if (buffer[bufferPosn] == LF) {
        newlineLength = (prevCharCR) ? 2 : 1;
        ++bufferPosn; // at next invocation proceed from following byte
        break;
      }
      if (prevCharCR) { //CR + notLF, we are at notLF
        newlineLength = 1;
        break;
      }
      prevCharCR = (buffer[bufferPosn] == CR);
    }
    int readLength = bufferPosn - startPosn;
    if (prevCharCR && newlineLength == 0) {
      --readLength; //CR at the end of the buffer
    }
    bytesConsumed += readLength;
    int appendLength = readLength - newlineLength;
    if (appendLength > maxLineLength - txtLength) {
      appendLength = maxLineLength - txtLength;
    }
    if (appendLength > 0) {
      str.append(buffer, startPosn, appendLength);
      txtLength += appendLength;
    }
  } while (newlineLength == 0 && bytesConsumed < maxBytesToConsume);

  if (bytesConsumed > Integer.MAX_VALUE) {
    throw new IOException("Too many bytes before newline: " + bytesConsumed);
  }
  return (int)bytesConsumed;
}
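The buffer handling above makes the terminator logic hard to see. The following sketch keeps only the CR / LF / CRLF decision and the "bytes consumed" return value. It is a deliberate simplification, not the Hadoop code: instead of the prevCharCR flag and a refillable buffer it peeks one byte ahead with java.io.PushbackInputStream, and it has no maxLineLength or maxBytesToConsume limits. DefaultLineSketch and readAll are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, unbuffered sketch of CR / LF / CRLF line termination. */
final class DefaultLineSketch {

    /** Reads one line into str (terminator excluded); returns bytes consumed. */
    static int readLine(PushbackInputStream in, ByteArrayOutputStream str)
            throws IOException {
        str.reset();
        int consumed = 0;
        int b;
        while ((b = in.read()) != -1) {
            consumed++;
            if (b == '\n') {
                break; // bare LF
            }
            if (b == '\r') {
                int next = in.read(); // look at what follows the CR
                if (next == '\n') {
                    consumed++; // CRLF: the LF belongs to this line
                } else if (next != -1) {
                    in.unread(next); // bare CR: give the byte back
                }
                break;
            }
            str.write(b);
        }
        return consumed;
    }

    /** Convenience helper: returns "text|bytesConsumed" for every line. */
    static List<String> readAll(byte[] input) {
        List<String> result = new ArrayList<>();
        PushbackInputStream in =
                new PushbackInputStream(new ByteArrayInputStream(input));
        ByteArrayOutputStream str = new ByteArrayOutputStream();
        try {
            int n;
            while ((n = readLine(in, str)) > 0) {
                String text = new String(str.toByteArray(), StandardCharsets.UTF_8);
                result.add(text + "|" + n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

For the input "a\r\nb\rc\nd", readAll returns a|3, b|2, c|2, d|1: the CRLF counts as two consumed bytes, a bare CR or LF as one, and the last line has no terminator at all.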

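A brief summary:
LineRecordReader uses LineReader's readLine method to read each line of data. By default, every line terminator ends one record, which becomes a key/value pair. LineRecordReader is responsible for turning an InputSplit into key/value pairs. When exactly is it called?

So far we have only seen the check for whether a next key/value pair exists. What is that next pair, and where is it set?
Where key and value are set

The key is set by key.set(pos) at the top of nextKeyValue. The value is filled in inside readDefaultLine shown above, where it is hidden quite deep: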

if (appendLength > 0) {
  str.append(buffer, startPosn, appendLength);
  txtLength += appendLength;
}
