
LanguageManual ORC

ORC File Format

Version

Introduced in Hive version 0.11.0.

The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.

Compared with the RCFile format, the ORC file format has many advantages:

  • A single file as the output of each task, which reduces the NameNode's load.
  • Hive type support including datetime, decimal, and the complex types (struct, list, map, and union).
  • Light-weight indexes stored within the file:
    • skip row groups that do not pass predicate filtering
    • seek to a given row
  • Block-mode compression based on data type:
    • run-length encoding for integer columns
    • dictionary encoding for string columns
  • Concurrent reads of the same file using separate RecordReaders.
  • Ability to split files without scanning for markers.
  • Bound on the amount of memory needed for reading or writing.
  • Metadata stored using Protocol Buffers, which allows addition and removal of fields.

File Structure

An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file, a postscript holds compression parameters and the size of the compressed footer.

The default stripe size is 250 MB. Large stripe sizes enable large, efficient reads from HDFS.

The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum.

This diagram illustrates the ORC file structure:

Stripe Structure

As shown in the diagram, each stripe in an ORC file holds index data, row data, and a stripe footer.

The stripe footer contains a directory of stream locations. Row data is used in table scans.

Index data includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block.  Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.

Having relatively frequent row index entries enables row-skipping within a stripe for rapid reads, despite large stripe sizes. By default every 10,000 rows can be skipped.

With the ability to skip large sets of rows based on filter predicates, you can sort a table on its secondary keys to achieve a big reduction in execution time. For example, if the primary partition is transaction date, the table can be sorted on state, zip code, and last name. Then looking for records in one state will skip the records of all other states.
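As an illustration, here is a minimal HiveQL sketch (with hypothetical table and column names) that rewrites a table pre-sorted on its secondary keys so that ORC row indexes can skip non-matching row groups:

-- hypothetical names; rewrite the data sorted on the secondary keys
create table transactions_sorted stored as orc as
select * from transactions
sort by state, zip_code, last_name;

Note that SORT BY only orders rows within each reducer's output file, which is sufficient here because ORC row indexes are local to each file.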

A complete specification of the format is given in the ORC specification.

HiveQL Syntax

File formats are specified at the table (or partition) level. You can specify the ORC file format with HiveQL statements such as these:

  • CREATE TABLE ... STORED AS ORC
  • ALTER TABLE ... [PARTITION partition_spec] SET FILEFORMAT ORC
  • SET hive.default.fileformat=Orc
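For instance, a minimal sketch (with a hypothetical table name) that switches an existing table to ORC and makes ORC the session default:

-- hypothetical table name; changes only table metadata, existing files are not rewritten
alter table logs set fileformat orc;
set hive.default.fileformat=Orc;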

The parameters are all placed in the TBLPROPERTIES (see Create Table). They are:

Key                       Default      Notes
orc.compress              ZLIB         high level compression (one of NONE, ZLIB, SNAPPY)
orc.compress.size         262,144      number of bytes in each compression chunk
orc.stripe.size           67,108,864   number of bytes in each stripe
orc.row.index.stride      10,000       number of rows between index entries (must be >= 1000)
orc.create.index          true         whether to create row indexes
orc.bloom.filter.columns  ""           comma separated list of column names for which bloom filter should be created
orc.bloom.filter.fpp      0.05         false positive probability for bloom filter (must be > 0.0 and < 1.0)

For example, creating an ORC stored table without compression:

create table Addresses (
  name string,
  street string,
  city string,
  state string,
  zip int
) stored as orc tblproperties ("orc.compress"="NONE");
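A similar sketch (with hypothetical table and column names) that enables bloom filters on two columns via the properties listed above:

-- hypothetical names; bloom filters let readers skip row groups on point lookups
create table users (id bigint, email string, state string)
stored as orc
tblproperties ("orc.bloom.filter.columns"="email,state", "orc.bloom.filter.fpp"="0.05");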

Version 0.14.0+: CONCATENATE

ALTER TABLE table_name [PARTITION partition_spec] CONCATENATE can be used to merge small ORC files into a larger file, starting in Hive 0.14.0. The merge happens at the stripe level, which avoids decompressing and decoding the data.
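For example, assuming the Addresses table above has accumulated many small ORC files:

alter table Addresses concatenate;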

Serialization and Compression

The serialization of column data in an ORC file depends on whether the data type is integer or string.

Integer Column Serialization

Integer columns are serialized in two streams.

  1. present bit stream: is the value non-null?
  2. data stream: a stream of integers

Integer data is serialized in a way that takes advantage of the common distribution of numbers:

  • Integers are encoded using a variable-width encoding that has fewer bytes for small integers.
  • Repeated values are run-length encoded.
  • Values that differ by a constant in the range (-128 to 127) are run-length encoded.

The variable-width encoding is based on Google's protocol buffers and uses the high bit to represent whether this byte is not the last and the lower 7 bits to encode data. To encode negative numbers, a zigzag encoding is used where 0, -1, 1, -2, and 2 map into 0, 1, 2, 3, and 4 respectively.
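Equivalently, the zigzag value z of an integer n is z = 2n for n >= 0 and z = -2n - 1 for n < 0, so values of small magnitude, whether positive or negative, encode into few bytes.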

Each set of numbers is encoded this way:

  • If the first byte (b0) is negative:
    • -b0 variable-length integers follow.
  • If the first byte (b0) is positive:
    • it represents b0 + 3 repeated integers
    • the second byte (-128 to +127) is the delta added between each repetition
    • one variable-length integer (the base value) follows.

In run-length encoding, the first byte specifies the run length and whether the values are literals or duplicates. Duplicates can step by -128 to +127. Run-length encoding uses protobuf style variable-length integers.
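As a worked illustration of this scheme, the run 7, 7, 7, 7, 7 can be stored as a run length, a step of 0, and the single base value 7, while the ascending run 100, 101, 102, 103 can be stored as a run length, a step of +1, and the base value 100.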

String Column Serialization

Serialization of string columns uses a dictionary to form unique column values. The dictionary is sorted to speed up predicate filtering and improve compression ratios.

String columns are serialized in four streams.

  1. present bit stream: is the value non-null?
  2. dictionary data: the bytes for the strings
  3. dictionary length: the length of each entry
  4. row data: the row values

Both the dictionary length and the row values are run-length encoded streams of integers.
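For example, a column holding the values "apple", "banana", "apple", "apple" would store the sorted dictionary entries apple and banana once, the entry lengths 5 and 6, and the row data as the indexes 0, 1, 0, 0.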

Compression

Streams are compressed using a codec, which is specified as a table property for all streams in that table. To optimize memory use, compression is done incrementally as each block is produced. Compressed blocks can be jumped over without first having to be decompressed for scanning. Positions in the stream are represented by a block start location and an offset into the block.

The codec can be Snappy, Zlib, or none.

ORC File Dump Utility

The ORC file dump utility analyzes ORC files.  To invoke it, use this command:

// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
 
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
 
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump] 
    [--backup-path <new-path>] <location-of-orc-file-or-directory>

Specifying -d in the command will cause it to dump the ORC file data rather than the metadata (Hive 1.1.0 and later).

Specifying --rowindex with a comma separated list of column ids will cause it to print row indexes for the specified columns, where 0 is the top level struct containing all of the columns and 1 is the first column id (Hive 1.1.0 and later).

Specifying -t in the command will print the timezone id of the writer.

Specifying -j in the command will print the ORC file metadata in JSON format. To pretty print the JSON metadata, add -p to the command.

Specifying --recover in the command will recover a corrupted ORC file generated by Hive streaming.

Specifying --skip-dump along with --recover will perform recovery without dumping metadata.

Specifying --backup-path with a new-path will let the recovery tool move corrupted files to the specified backup path (default: /tmp).

<location-of-orc-file> is the URI of the ORC file.

<location-of-orc-file-or-directory> is the URI of the ORC file or directory. From Hive 1.3.0 onward, this URI can be a directory containing ORC files.
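For example, a typical invocation (Hive 1.2.0 and later, with a hypothetical HDFS path) that dumps the metadata, the writer's timezone, and the row indexes for columns 1 and 2:

hive --orcfiledump -t --rowindex 1,2 hdfs:///user/hive/warehouse/addresses/000000_0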

ORC Configuration Parameters

The ORC configuration parameters are described in Hive Configuration Properties – ORC File Format.

ORC Format Specification

The ORC specification has moved to the ORC project.
