Iceberg Series (2): Storage Deep Dive - Partition (1)
Iceberg Partition Evolution
Table partitioning can be evolved by adding, removing, renaming, or reordering the fields of the partition spec.
Changing the partition spec produces a new spec identified by a unique spec ID. The new spec is added to the table's list of partition specs and may be set as the table's default spec.
When evolving a partition spec, the changes must not alter partition field IDs, because partition field IDs are used as the partition tuple field IDs in manifest files.
In v2, the partition field ID must be tracked explicitly for every partition field. New IDs are assigned based on the last-assigned partition ID recorded in table metadata.
In v1, partition field IDs were not tracked; they were simply assigned sequentially starting at 1000. This assignment scheme causes problems when manifest-based metadata tables are read across multiple specs, because partition fields with the same ID may contain different data types. For compatibility with older versions, the following rules are recommended for partition evolution in v1 tables:
- Do not reorder partition fields
- Do not remove partition fields; instead, replace the field's transform with the void transform
- Only add new partition fields at the end of the previous partition spec
Let's walk through a few examples:
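These v1 rules can be illustrated with a small Python sketch (illustrative only, not Iceberg code, and `evolve_spec` is a hypothetical helper): field IDs are positional, starting at 1000, so dropping a field swaps in a void transform rather than removing it, which keeps the IDs of later fields stable.

```python
# Sketch of v1-style partition field-id assignment. Each field is a
# (name, transform, field_id) tuple; IDs start at 1000 and follow position.

def evolve_spec(spec, add=None, drop=None):
    """Return a new spec list evolved per the v1 compatibility rules."""
    new_spec = []
    for name, transform, field_id in spec:
        if name == drop:
            # v1 rule: replace the transform with "void", keep the field and its ID
            new_spec.append((name, "void", field_id))
        else:
            new_spec.append((name, transform, field_id))
    if add is not None:
        # v1 rule: only append at the end; the ID is 1000 + position
        next_id = 1000 + len(new_spec)
        new_spec.append((add, "identity", next_id))
    return new_spec

spec0 = [("category", "identity", 1000)]
spec1 = evolve_spec(spec0, add="data")       # "data" gets field-id 1001
spec2 = evolve_spec(spec1, drop="category")  # "category" becomes void, keeps 1000
```

Because the voided field still occupies its slot, every surviving field keeps the same positional ID across specs, which is exactly what manifest readers rely on.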
CREATE TABLE local.db.sample (id bigint, data string, category string)
USING iceberg
PARTITIONED BY (category);

INSERT INTO local.db.sample VALUES (1, 'a', '1');
Inspect the metadata file:
{
  "format-version" : 1,
  "table-uuid" : "94ad30ed-4a31-438d-b81b-36d791471d2c",
  "location" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample",
  "last-updated-ms" : 1642174094175,
  "last-column-id" : 3,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [
      { "id" : 1, "name" : "id", "required" : false, "type" : "long" },
      { "id" : 2, "name" : "data", "required" : false, "type" : "string" },
      { "id" : 3, "name" : "category", "required" : false, "type" : "string" }
    ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [
      { "id" : 1, "name" : "id", "required" : false, "type" : "long" },
      { "id" : 2, "name" : "data", "required" : false, "type" : "string" },
      { "id" : 3, "name" : "category", "required" : false, "type" : "string" }
    ]
  } ],
  "partition-spec" : [ { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ { "order-id" : 0, "fields" : [ ] } ],
  "properties" : { "owner" : "liliwei" },
  "current-snapshot-id" : 3476183237498309505,
  "snapshots" : [ {
    "snapshot-id" : 3476183237498309505,
    "timestamp-ms" : 1642174094175,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1642173017469",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "874",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "874",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ { "timestamp-ms" : 1642174094175, "snapshot-id" : 3476183237498309505 } ],
  "metadata-log" : [ { "timestamp-ms" : 1642173226793, "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/v1.metadata.json" } ]
}
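To make the partition-related keys easy to spot, here is a small Python snippet that picks them out of a hand-copied excerpt of the metadata above (only the partition keys are reproduced; everything else is omitted for brevity):

```python
import json

# Partition-related excerpt of the metadata file shown above.
metadata = json.loads("""
{
  "partition-spec": [
    {"name": "category", "transform": "identity", "source-id": 3, "field-id": 1000}
  ],
  "default-spec-id": 0,
  "partition-specs": [
    {"spec-id": 0, "fields": [
      {"name": "category", "transform": "identity", "source-id": 3, "field-id": 1000}
    ]}
  ],
  "last-partition-id": 1000
}
""")

# Map each spec-id to its (name, transform, field-id) triples. The very first
# partition field gets field-id 1000 (v1 sequential assignment), and
# last-partition-id records the highest ID assigned so far.
specs = {
    spec["spec-id"]: [(f["name"], f["transform"], f["field-id"]) for f in spec["fields"]]
    for spec in metadata["partition-specs"]
}
```

At this point the table has a single spec (spec-id 0) with one identity-partition field on `category`.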
Inspect the snapshot (manifest list) file:
java -jar ~/plat/tools/avro-tools-1.10.2.jar tojson snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro
{
  "manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/002e475b-e5b9-485e-a59d-35730a6c9f4e-m0.avro",
  "manifest_length": 6095,
  "partition_spec_id": 0,
  "added_snapshot_id": {"long": 3476183237498309505},
  "added_data_files_count": {"int": 1},
  "existing_data_files_count": {"int": 0},
  "deleted_data_files_count": {"int": 0},
  "partitions": {"array": [
    {"contains_null": false, "contains_nan": {"boolean": false}, "lower_bound": {"bytes": "1"}, "upper_bound": {"bytes": "1"}}
  ]},
  "added_rows_count": {"long": 1},
  "existing_rows_count": {"long": 0},
  "deleted_rows_count": {"long": 0}
}
Now evolve the partition spec by adding a partition field:

ALTER TABLE local.db.sample ADD PARTITION FIELD data;
Inspect the directory structure:
(base) ➜ metadata tree -l
.
├── 002e475b-e5b9-485e-a59d-35730a6c9f4e-m0.avro
├── snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro
├── v1.metadata.json
├── v2.metadata.json
├── v3.metadata.json
└── version-hint.text

0 directories, 6 files
Inspect the v3.metadata.json file:
{
  "format-version" : 1,
  "table-uuid" : "94ad30ed-4a31-438d-b81b-36d791471d2c",
  "location" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample",
  "last-updated-ms" : 1642175874398,
  "last-column-id" : 3,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [
      { "id" : 1, "name" : "id", "required" : false, "type" : "long" },
      { "id" : 2, "name" : "data", "required" : false, "type" : "string" },
      { "id" : 3, "name" : "category", "required" : false, "type" : "string" }
    ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [
      { "id" : 1, "name" : "id", "required" : false, "type" : "long" },
      { "id" : 2, "name" : "data", "required" : false, "type" : "string" },
      { "id" : 3, "name" : "category", "required" : false, "type" : "string" }
    ]
  } ],
  "partition-spec" : [
    { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 },
    { "name" : "data", "transform" : "identity", "source-id" : 2, "field-id" : 1001 }
  ],
  "default-spec-id" : 1,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 } ]
  }, {
    "spec-id" : 1,
    "fields" : [
      { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 },
      { "name" : "data", "transform" : "identity", "source-id" : 2, "field-id" : 1001 }
    ]
  } ],
  "last-partition-id" : 1001,
  "default-sort-order-id" : 0,
  "sort-orders" : [ { "order-id" : 0, "fields" : [ ] } ],
  "properties" : { "owner" : "liliwei" },
  "current-snapshot-id" : 3476183237498309505,
  "snapshots" : [ {
    "snapshot-id" : 3476183237498309505,
    "timestamp-ms" : 1642174094175,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1642173017469",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "874",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "874",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ { "timestamp-ms" : 1642174094175, "snapshot-id" : 3476183237498309505 } ],
  "metadata-log" : [
    { "timestamp-ms" : 1642173226793, "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/v1.metadata.json" },
    { "timestamp-ms" : 1642174094175, "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/v2.metadata.json" }
  ]
}
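The spec evolution is easiest to see by comparing field IDs across the two specs. The following Python snippet does that against a hand-copied, partition-only excerpt of the v3 metadata above:

```python
import json

# Partition-related excerpt of v3.metadata.json shown above.
metadata = json.loads("""
{
  "default-spec-id": 1,
  "partition-specs": [
    {"spec-id": 0, "fields": [
      {"name": "category", "transform": "identity", "source-id": 3, "field-id": 1000}
    ]},
    {"spec-id": 1, "fields": [
      {"name": "category", "transform": "identity", "source-id": 3, "field-id": 1000},
      {"name": "data", "transform": "identity", "source-id": 2, "field-id": 1001}
    ]}
  ],
  "last-partition-id": 1001
}
""")

# Map spec-id -> {field name: field-id}. "category" keeps field-id 1000 in both
# specs, the new "data" field received the next sequential ID (1001), and the
# new spec became the table's default.
field_ids = {
    spec["spec-id"]: {f["name"]: f["field-id"] for f in spec["fields"]}
    for spec in metadata["partition-specs"]
}
```

Note that the old spec (spec-id 0) is retained in `partition-specs`: existing manifests written under it remain readable, and `last-partition-id` advanced from 1000 to 1001.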
Insert another row:

INSERT INTO local.db.sample VALUES (2, 'b', '2');
Inspect the directory structure:
(base) ➜ metadata tree -l
.
├── 002e475b-e5b9-485e-a59d-35730a6c9f4e-m0.avro
├── ed1a1f56-56fc-4313-bf60-10df0c4e88ca-m0.avro
├── snap-2641901311316255446-1-ed1a1f56-56fc-4313-bf60-10df0c4e88ca.avro
├── snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro
├── v1.metadata.json
├── v2.metadata.json
├── v3.metadata.json
├── v4.metadata.json
└── version-hint.text

0 directories, 9 files
Inspect the new snapshot (manifest list) file:

java -jar ~/plat/tools/avro-tools-1.10.2.jar tojson snap-2641901311316255446-1-ed1a1f56-56fc-4313-bf60-10df0c4e88ca.avro
{
  "manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/ed1a1f56-56fc-4313-bf60-10df0c4e88ca-m0.avro",
  "manifest_length": 6301,
  "partition_spec_id": 1,
  "added_snapshot_id": {"long": 2641901311316255446},
  "added_data_files_count": {"int": 1},
  "existing_data_files_count": {"int": 0},
  "deleted_data_files_count": {"int": 0},
  "partitions": {"array": [
    {"contains_null": false, "contains_nan": {"boolean": false}, "lower_bound": {"bytes": "2"}, "upper_bound": {"bytes": "2"}},
    {"contains_null": false, "contains_nan": {"boolean": false}, "lower_bound": {"bytes": "b"}, "upper_bound": {"bytes": "b"}}
  ]},
  "added_rows_count": {"long": 1},
  "existing_rows_count": {"long": 0},
  "deleted_rows_count": {"long": 0}
}
{
  "manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/002e475b-e5b9-485e-a59d-35730a6c9f4e-m0.avro",
  "manifest_length": 6095,
  "partition_spec_id": 0,
  "added_snapshot_id": {"long": 3476183237498309505},
  "added_data_files_count": {"int": 1},
  "existing_data_files_count": {"int": 0},
  "deleted_data_files_count": {"int": 0},
  "partitions": {"array": [
    {"contains_null": false, "contains_nan": {"boolean": false}, "lower_bound": {"bytes": "1"}, "upper_bound": {"bytes": "1"}}
  ]},
  "added_rows_count": {"long": 1},
  "existing_rows_count": {"long": 0},
  "deleted_rows_count": {"long": 0}
}
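The key observation in this manifest list is that one snapshot references manifests written under two different specs. A minimal Python sketch of how a reader might group them (manifest file names abbreviated here for readability, not the real paths):

```python
# The two manifest-list entries above, reduced to the fields that matter here.
manifests = [
    {"manifest_path": "ed1a1f56-...-m0.avro", "partition_spec_id": 1},
    {"manifest_path": "002e475b-...-m0.avro", "partition_spec_id": 0},
]

# Group manifests by partition_spec_id: each manifest's partition tuples must
# be decoded with the spec it was written under, which a reader looks up in
# the table metadata's "partition-specs" list by spec-id.
by_spec = {}
for m in manifests:
    by_spec.setdefault(m["partition_spec_id"], []).append(m["manifest_path"])
```

This is why partition field IDs must stay stable across evolution: the same `category` field carries field-id 1000 in both specs, so data written before and after the evolution can be planned and read together.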