Iceberg Series (2): Storage Deep Dive - Partition (1)
Iceberg Partition Evolution
Table partitioning can be evolved by adding, removing, renaming, or reordering the fields of the partition spec.
Changing the partition spec produces a new spec identified by a unique spec ID. The new spec is added to the table's list of partition specs and may be set as the table's default spec.
When evolving a partition spec, the changes must not alter partition field IDs, because partition field IDs are used as the partition tuple field IDs in manifest files.
In v2, the partition field ID must be tracked explicitly for every partition field. New IDs are assigned based on the last-assigned partition ID recorded in table metadata.
In v1, partition field IDs were not tracked; they were simply assigned sequentially starting at 1000. This assignment scheme causes problems when manifest-based metadata tables are read across multiple specs, because partition fields with the same ID may contain different data types. For compatibility with older versions, the following rules are recommended for partition evolution in v1 tables:
- Do not reorder partition fields
- Do not remove partition fields; instead, replace the field's transform with the void transform
- Only add new partition fields at the end of the previous partition spec
Let's walk through a few examples:
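These v1 rules can be illustrated with a small Python sketch (illustrative only, not Iceberg code, and `evolve_spec` is a hypothetical helper): field IDs are positional, starting at 1000, so dropping a field swaps in a void transform rather than removing it, which keeps the IDs of later fields stable.

```python
# Sketch of v1-style partition field-id assignment. Each field is a
# (name, transform, field_id) tuple; IDs start at 1000 and follow position.

def evolve_spec(spec, add=None, drop=None):
    """Return a new spec list evolved per the v1 compatibility rules."""
    new_spec = []
    for name, transform, field_id in spec:
        if name == drop:
            # v1 rule: replace the transform with "void", keep the field and its ID
            new_spec.append((name, "void", field_id))
        else:
            new_spec.append((name, transform, field_id))
    if add is not None:
        # v1 rule: only append at the end; the ID is 1000 + position
        next_id = 1000 + len(new_spec)
        new_spec.append((add, "identity", next_id))
    return new_spec

spec0 = [("category", "identity", 1000)]
spec1 = evolve_spec(spec0, add="data")       # "data" gets field-id 1001
spec2 = evolve_spec(spec1, drop="category")  # "category" becomes void, keeps 1000
```

Because the voided field still occupies its slot, every surviving field keeps the same positional ID across specs, which is exactly what manifest readers rely on.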
CREATE TABLE local.db.sample (id bigint, data string, category string)
USING iceberg
PARTITIONED BY (category);

INSERT INTO local.db.sample VALUES (1, 'a', '1');
Inspect the metadata file:
{
  "format-version" : 1,
  "table-uuid" : "94ad30ed-4a31-438d-b81b-36d791471d2c",
  "location" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample",
  "last-updated-ms" : 1642174094175,
  "last-column-id" : 3,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [
      { "id" : 1, "name" : "id", "required" : false, "type" : "long" },
      { "id" : 2, "name" : "data", "required" : false, "type" : "string" },
      { "id" : 3, "name" : "category", "required" : false, "type" : "string" }
    ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [
      { "id" : 1, "name" : "id", "required" : false, "type" : "long" },
      { "id" : 2, "name" : "data", "required" : false, "type" : "string" },
      { "id" : 3, "name" : "category", "required" : false, "type" : "string" }
    ]
  } ],
  "partition-spec" : [ { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 } ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 } ]
  } ],
  "last-partition-id" : 1000,
  "default-sort-order-id" : 0,
  "sort-orders" : [ { "order-id" : 0, "fields" : [ ] } ],
  "properties" : { "owner" : "liliwei" },
  "current-snapshot-id" : 3476183237498309505,
  "snapshots" : [ {
    "snapshot-id" : 3476183237498309505,
    "timestamp-ms" : 1642174094175,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1642173017469",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "874",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "874",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ { "timestamp-ms" : 1642174094175, "snapshot-id" : 3476183237498309505 } ],
  "metadata-log" : [ { "timestamp-ms" : 1642173226793, "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/v1.metadata.json" } ]
}
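To make the partition-related keys easy to spot, here is a small Python snippet that picks them out of a hand-copied excerpt of the metadata above (only the partition keys are reproduced; everything else is omitted for brevity):

```python
import json

# Partition-related excerpt of the metadata file shown above.
metadata = json.loads("""
{
  "partition-spec": [
    {"name": "category", "transform": "identity", "source-id": 3, "field-id": 1000}
  ],
  "default-spec-id": 0,
  "partition-specs": [
    {"spec-id": 0, "fields": [
      {"name": "category", "transform": "identity", "source-id": 3, "field-id": 1000}
    ]}
  ],
  "last-partition-id": 1000
}
""")

# Map each spec-id to its (name, transform, field-id) triples. The very first
# partition field gets field-id 1000 (v1 sequential assignment), and
# last-partition-id records the highest ID assigned so far.
specs = {
    spec["spec-id"]: [(f["name"], f["transform"], f["field-id"]) for f in spec["fields"]]
    for spec in metadata["partition-specs"]
}
```

At this point the table has a single spec (spec-id 0) with one identity-partition field on `category`.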
Inspect the snapshot (manifest list) file:
java -jar ~/plat/tools/avro-tools-1.10.2.jar tojson snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro
{
  "manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/002e475b-e5b9-485e-a59d-35730a6c9f4e-m0.avro",
  "manifest_length": 6095,
  "partition_spec_id": 0,
  "added_snapshot_id": {"long": 3476183237498309505},
  "added_data_files_count": {"int": 1},
  "existing_data_files_count": {"int": 0},
  "deleted_data_files_count": {"int": 0},
  "partitions": {"array": [
    {"contains_null": false, "contains_nan": {"boolean": false}, "lower_bound": {"bytes": "1"}, "upper_bound": {"bytes": "1"}}
  ]},
  "added_rows_count": {"long": 1},
  "existing_rows_count": {"long": 0},
  "deleted_rows_count": {"long": 0}
}
Now evolve the partition spec by adding a partition field:

ALTER TABLE local.db.sample ADD PARTITION FIELD data;
Inspect the directory structure:
(base) ➜ metadata tree -l
.
├── 002e475b-e5b9-485e-a59d-35730a6c9f4e-m0.avro
├── snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro
├── v1.metadata.json
├── v2.metadata.json
├── v3.metadata.json
└── version-hint.text

0 directories, 6 files
Inspect the v3.metadata.json file:
{
  "format-version" : 1,
  "table-uuid" : "94ad30ed-4a31-438d-b81b-36d791471d2c",
  "location" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample",
  "last-updated-ms" : 1642175874398,
  "last-column-id" : 3,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [
      { "id" : 1, "name" : "id", "required" : false, "type" : "long" },
      { "id" : 2, "name" : "data", "required" : false, "type" : "string" },
      { "id" : 3, "name" : "category", "required" : false, "type" : "string" }
    ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [
      { "id" : 1, "name" : "id", "required" : false, "type" : "long" },
      { "id" : 2, "name" : "data", "required" : false, "type" : "string" },
      { "id" : 3, "name" : "category", "required" : false, "type" : "string" }
    ]
  } ],
  "partition-spec" : [
    { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 },
    { "name" : "data", "transform" : "identity", "source-id" : 2, "field-id" : 1001 }
  ],
  "default-spec-id" : 1,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 } ]
  }, {
    "spec-id" : 1,
    "fields" : [
      { "name" : "category", "transform" : "identity", "source-id" : 3, "field-id" : 1000 },
      { "name" : "data", "transform" : "identity", "source-id" : 2, "field-id" : 1001 }
    ]
  } ],
  "last-partition-id" : 1001,
  "default-sort-order-id" : 0,
  "sort-orders" : [ { "order-id" : 0, "fields" : [ ] } ],
  "properties" : { "owner" : "liliwei" },
  "current-snapshot-id" : 3476183237498309505,
  "snapshots" : [ {
    "snapshot-id" : 3476183237498309505,
    "timestamp-ms" : 1642174094175,
    "summary" : {
      "operation" : "append",
      "spark.app.id" : "local-1642173017469",
      "added-data-files" : "1",
      "added-records" : "1",
      "added-files-size" : "874",
      "changed-partition-count" : "1",
      "total-records" : "1",
      "total-files-size" : "874",
      "total-data-files" : "1",
      "total-delete-files" : "0",
      "total-position-deletes" : "0",
      "total-equality-deletes" : "0"
    },
    "manifest-list" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro",
    "schema-id" : 0
  } ],
  "snapshot-log" : [ { "timestamp-ms" : 1642174094175, "snapshot-id" : 3476183237498309505 } ],
  "metadata-log" : [
    { "timestamp-ms" : 1642173226793, "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/v1.metadata.json" },
    { "timestamp-ms" : 1642174094175, "metadata-file" : "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/v2.metadata.json" }
  ]
}
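The spec evolution is easiest to see by comparing field IDs across the two specs. The following Python snippet does that against a hand-copied, partition-only excerpt of the v3 metadata above:

```python
import json

# Partition-related excerpt of v3.metadata.json shown above.
metadata = json.loads("""
{
  "default-spec-id": 1,
  "partition-specs": [
    {"spec-id": 0, "fields": [
      {"name": "category", "transform": "identity", "source-id": 3, "field-id": 1000}
    ]},
    {"spec-id": 1, "fields": [
      {"name": "category", "transform": "identity", "source-id": 3, "field-id": 1000},
      {"name": "data", "transform": "identity", "source-id": 2, "field-id": 1001}
    ]}
  ],
  "last-partition-id": 1001
}
""")

# Map spec-id -> {field name: field-id}. "category" keeps field-id 1000 in both
# specs, the new "data" field received the next sequential ID (1001), and the
# new spec became the table's default.
field_ids = {
    spec["spec-id"]: {f["name"]: f["field-id"] for f in spec["fields"]}
    for spec in metadata["partition-specs"]
}
```

Note that the old spec (spec-id 0) is retained in `partition-specs`: existing manifests written under it remain readable, and `last-partition-id` advanced from 1000 to 1001.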
Insert another row:

INSERT INTO local.db.sample VALUES (2, 'b', '2');
Inspect the directory structure:
(base) ➜ metadata tree -l
.
├── 002e475b-e5b9-485e-a59d-35730a6c9f4e-m0.avro
├── ed1a1f56-56fc-4313-bf60-10df0c4e88ca-m0.avro
├── snap-2641901311316255446-1-ed1a1f56-56fc-4313-bf60-10df0c4e88ca.avro
├── snap-3476183237498309505-1-002e475b-e5b9-485e-a59d-35730a6c9f4e.avro
├── v1.metadata.json
├── v2.metadata.json
├── v3.metadata.json
├── v4.metadata.json
└── version-hint.text

0 directories, 9 files
Inspect the new snapshot (manifest list) file:

java -jar ~/plat/tools/avro-tools-1.10.2.jar tojson snap-2641901311316255446-1-ed1a1f56-56fc-4313-bf60-10df0c4e88ca.avro
{
  "manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/ed1a1f56-56fc-4313-bf60-10df0c4e88ca-m0.avro",
  "manifest_length": 6301,
  "partition_spec_id": 1,
  "added_snapshot_id": {"long": 2641901311316255446},
  "added_data_files_count": {"int": 1},
  "existing_data_files_count": {"int": 0},
  "deleted_data_files_count": {"int": 0},
  "partitions": {"array": [
    {"contains_null": false, "contains_nan": {"boolean": false}, "lower_bound": {"bytes": "2"}, "upper_bound": {"bytes": "2"}},
    {"contains_null": false, "contains_nan": {"boolean": false}, "lower_bound": {"bytes": "b"}, "upper_bound": {"bytes": "b"}}
  ]},
  "added_rows_count": {"long": 1},
  "existing_rows_count": {"long": 0},
  "deleted_rows_count": {"long": 0}
}
{
  "manifest_path": "/Users/liliwei/plat/spark-3.1.2-bin-hadoop3.2/warehouse/db/sample/metadata/002e475b-e5b9-485e-a59d-35730a6c9f4e-m0.avro",
  "manifest_length": 6095,
  "partition_spec_id": 0,
  "added_snapshot_id": {"long": 3476183237498309505},
  "added_data_files_count": {"int": 1},
  "existing_data_files_count": {"int": 0},
  "deleted_data_files_count": {"int": 0},
  "partitions": {"array": [
    {"contains_null": false, "contains_nan": {"boolean": false}, "lower_bound": {"bytes": "1"}, "upper_bound": {"bytes": "1"}}
  ]},
  "added_rows_count": {"long": 1},
  "existing_rows_count": {"long": 0},
  "deleted_rows_count": {"long": 0}
}
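The key observation in this manifest list is that one snapshot references manifests written under two different specs. A minimal Python sketch of how a reader might group them (manifest file names abbreviated here for readability, not the real paths):

```python
# The two manifest-list entries above, reduced to the fields that matter here.
manifests = [
    {"manifest_path": "ed1a1f56-...-m0.avro", "partition_spec_id": 1},
    {"manifest_path": "002e475b-...-m0.avro", "partition_spec_id": 0},
]

# Group manifests by partition_spec_id: each manifest's partition tuples must
# be decoded with the spec it was written under, which a reader looks up in
# the table metadata's "partition-specs" list by spec-id.
by_spec = {}
for m in manifests:
    by_spec.setdefault(m["partition_spec_id"], []).append(m["manifest_path"])
```

This is why partition field IDs must stay stable across evolution: the same `category` field carries field-id 1000 in both specs, so data written before and after the evolution can be planned and read together.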