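Contents

Official site and other references
What is Apache Iceberg?
History
Position in the data warehouse ecosystem
- Comparison with Metastore
What business problems can the Iceberg table format solve?
Why Apache Iceberg?
- Problems Netflix faced:
- Hive's weaknesses and strengths:
- Iceberg's goals:
- Iceberg's main design ideas:
- Practice at Netflix
- Challenges facing Hive tables:
  - Hive table challenge 1: moving to the cloud
  - Hive table challenge 2: near-real-time data warehouse
  - Hive table challenge 3: changes
- Iceberg's solutions
  - Challenge 1: moving to the cloud
  - Challenge 2: near-real-time data warehouse
  - Challenge 3: changes
Current state of Iceberg
- Ecosystem comparison
- Features comparison
- Community Trending
- Brief introduction to Apache Iceberg use cases
Future directions for Iceberg
User experience
Reliability and performance
Open standard
Hands-on experience
- Integrating Spark with Iceberg
  - Test environment:
  - Environment preparation:
  - Spark configuration
  - DDL tests
    - Create a test table
    - Insert and query data
    - PARTITIONED BY
    - Iceberg storage directory layout
    - ALTER TABLE ... ADD COLUMN
  - Procedure
    - Roll back a table to a specified snapshot
    - Run the rollback
  - Query tests
    - Metadata queries
    - Time travel (only on spark-shell)
  - Write tests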

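Official site and other references

Official site
Apache Iceberg

GitHub - apache Iceberg

Alibaba Cloud: Iceberg product comparison and quick start
Iceberg overview (aliyun.com)

Iceberg core features and architecture
[2] Core features of Iceberg in a data lake architecture, TRX1024's blog on CSDN

Understanding catalogs
A comparative study of the Flink, Iceberg, and Hive catalogs, by 滴普科技程序员部落 - Zhihu (zhihu.com)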

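What is Apache Iceberg?

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink and Hive using a high-performance table format that works just like a SQL table.
Interpretation:
By this definition, Iceberg is an open-source table format for analytics on massive datasets (the author prefers to keep the English term "Table Format"). In other words, Iceberg is essentially a table format.
So what is a table format? Is it the same thing as the file formats we already know? No: a table and a table format are two different concepts. A table is a concrete, application-level concept; what we call a table every day is simply a combination of rows and columns. A table format is an abstract concept at the level of database implementation. It defines which fields a table contains, how the files under the table are organized, the table's index and statistics information, and the interfaces through which upper-layer query engines read and write the table's files.

Apache Iceberg is a new format for tracking very large tables, designed specifically for object stores such as S3.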

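History

Apache Iceberg was developed and open-sourced by Netflix and entered the Apache Incubator on November 16, 2018. It is the foundation of Netflix's data warehouse. Functionally it is similar to the more familiar Delta Lake and Apache Hudi, though each has its own strengths and weaknesses.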

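Position in the data warehouse ecosystem

For a detailed explanation, see the article "Apache Iceberg: principles and techniques you need to know" (Apache Iceberg 你需要知道的原理与技术).

Comparison with Metastore

The right side of the figure above shows Iceberg's position in the data warehouse ecosystem. The component roughly equivalent to it is Metastore.

Metastore, however, is a service, whereas Iceberg is just a set of JAR files. Since both can be regarded as table formats, we can compare them in terms of schema, partitioning, metadata/index, and read/write API:

1. The schemas are essentially the same.

Both rely on file formats such as Parquet and ORC underneath. These file formats support complex data types, so the upper layer only needs some adaptation work to support complex data types as well.

2. Partitioning is implemented completely differently.
In Iceberg, a partition field is simply a field of the table. Every Iceberg table has a corresponding file metadata table.

3. Table statistics are kept at different granularities.

4. The read/write APIs are implemented differently.

The Metastore table format does not support incremental pulls, whereas the Iceberg table format does. Iceberg also supports file-level predicate filtering, which gives better query performance.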
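To make file-level predicate filtering more concrete, here is a minimal Python sketch. It assumes each data file carries the minimum and maximum value of an `id` column in the table metadata; the `DataFile` class, the `prune_files` function, and the file paths are hypothetical names made up for illustration and are not Iceberg's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file metadata entry: a path plus min/max of one column."""
    path: str
    lower_bound: int  # smallest `id` value stored in the file
    upper_bound: int  # largest `id` value stored in the file


def prune_files(files, lo, hi):
    """Keep only the files whose [lower_bound, upper_bound] range overlaps [lo, hi]."""
    return [f for f in files if f.upper_bound >= lo and f.lower_bound <= hi]


files = [
    DataFile("data/00000.parquet", 1, 100),
    DataFile("data/00001.parquet", 101, 200),
    DataFile("data/00002.parquet", 201, 300),
]

# A query such as WHERE id BETWEEN 150 AND 220 only needs to open the files
# whose value range can contain a match; the rest are skipped without being read.
selected = prune_files(files, 150, 220)
print([f.path for f in selected])
```

The point of the sketch is that the decision is made from metadata alone, so no directory listing and no file footer reads are needed to discard the first file.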

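What business problems can the Iceberg table format solve?

Iceberg addresses several major business problems:
1. It reduces the pressure of list requests on the NameNode. [new partitioning model]
2. It improves query performance. [new partitioning model and new table statistics]
3. It lets a T+1 offline data warehouse evolve into a near-real-time warehouse with minute-level latency. [the new API provides near-real-time incremental consumption]
4. All data is stored in common open-source file formats such as Parquet, with no Lambda architecture and therefore no extra operations or machine cost.
5. Schema and partition fields can be changed efficiently and at low cost. [snapshot-based schema/partition changes]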
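The incremental consumption mentioned in item 3 can be pictured with a small Python sketch. It assumes a linear, append-only snapshot history in which every snapshot records the files it added; `Snapshot`, `incremental_files`, the snapshot IDs, and the file names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    """Hypothetical snapshot record: an ID plus the data files this commit added."""
    snapshot_id: int
    added_files: list = field(default_factory=list)


def incremental_files(snapshots, start_id, end_id):
    """Return files added after `start_id` (exclusive) up to `end_id` (inclusive).

    `snapshots` must be ordered from oldest to newest.
    """
    ids = [s.snapshot_id for s in snapshots]
    start, end = ids.index(start_id), ids.index(end_id)
    result = []
    for snap in snapshots[start + 1 : end + 1]:
        result.extend(snap.added_files)
    return result


history = [
    Snapshot(1, ["data/a.parquet"]),
    Snapshot(2, ["data/b.parquet"]),
    Snapshot(3, ["data/c.parquet", "data/d.parquet"]),
]

# A downstream job that already processed snapshot 1 only reads what came later.
new_files = incremental_files(history, start_id=1, end_id=3)
print(new_files)
```

Because the consumer remembers only the last snapshot ID it processed, each run reads just the newly committed files instead of rescanning a whole partition.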

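Why Apache Iceberg?

Apache Iceberg: the cornerstone of Netflix's data warehouse - Zhihu (zhihu.com)

Problems Netflix faced:

1. Unsafe operations were everywhere;
2. Interacting with object storage sometimes caused serious problems;
3. Endless scalability challenges.
Iceberg is a scalable table storage format with many best practices built in.

What? A storage format? We already have Parquet, Avro, and ORC, so why design a new format?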

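Hive's weaknesses and strengths:

Iceberg lets us modify or filter data within a single file, and of course across multiple files as well. To illustrate this, let's look at a Hive table.

The core idea of a Hive table is to organize data into a directory tree, as described above.

To filter data, we can add partition predicates to the WHERE clause.

The problem is that when a table has many partitions, we need HMS (Hive MetaStore) to record them, while the underlying file system (such as HDFS) still has to keep track of the data inside each partition.

As a result, state has to be kept in both HMS and the file system. Because there is no locking mechanism, a change that spans the two systems cannot be guaranteed to be atomic.

Maintaining tables this way does have benefits, of course. This design lets many engines (Hive, Spark, Presto, Flink, Pig) read and write Hive tables, and it is supported by many third-party tools. Simplicity and transparency have made Hive tables indispensable.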

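Iceberg's goals:

1. Become an open specification for exchanging data at rest: maintain a clear format specification, support multiple languages, support cross-project needs, and so on.

2. Improve scalability and reliability. It can run on a single node as well as on a cluster. All changes are atomic, with serializable isolation. It natively supports cloud object storage and multiple concurrent writers.
3. Fix persistent usability problems, with features such as schema evolution, hidden partitioning, time travel, and rollback.

Iceberg's main design ideas:

Track all files of a table across all points in time. Like Delta Lake and Apache Hudi, Iceberg supports snapshots: a snapshot is the complete list of the table's files at a given moment. Every write operation produces a new snapshot.

Reads use the current snapshot. To write, Iceberg uses optimistic locking to create a new snapshot and then commit it.

The benefits of this design are: all changes are atomic; there are no expensive file system operations; snapshots are indexed to speed up reads; CBO metrics are reliable; updates are versioned, and materialized views are supported.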
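The snapshot-plus-optimistic-commit idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Iceberg's implementation: the `Table` class, `CommitConflict`, and `append_with_retry` are hypothetical names, and an in-process lock stands in for the atomic pointer swap that a real catalog would provide.

```python
import threading


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class Table:
    """Toy table: an append-only list of snapshots plus a pointer to the current one."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for the catalog's atomic pointer swap
        self.snapshots = [()]          # snapshot 0 is the empty table
        self.current = 0

    def read(self):
        """Return (version, complete file list) of the current snapshot."""
        return self.current, self.snapshots[self.current]

    def commit(self, base_version, new_files):
        """Publish a new snapshot only if nobody committed since base_version."""
        with self._lock:
            if base_version != self.current:
                raise CommitConflict(f"based on {base_version}, table is at {self.current}")
            self.snapshots.append(self.snapshots[self.current] + tuple(new_files))
            self.current = len(self.snapshots) - 1
            return self.current


def append_with_retry(table, new_files, max_retries=3):
    """Optimistic write: read the current version, try to commit, retry on conflict."""
    for _ in range(max_retries):
        base_version, _ = table.read()
        try:
            return table.commit(base_version, new_files)
        except CommitConflict:
            continue
    raise CommitConflict("too many conflicting commits")


table = Table()
stale_version, _ = table.read()            # a second writer reads version 0 ...
append_with_retry(table, ["a.parquet"])    # ... while the first writer commits version 1
try:
    table.commit(stale_version, ["b.parquet"])
    conflict_detected = False
except CommitConflict:
    conflict_detected = True               # the stale commit is rejected
append_with_retry(table, ["b.parquet"])    # retrying against the new base succeeds
print(table.read())
```

Older snapshots are never modified, so a reader that started on version 1 keeps seeing exactly the files of version 1 while later commits land.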

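Practice at Netflix

Scale: In Netflix's production environment, Iceberg manages tens of petabytes of data and millions of partitions. Queries against large tables return with low latency.

Concurrency: In production, Flink pipelines write data in 3 AWS regions. A Lift service moves the data into a single region, and a Merge service compacts small files.

Usability: rollbacks are routine.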

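Challenges facing Hive tables:

apache iceberg - search results - Zhihu (zhihu.com)

Hive table challenge 1: moving to the cloud

The write path relies on the atomic semantics of renaming multiple files on HDFS
The read path first queries MySQL for the partition list, then LISTs directories to find files
The centralized Metastore database limits scalability

Requirements:

Support for multiple object stores
Characteristics: elastic, inexpensive, stable
Unified table semantics
High level of abstraction, ACID, multiple file formats
Interoperability across compute engines
Read and write support for Hive, Spark, Flink, and Presto

Hive table challenge 2: near-real-time data warehouse

Hive Table: hour-level freshness
Generic Table: minute-level freshness

Ingestion: HMS (Hive Metastore) is limited by its scalability, so partitioning by minute is hard
Query: first query MySQL to find partitions, then list the partition directories to find files; the metadata index is inefficient
Query: no file-level global statistics

Requirements:

Minute-level ingestion into the lake and warehouse, so data in the warehouse is fresher
More efficient indexes to speed up analysis, so queries respond faster
Incremental export from the lake and warehouse, so downstream ETL responds faster

Hive table challenge 3: changes

Problem 1: schema changes (for example, adding a field)
Problem 2: partition changes (from monthly to daily partitions)
Problem 3: CDC data changes

Requirements:

Schema changes: the table structure changes as the business changes
Partition changes: adjust the partitioning strategy to fit different analysis needs
Data changes: changes at table, partition, file, and row granularity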

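Iceberg's solutions

Iceberg's position in a production big data stack

Challenge 1: moving to the cloud

Data access does not use any LIST interface
Scalable metadata storage

ACID does not depend on the RENAME interface

Unified table semantics

Mature integration with compute engines and multi-cloud ecosystems

Decentralized, scalable metadata

Challenge 2: near-real-time data warehouse

Rich metadata indexes for acceleration

Incremental ingestion into and export from the lake

Challenge 3: changes

Fast schema changes

Lightweight partition changes

V2 supports updating data in Merge-On-Read mode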
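As a rough illustration of the Merge-On-Read idea, the Python sketch below models a row-level update as "mark the old row position as deleted, then append the new row in a new file", with the merge happening at read time. The function name, the in-memory dictionaries, and the file paths are hypothetical; real delete files are stored on disk and applied by the engine.

```python
def read_with_position_deletes(data_files, position_deletes):
    """Yield live rows, skipping every (file path, row position) marked as deleted.

    data_files: dict mapping file path -> list of rows (immutable once written)
    position_deletes: set of (file path, row position) pairs
    """
    for path, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (path, pos) not in position_deletes:
                yield row


data_files = {
    "data/a.parquet": [(1, "hello"), (2, "world")],
    "data/b.parquet": [(3, "kitty")],
}
position_deletes = set()

# UPDATE ... SET data = 'WORLD' WHERE id = 2, done without rewriting a.parquet:
position_deletes.add(("data/a.parquet", 1))   # hide the old version of the row
data_files["data/c.parquet"] = [(2, "WORLD")]  # write the new version to a new file

rows = list(read_with_position_deletes(data_files, position_deletes))
print(rows)
```

The write is cheap because existing data files are left untouched; the cost moves to readers, which must apply the delete markers on every scan until the files are compacted.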

Current state of Iceberg

Ecosystem comparison

Features comparison

Community Trending

Brief introduction to Apache Iceberg use cases

Future directions for Iceberg

User experience

Iceberg avoids unpleasant surprises. Schema evolution works and won’t inadvertently un-delete data. Users don’t need to know about partitioning to get fast queries.

Schema evolution supports add, drop, update, or rename, and has no side-effects
Hidden partitioning prevents user mistakes that cause silently incorrect results or extremely slow queries
Partition layout evolution can update the layout of a table as data volume or query patterns change
Time travel enables reproducible queries that use exactly the same table snapshot, or lets users easily examine changes
Version rollback allows users to quickly correct problems by resetting tables to a good state

Reliability and performance

Iceberg was built for huge tables. Iceberg is used in production where a single table can contain tens of petabytes of data and even these huge tables can be read without a distributed SQL engine.

Scan planning is fast – a distributed SQL engine isn’t needed to read a table or find files
Advanced filtering – data files are pruned with partition and column-level stats, using table metadata

Iceberg was designed to solve correctness problems in eventually-consistent cloud object stores.

Works with any cloud store and reduces NN congestion when in HDFS, by avoiding listing and renames
Serializable isolation – table changes are atomic and readers never see partial or uncommitted changes
Multiple concurrent writers use optimistic concurrency and will retry to ensure that compatible updates succeed, even when writes conflict

Open standard

Iceberg has been designed and developed to be an open community standard with a specification to ensure compatibility across languages and implementations.

Apache Iceberg is open source, and is developed at the Apache Software Foundation.

Hands-on experience

Integrating Spark with Iceberg

Spark 3 is recommended, as Iceberg supports it better.

Getting Started | Apache Iceberg

Test environment:

Spark 3.2.1, YARN mode

Environment preparation:

Spark download:
https://spark.apache.org/downloads.html
Apache Iceberg JAR download:
https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.2_2.12/0.13.1/iceberg-spark-runtime-3.2_2.12-0.13.1.jar

Spark configuration

Place iceberg-spark-runtime-3.2_2.12-0.13.1.jar in Spark's jars directory.

Add the following Iceberg settings to spark-defaults.conf:

## Iceberg configuration
spark.sql.catalog.hive_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hive_prod.type = hive
spark.sql.catalog.hive_prod.uri = thrift://hadoop102:9083
spark.sql.catalog.spark_catalog = org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.type = hive

This test uses the hive catalog type. To use a Hadoop catalog instead, a sample configuration is:


spark.sql.catalog.hadoop_prod = org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.hadoop_prod.type = hadoop
spark.sql.catalog.hadoop_prod.warehouse = hdfs://nn:8020/warehouse/path

DDL tests

Create a test table


CREATE TABLE hive_prod.db.sample (id bigint COMMENT 'unique id', data string)
USING iceberg;

Error:

java.lang.RuntimeException: Metastore operation failed for db.sample
    at org.apache.iceberg.hive.HiveCatalog.defaultWarehouseLocation(HiveCatalog.java:461)
    at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.create(BaseMetastoreCatalog.java:160)

The error occurs because the corresponding database has not been created in Hive.
Fix: switch to Hive and create a test database named "db".

Insert and query data


spark-sql> insert into  hive_prod.db.sample  values(1,"hello"),(2,"world");
2022-05-28 21:31:11,306 WARN conf.HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
Time taken: 2.431 seconds
spark-sql> select * from hive_prod.db.sample;
1       hello
2       world
Time taken: 0.599 seconds, Fetched 2 row(s)
spark-sql>

PARTITIONED BY


CREATE TABLE hive_prod.db.sample_partition (id bigint,data string,category string)
USING iceberg
PARTITIONED BY (category);


insert into  hive_prod.db.sample_partition  values(1,"hello","c1"),(2,"world","c2");

Support for a variety of partitioning methods


CREATE TABLE hive_prod.db.sample_hidden_partition (id bigint,data string,category string,ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category)

Supported transformations are:

years(ts): partition by year
months(ts): partition by month
days(ts) or date(ts): equivalent to dateint partitioning
hours(ts) or date_hour(ts): equivalent to dateint and hour partitioning
bucket(N, col): partition by hashed value mod N bucket
truncate(L, col): partition by value truncated to L
- Strings are truncated to the given length
- Integers and longs truncate to bins: truncate(10, i) produces partitions 0, 10, 20, 30, …
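The transforms listed above can be approximated with a short Python sketch. It assumes that the time-based transforms count whole years, months, days, or hours since 1970-01-01 UTC, and that integer truncation rounds down to a multiple of the width; the function names mirror the transform names but are plain Python written for illustration, not Iceberg code. The bucket transform is left out on purpose, because it depends on a specific hash function that this sketch does not reproduce.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def years(ts):
    """Whole years since 1970."""
    return ts.year - 1970


def months(ts):
    """Whole months since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def days(ts):
    """Whole days since 1970-01-01."""
    return (ts - EPOCH).days


def hours(ts):
    """Whole hours since 1970-01-01 00:00 UTC."""
    return int((ts - EPOCH).total_seconds() // 3600)


def truncate(width, value):
    """Strings keep their first `width` characters; integers round down to a bin."""
    if isinstance(value, str):
        return value[:width]
    return value - (value % width)


ts = datetime(2022, 5, 28, 21, 31, 11, tzinfo=timezone.utc)
print(years(ts), months(ts), days(ts), hours(ts))
print([truncate(10, i) for i in (0, 7, 10, 19, 25, 30)])
print(truncate(3, "iceberg"))
```

Because the partition value is derived from a regular column, a query only has to filter on `ts` or `id`; the engine can work out which partitions match without the user naming a partition column.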

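Iceberg storage directory layout

One directory per table

Each table has a data directory and a metadata directory

data directory:
Partitioned, with two partition directories, c1 and c2

At the lowest level, data is stored in Parquet format by default

metadata directory:
Contains files in two formats, JSON and Avro

Every insert, even of a single record, generates new metadata files; every change produces a snapshot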

ALTER TABLE … ADD COLUMN


ALTER TABLE hive_prod.db.sample
ADD COLUMNS (new_column string comment 'new_column docs'  );

Altering the table structure adds a new metadata.json file that contains the latest metadata

Procedure

To run Iceberg-specific commands such as CALL, configure the corresponding extension in the Spark environment using the following Spark property:

Spark extensions property	Iceberg extensions implementation
spark.sql.extensions	org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Roll back a table to a specified snapshot

View the table's existing data

spark-sql> select * from hive_prod.db.sample;
1       hello   NULL
2       world   NULL
3       kitty   cidy
Time taken: 0.165 seconds, Fetched 3 row(s)

Insert new data

insert into  hive_prod.db.sample  values(6,"hello","c1")

View the table's data after the insert

spark-sql> select * from hive_prod.db.sample;
1       hello   NULL
2       world   NULL
3       kitty   cidy
6       hello   c1
Time taken: 0.206 seconds, Fetched 4 row(s)

List all snapshots of the table

SELECT * FROM hive_prod.db.sample.snapshots;

spark-sql> SELECT * FROM hive_prod.db.sample.snapshots;
2022-05-28 21:31:13.426 7137918414933074729     NULL    append  hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample/metadata/snap-7137918414933074729-1-2e39cc1f-06ee-47d4-ae9d-827c5f43e6bf.avro    {"added-data-files":"2","added-files-size":"1377","added-records":"2","changed-partition-count":"1","spark.app.id":"local-1653742912913","total-data-files":"2","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"1377","total-position-deletes":"0","total-records":"2"}
2022-05-28 22:31:34.966 6771433064588335547     7137918414933074729     append  hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample/metadata/snap-6771433064588335547-1-75afb873-dc05-4d06-baf8-88aff0000bdd.avro    {"added-data-files":"1","added-files-size":"971","added-records":"1","changed-partition-count":"1","spark.app.id":"local-1653742912913","total-data-files":"3","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"2348","total-position-deletes":"0","total-records":"3"}
2022-05-28 23:36:22.994 117493589956621276      6771433064588335547     append  hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample/metadata/snap-117493589956621276-1-92ac6a66-fa5e-4700-98b0-2cdedf645b91.avro     {"added-data-files":"1","added-files-size":"956","added-records":"1","changed-partition-count":"1","spark.app.id":"local-1653751939064","total-data-files":"4","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"3304","total-position-deletes":"0","total-records":"4"}
Time taken: 0.192 seconds, Fetched 3 row(s)

Run the rollback


CALL hive_prod.system.rollback_to_snapshot('db.sample', 6771433064588335547);
CALL hive_prod.system.rollback_to_snapshot('db.sample', 7137918414933074729);
CALL hive_prod.system.rollback_to_snapshot('db.sample_partition', 8039384128018403504);

spark-sql> CALL hive_prod.system.rollback_to_snapshot('db.sample', 6771433064588335547);
6771433064588335547     6771433064588335547
Time taken: 0.265 seconds, Fetched 1 row(s)

Found a bug: the rollback appears to have failed (the output shows the same snapshot ID in both columns).
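One way to reason about what a rollback should do is to model it as moving the table's "current snapshot" pointer back to an ancestor, without deleting anything. The Python sketch below is such a model under stated assumptions: the `Table` class and its methods are hypothetical, the file names are invented, and only the snapshot IDs are taken from the snapshots listing for db.sample.

```python
class Table:
    """Toy table: every snapshot keeps its complete file list and its parent."""

    def __init__(self):
        self.files = {}       # snapshot_id -> complete tuple of data files
        self.parent = {}      # snapshot_id -> parent snapshot_id (None for the first)
        self.current_id = None

    def append(self, snapshot_id, new_files):
        """Create a new snapshot on top of the current one."""
        base = self.files.get(self.current_id, ())
        self.files[snapshot_id] = base + tuple(new_files)
        self.parent[snapshot_id] = self.current_id
        self.current_id = snapshot_id

    def ancestors(self):
        """Yield the current snapshot and all of its ancestors."""
        node = self.current_id
        while node is not None:
            yield node
            node = self.parent[node]

    def rollback_to_snapshot(self, snapshot_id):
        """Move the current pointer; return (previous id, new current id)."""
        if snapshot_id not in set(self.ancestors()):
            raise ValueError(f"{snapshot_id} is not an ancestor of the current snapshot")
        previous = self.current_id
        self.current_id = snapshot_id
        return previous, self.current_id


t = Table()
t.append(7137918414933074729, ["f1.parquet", "f2.parquet"])
t.append(6771433064588335547, ["f3.parquet"])
t.append(117493589956621276, ["f4.parquet"])

first = t.rollback_to_snapshot(6771433064588335547)
second = t.rollback_to_snapshot(6771433064588335547)
print(first)
print(second)
```

In this model the first call returns two different IDs, while a repeated call returns the same ID twice because the table is already at that snapshot. That may be worth checking when a rollback prints identical values, for example by querying the history metadata table before and after the call.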

Query tests

Iceberg uses Apache Spark’s DataSourceV2 API for data source and catalog implementations. Spark DSv2 is an evolving API with different levels of support in Spark versions:

Metadata queries

Make spark-sql print column headers
spark-sql --hiveconf hive.cli.print.header=true

spark-sql> SELECT * FROM hive_prod.db.sample_partition.files;
2022-05-29 01:43:25,419 WARN conf.HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
content file_path       file_format     spec_id partition       record_count    file_size_in_bytes      column_sizes    value_counts    null_value_counts       nan_value_counts        lower_bounds    upper_bounds    key_metadata    split_offsets   equality_ids    sort_order_id
0       hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/data/category=c2/00000-7-8be08d22-4d27-4776-9cdc-4082e7d02ff8-00001.parquet    PARQUET 0       {"category":"c2"}       1       894     {1:45,2:50,3:49}        {1:1,2:1,3:1}   {1:0,2:0,3:0}   {}      {1:,2:adc,3:c2} {1:,2:adc,3:c2} NULL    [4]     NULL    0
0       hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/data/category=c1/00000-6-332551a6-39be-4254-99ee-ec9a65b35d82-00001.parquet    PARQUET 0       {"category":"c1"}       1       916     {1:46,2:53,3:49}        {1:1,2:1,3:1}   {1:0,2:0,3:0}   {}      {1:,2:binggo,3:c1}      {1:,2:binggo,3:c1}      NULL    [4]     NULL    0
0       hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/data/category=c1/00000-3-c1c71691-12a6-4c31-92ae-93401c0783ed-00001.parquet    PARQUET 0       {"category":"c1"}       1       908     {1:46,2:51,3:49}        {1:1,2:1,3:1}   {1:0,2:0,3:0}   {}      {1:,2:hello,3:c1}       {1:,2:hello,3:c1}       NULL    [4]     NULL    0
0       hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/data/category=c2/00001-4-2f03657c-d116-4eec-9993-47f5015a4e7a-00001.parquet    PARQUET 0       {"category":"c2"}       1       909     {1:46,2:52,3:49}        {1:1,2:1,3:1}   {1:0,2:0,3:0}   {}      {1:,2:world,3:c2}       {1:,2:world,3:c2}       NULL    [4]     NULL    0
Time taken: 4.537 seconds, Fetched 4 row(s)

In the same way as files, you can query history, snapshots, and manifests

spark-sql> SELECT * FROM hive_prod.db.sample_partition.snapshots;
2022-05-29 01:49:51,188 WARN conf.HiveConf: HiveConf of name hive.metastore.event.db.notification.api.auth does not exist
committed_at    snapshot_id     parent_id       operation       manifest_list   summary
2022-05-28 21:35:10.381 8332123309395397364     NULL    append  hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/metadata/snap-8332123309395397364-1-f5d2dcac-8b41-4405-9067-718a948b3f34.avro  {"added-data-files":"2","added-files-size":"1817","added-records":"2","changed-partition-count":"2","spark.app.id":"local-1653742912913","total-data-files":"2","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"1817","total-position-deletes":"0","total-records":"2"}
2022-05-28 21:53:10.684 8039384128018403504     8332123309395397364     append  hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/metadata/snap-8039384128018403504-1-2023d96a-2449-4330-b1dd-5715c299231b.avro  {"added-data-files":"1","added-files-size":"916","added-records":"1","changed-partition-count":"1","spark.app.id":"local-1653742912913","total-data-files":"3","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"2733","total-position-deletes":"0","total-records":"3"}
2022-05-28 21:55:53.498 2416771310372483938     8039384128018403504     append  hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/metadata/snap-2416771310372483938-1-6bb78512-7939-48f5-a905-16f9bebd9649.avro  {"added-data-files":"1","added-files-size":"894","added-records":"1","changed-partition-count":"1","spark.app.id":"local-1653742912913","total-data-files":"4","total-delete-files":"0","total-equality-deletes":"0","total-files-size":"3627","total-position-deletes":"0","total-records":"4"}
Time taken: 0.235 seconds, Fetched 3 row(s)

spark-sql> SELECT * FROM hive_prod.db.sample_partition.history;
made_current_at snapshot_id     parent_id       is_current_ancestor
2022-05-28 21:35:10.381 8332123309395397364     NULL    true
2022-05-28 21:53:10.684 8039384128018403504     8332123309395397364     true
2022-05-28 21:55:53.498 2416771310372483938     8039384128018403504     true
Time taken: 0.125 seconds, Fetched 3 row(s)

spark-sql> SELECT * FROM hive_prod.db.sample_partition.manifests;
path    length  partition_spec_id       added_snapshot_id       added_data_files_count  existing_data_files_count       deleted_data_files_count        partition_summaries
hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/metadata/6bb78512-7939-48f5-a905-16f9bebd9649-m0.avro  6099    0       2416771310372483938     1       0       0       [{"contains_null":false,"contains_nan":false,"lower_bound":"c2","upper_bound":"c2"}]
hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/metadata/2023d96a-2449-4330-b1dd-5715c299231b-m0.avro  6103    0       8039384128018403504     1       0       0       [{"contains_null":false,"contains_nan":false,"lower_bound":"c1","upper_bound":"c1"}]
hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition/metadata/f5d2dcac-8b41-4405-9067-718a948b3f34-m0.avro  6162    0       8332123309395397364     2       0       0       [{"contains_null":false,"contains_nan":false,"lower_bound":"c1","upper_bound":"c2"}]
Time taken: 0.206 seconds, Fetched 3 row(s)

Time travel (only on spark-shell)

Time travel is not yet supported by Spark’s SQL syntax.
Currently only paths are supported; table identifiers will be supported later.

TODO: it is unclear what "path/to/table" should map to.


// time travel to snapshot with ID 8039384128018403504L
spark.read.option("snapshot-id", 8039384128018403504L).format("iceberg").load("hdfs://hadoop102:9820/user/hive/warehouse/db.db/sample_partition")

// time travel to snapshot with ID 10963874102873L
spark.read.option("snapshot-id", 10963874102873L).format("iceberg").load("path/to/table")
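Time travel by timestamp rather than by snapshot ID comes down to picking the last snapshot that became current at or before the requested time. The Python sketch below shows that lookup using the three rows from the history output of db.sample_partition; the `snapshot_as_of` function is a hypothetical helper written for illustration, not part of Spark or Iceberg.

```python
from bisect import bisect_right
from datetime import datetime

# (made_current_at, snapshot_id) pairs copied from the history table, oldest first.
HISTORY = [
    (datetime(2022, 5, 28, 21, 35, 10, 381000), 8332123309395397364),
    (datetime(2022, 5, 28, 21, 53, 10, 684000), 8039384128018403504),
    (datetime(2022, 5, 28, 21, 55, 53, 498000), 2416771310372483938),
]


def snapshot_as_of(history, ts):
    """Return the ID of the snapshot that was current at time `ts`."""
    times = [made_current_at for made_current_at, _ in history]
    idx = bisect_right(times, ts)  # number of snapshots made current at or before ts
    if idx == 0:
        raise ValueError("no snapshot exists at or before the requested time")
    return history[idx - 1][1]


snapshot_id = snapshot_as_of(HISTORY, datetime(2022, 5, 28, 21, 54, 0))
print(snapshot_id)
```

The resulting ID is the kind of value that would then be passed as the `snapshot-id` read option.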

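Write tests

First, copy a table.

For a new table copied with this syntax, the original table's metadata change history is collapsed, and the snapshot history starts over.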

spark-sql> create table hive_prod.db.sample2 select * from hive_prod.db.sample;
Response code
Time taken: 1.018 seconds
spark-sql> select * from hive_prod.db.sample2;
id      data    new_column
1       hello   NULL
2       world   NULL
3       kitty   cidy
6       hello   c1
Time taken: 0.331 seconds, Fetched 4 row(s)

Modify data in the copied table
Note: the id value must be in double quotes, otherwise the update fails

spark-sql> insert into hive_prod.db.sample2 values(7,'zhangsan','shanghai');
update hive_prod.db.sample2 set data="ketty" where id =3;
2022-05-29 02:35:57,307 WARN hdfs.DataStreamer: DataStreamer Exception
java.nio.channels.ClosedByInterruptException
    at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
spark-sql> update hive_prod.db.sample2 set data="ketty" where id ="3";
spark-sql> select * from hive_prod.db.sample2;
id      data    new_column
7       zhangsan        shanghai
1       hello   NULL
2       world   NULL
3       ketty   cidy
6       hello   c1
Time taken: 0.185 seconds, Fetched 5 row(s)

Test the MERGE INTO syntax

MERGE INTO prod.db.target t   -- a target table
USING (SELECT ...) s          -- the source updates
ON t.id = s.id                -- condition to find updates for target rows
WHEN ...                      -- updates

Example:

MERGE INTO hive_prod.db.sample t
USING (SELECT * from hive_prod.db.sample2 ) s
ON t.id = s.id
WHEN   MATCHED AND t.new_column  IS NULL  THEN UPDATE  SET t.new_column="default"
WHEN NOT MATCHED THEN INSERT *;

Merge result:

select * from hive_prod.db.sample;
id      data    new_column
7       zhangsan        shanghai
1       hello   default
2       world   default
3       kitty   cidy
6       hello   c1

As shown, the new record 7 has been added, and the existing records whose new_column was NULL have been updated to default.
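The semantics of that MERGE statement can be replayed in plain Python to check the result by hand. The sketch below keys both tables by id and applies the two clauses in turn; `merge_into` is a hypothetical helper, and the rows are copied from the sample and sample2 outputs shown earlier.

```python
def merge_into(target, source):
    """Replay the MERGE: update matched rows with a NULL new_column, insert unmatched rows."""
    merged = {key: dict(row) for key, row in target.items()}
    for key, source_row in source.items():
        if key in merged:
            # WHEN MATCHED AND t.new_column IS NULL THEN UPDATE SET t.new_column = "default"
            if merged[key]["new_column"] is None:
                merged[key]["new_column"] = "default"
        else:
            # WHEN NOT MATCHED THEN INSERT *
            merged[key] = dict(source_row)
    return merged


sample = {
    1: {"data": "hello", "new_column": None},
    2: {"data": "world", "new_column": None},
    3: {"data": "kitty", "new_column": "cidy"},
    6: {"data": "hello", "new_column": "c1"},
}
sample2 = {
    7: {"data": "zhangsan", "new_column": "shanghai"},
    1: {"data": "hello", "new_column": None},
    2: {"data": "world", "new_column": None},
    3: {"data": "ketty", "new_column": "cidy"},
    6: {"data": "hello", "new_column": "c1"},
}

result = merge_into(sample, sample2)
for key in sorted(result):
    print(key, result[key]["data"], result[key]["new_column"])
```

Row 3 keeps the value kitty even though sample2 holds ketty: the row matches on id, but its new_column is not NULL, so the only MATCHED clause does not apply and nothing is copied from the source.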
