A First Look at Hyperspace: Indexing Delta Lake Tables

1. Introduction

Hyperspace is an open-source indexing subsystem for data lakes developed by Microsoft.

1.1 Features

  • Provides a well-defined set of index management APIs (4.1 Creating an Index, 5.2 Incremental Refresh)

    • Gives users more control, since users know their own use cases best
    • Does not try to solve every problem; some problems have no single right answer
  • Independent of the data and its metadata: each index keeps its own metadata/log (4.2 The Hyperspace Log)

  • Aware of the underlying data's version, so it can "time travel" together with the data (5.1 Time Travel Support)

  • Thanks to Hybrid Scan, the index does not need to be fully in sync with the original data (5.3 Hybrid Scan)

1.2 In Development

The Hyperspace project is quite active, and several important features are still under development. To preview them, you can try the code on the master branch:

  • Spark 3.x: https://github.com/microsoft/hyperspace/issues/407

  • Iceberg: https://github.com/microsoft/hyperspace/issues/318

  • Index implementations

    • Z-ordering: https://github.com/microsoft/hyperspace/issues/515
    • Data skipping: https://github.com/microsoft/hyperspace/issues/441

2. Installation

2.1 Software Versions

  • Hyperspace 0.4.0 supports indexing Delta Lake tables
  • Delta Lake 0.6.1 is the last release that supports Spark 2.x
  • Spark 2.4.2 with Scala 2.12.8 is compatible with the Hyperspace and Delta Lake versions above

Since support for Apache Spark 3.x is still under development, Spark 2.4.2 appears to be the only Spark version at the moment on which Hyperspace and Delta Lake work together.
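
If you are building a standalone project instead of using spark-shell, the same versions can be pinned in the build file. Below is a minimal build.sbt sketch based on the coordinates used in 2.2; the exact project layout is up to you:

scalaVersion := "2.12.8"

libraryDependencies ++= Seq(
  "org.apache.spark"         %% "spark-sql"       % "2.4.2" % "provided",
  "io.delta"                 %% "delta-core"      % "0.6.1",
  "com.microsoft.hyperspace" %% "hyperspace-core" % "0.4.0"
)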

2.2 Launching spark-shell

$ bin/spark-shell --packages io.delta:delta-core_2.12:0.6.1,com.microsoft.hyperspace:hyperspace-core_2.12:0.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
Ivy Default Cache set to: /Users/daichen/.ivy2/cache
The jars for the packages stored in: /Users/daichen/.ivy2/jars
:: loading settings :: url = jar:file:/Users/daichen/Software/spark-2.4.2-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
io.delta#delta-core_2.12 added as a dependency
com.microsoft.hyperspace#hyperspace-core_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-040a057e-2b7f-4d3a-a658-b93e6474dc47;1.0
        confs: [default]
        found io.delta#delta-core_2.12;0.6.1 in central
        found org.antlr#antlr4;4.7 in central
        found org.antlr#antlr4-runtime;4.7 in local-m2-cache
        found org.antlr#antlr-runtime;3.5.2 in local-m2-cache
        found org.antlr#ST4;4.0.8 in local-m2-cache
        found org.abego.treelayout#org.abego.treelayout.core;1.0.3 in local-m2-cache
        found org.glassfish#javax.json;1.0.4 in local-m2-cache
        found com.ibm.icu#icu4j;58.2 in local-m2-cache
        found com.microsoft.hyperspace#hyperspace-core_2.12;0.4.0 in central
:: resolution report :: resolve 351ms :: artifacts dl 9ms
        :: modules in use:
        com.ibm.icu#icu4j;58.2 from local-m2-cache in [default]
        com.microsoft.hyperspace#hyperspace-core_2.12;0.4.0 from central in [default]
        io.delta#delta-core_2.12;0.6.1 from central in [default]
        org.abego.treelayout#org.abego.treelayout.core;1.0.3 from local-m2-cache in [default]
        org.antlr#ST4;4.0.8 from local-m2-cache in [default]
        org.antlr#antlr-runtime;3.5.2 from local-m2-cache in [default]
        org.antlr#antlr4;4.7 from central in [default]
        org.antlr#antlr4-runtime;4.7 from local-m2-cache in [default]
        org.glassfish#javax.json;1.0.4 from local-m2-cache in [default]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   9   |   0   |   0   |   0   ||   9   |   0   |
        ---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-040a057e-2b7f-4d3a-a658-b93e6474dc47
        confs: [default]
        0 artifacts copied, 9 already retrieved (0kB/8ms)
21/12/09 16:15:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.4.51:4040
Spark context available as 'sc' (master = local[*], app id = local-1639095359127).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.2
      /_/

Using Scala version 2.12.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_312)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
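
The two --conf flags map directly onto session configs, so in an application you can set them when constructing the SparkSession instead of on the command line. A minimal sketch (the app name and master are placeholders, and the Delta/Hyperspace jars are assumed to already be on the classpath):

import org.apache.spark.sql.SparkSession

// Mirror the spark-shell flags above: register the Delta SQL extension and the Delta catalog.
val spark = SparkSession.builder()
  .appName("hyperspace-delta-demo")  // placeholder
  .master("local[*]")                // placeholder
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()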

3. The Delta Lake Table

3.1 Creating the Table

All of the examples in this article use the employees.json dataset that ships with Spark. First, let's use this dataset to create a partitioned Delta Lake table.

scala> val employees = spark.read.json("examples/src/main/resources/employees.json")
employees: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> employees.show()
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
|   Andy|  4500|
| Justin|  3500|
|  Berta|  4000|
+-------+------+

scala> employees.write.partitionBy("salary").format("delta").save("/tmp/delta-table/employees")

Now let's take a look at the Delta Lake directory structure:

$ tree /tmp/delta-table/employees
/tmp/delta-table/employees
├── _delta_log
│   └── 00000000000000000000.json
├── salary=3000
│   └── part-00000-e0493ade-23d4-402e-bf43-0ad727b0754e.c000.snappy.parquet
├── salary=3500
│   └── part-00000-5bae7fa3-43c3-43a7-85a2-286848b5d589.c000.snappy.parquet
├── salary=4000
│   └── part-00000-090c7267-5a47-439a-8ca6-3ae670969ac3.c000.snappy.parquet
└── salary=4500
    └── part-00000-375f485f-d021-4625-b598-1f86b3ad8953.c000.snappy.parquet

3.2 The Delta Lake Log

The Delta log records every addition and removal of data files (data lake storage such as S3 does not support updating, or even appending to, existing files). The log itself consists of JSON files, periodically compacted into Parquet checkpoints:

$ cat /tmp/delta-table/employees/_delta_log/00000000000000000000.json
{"commitInfo":{"timestamp":1639095874103,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[\"salary\"]"},"isBlindAppend":true,"operationMetrics":{"numFiles":"4","numOutputBytes":"1679","numOutputRows":"4"}}}
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"0c5bbb47-1ee3-4637-883c-61284a73501b","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{}},{\"name\":\"salary\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":["salary"],"configuration":{},"createdTime":1639095873565}}
{"add":{"path":"salary=3000/part-00000-e0493ade-23d4-402e-bf43-0ad727b0754e.c000.snappy.parquet","partitionValues":{"salary":"3000"},"size":434,"modificationTime":1639095873996,"dataChange":true}}
{"add":{"path":"salary=3500/part-00000-5bae7fa3-43c3-43a7-85a2-286848b5d589.c000.snappy.parquet","partitionValues":{"salary":"3500"},"size":425,"modificationTime":1639095874020,"dataChange":true}}
{"add":{"path":"salary=4000/part-00000-090c7267-5a47-439a-8ca6-3ae670969ac3.c000.snappy.parquet","partitionValues":{"salary":"4000"},"size":416,"modificationTime":1639095874044,"dataChange":true}}
{"add":{"path":"salary=4500/part-00000-375f485f-d021-4625-b598-1f86b3ad8953.c000.snappy.parquet","partitionValues":{"salary":"4500"},"size":404,"modificationTime":1639095874069,"dataChange":true}}
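
Since every entry in the log is a JSON document, the log can also be inspected with Spark itself. A minimal sketch that reads the first commit of the table created above and keeps only the file-addition actions:

// Each line of the commit file is one action (commitInfo, protocol, metaData, add, ...).
val log = spark.read.json("/tmp/delta-table/employees/_delta_log/00000000000000000000.json")

// Show which Parquet file was added to which partition.
log.where("add is not null")
  .select("add.path", "add.partitionValues", "add.size")
  .show(false)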

4. Hyperspace Basics

4.1 Creating an Index

Let's start by creating an index (Hyperspace currently focuses on covering indexes: the index also stores the included columns, so a query can be answered entirely from the index without touching the original data):

scala> val employees = spark.read.format("delta").load("/tmp/delta-table/employees")
employees: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> spark.conf.set("spark.hyperspace.index.sources.fileBasedBuilders",
     |   "com.microsoft.hyperspace.index.sources.delta.DeltaLakeFileBasedSourceBuilder," +
     |   "com.microsoft.hyperspace.index.sources.default.DefaultFileBasedSourceBuilder")

scala> import com.microsoft.hyperspace._
import com.microsoft.hyperspace._

scala> val hyperspace = new Hyperspace(spark)
hyperspace: com.microsoft.hyperspace.Hyperspace = com.microsoft.hyperspace.Hyperspace@515a3572

scala> import com.microsoft.hyperspace.index._
import com.microsoft.hyperspace.index._

scala> hyperspace.createIndex(employees, IndexConfig("deltaIndex", indexedColumns = Seq("name"), includedColumns = Seq("salary")))
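
To confirm that the index was registered, Hyperspace exposes its index catalog as a DataFrame. A minimal sketch:

// Lists every Hyperspace index known to this session, with its indexed/included columns and state.
hyperspace.indexes.show(false)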

4.2 The Hyperspace Log

spark-2.4.2-bin-hadoop2.7 $ tree spark-warehouse
spark-warehouse
└── indexes
    └── deltaIndex
        ├── _hyperspace_log
        │   ├── 0
        │   ├── 1
        │   └── latestStable
        └── v__=0
            ├── _SUCCESS
            ├── part-00071-46fed65d-a011-4d69-9267-30c40c623a78_00071.c000.snappy.parquet
            ├── part-00164-46fed65d-a011-4d69-9267-30c40c623a78_00164.c000.snappy.parquet
            ├── part-00165-46fed65d-a011-4d69-9267-30c40c623a78_00165.c000.snappy.parquet
            └── part-00169-46fed65d-a011-4d69-9267-30c40c623a78_00169.c000.snappy.parquet

4.3 Explaining Queries with the Index

scala> val query = employees.filter(employees("name") === "Andy").select("salary")
query: org.apache.spark.sql.DataFrame = [salary: bigint]

scala> query.explain
== Physical Plan ==
*(1) Project [salary#325L]
+- *(1) Filter (isnotnull(name#324) && (name#324 = Andy))
   +- *(1) FileScan parquet [name#324,salary#325L] Batched: true, Format: Parquet, Location: TahoeLogFileIndex[file:/tmp/delta-table/employees], PartitionCount: 4, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Andy)], ReadSchema: struct<name:string>

scala> query.show
+------+
|salary|
+------+
|  4500|
+------+

scala> spark.enableHyperspace
res3: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@ac52e35

scala> query.explain
== Physical Plan ==
*(1) Project [salary#325L]
+- *(1) Filter (isnotnull(name#324) && (name#324 = Andy))
   +- *(1) FileScan Hyperspace(Type: CI, Name: deltaIndex, LogVersion: 1) [name#324,salary#325L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/daichen/Software/spark-2.4.2-bin-hadoop2.7/spark-warehouse/indexes/..., PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Andy)], ReadSchema: struct<name:string,salary:bigint>

scala> query.show
+------+
|salary|
+------+
|  4500|
+------+

Beyond the Spark DataFrame explain, Hyperspace's own explain provides more detail:

scala> hyperspace.explain(query, verbose = true)
=============================================================
Plan with indexes:
=============================================================
Project [salary#325L]
+- Filter (isnotnull(name#324) && (name#324 = Andy))
   <----+- FileScan Hyperspace(Type: CI, Name: deltaIndex, LogVersion: 1) [name#324,salary#325L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/daichen/Software/spark-2.4.2-bin-hadoop2.7/spark-warehouse/indexes/..., PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Andy)], ReadSchema: struct<name:string,salary:bigint>---->

=============================================================
Plan without indexes:
=============================================================
Project [salary#325L]
+- Filter (isnotnull(name#324) && (name#324 = Andy))
   <----+- FileScan parquet [name#324,salary#325L] Batched: true, Format: Parquet, Location: TahoeLogFileIndex[file:/tmp/delta-table/employees], PartitionCount: 4, PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Andy)], ReadSchema: struct<name:string>---->

=============================================================
Indexes used:
=============================================================
deltaIndex:file:/Users/daichen/Software/spark-2.4.2-bin-hadoop2.7/spark-warehouse/indexes/deltaIndex/v__=0

=============================================================
Physical operator stats:
=============================================================
+-----------------------------------------------------------+-------------------+------------------+----------+
|                                          Physical Operator|Hyperspace Disabled|Hyperspace Enabled|Difference|
+-----------------------------------------------------------+-------------------+------------------+----------+
|*Scan Hyperspace(Type: CI, Name: deltaIndex, LogVersion: 1)|                  0|                 1|         1|
|                                              *Scan parquet|                  1|                 0|        -1|
|                                                     Filter|                  1|                 1|         0|
|                                                    Project|                  1|                 1|         0|
|                                          WholeStageCodegen|                  1|                 1|         0|
+-----------------------------------------------------------+-------------------+------------------+----------+
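
Index-aware planning can be toggled per session through the implicits imported from com.microsoft.hyperspace._. A minimal sketch:

spark.disableHyperspace    // subsequent plans go back to scanning the Delta files directly
spark.isHyperspaceEnabled  // returns false at this point
spark.enableHyperspace     // turn index-based planning back on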

5. Hyperspace Advanced Topics

5.1 Time Travel Support

Appending new data

First we append one new row to the Delta Lake table, which produces a new table version:

scala> val columns = Seq("name", "salary")
columns: Seq[String] = List(name, salary)

scala> val data = Seq(("Chen", 1000L))
data: Seq[(String, Long)] = List((Chen,1000))

scala> val newEmployees = spark.createDataFrame(data).toDF(columns:_*)
newEmployees: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> newEmployees.write.mode("append").partitionBy("salary").format("delta").save("/tmp/delta-table/employees")

scala> val employees = spark.read.format("delta").load("/tmp/delta-table/employees")
employees: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> employees.show()
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
| Justin|  3500|
|  Berta|  4000|
|   Andy|  4500|
|   Chen|  1000|
+-------+------+

In the Delta log we can see that a new log file and a new data file have indeed been created:

$ tree /tmp/delta-table/employees
/tmp/delta-table/employees
├── _delta_log
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── salary=1000
│   └── part-00000-7c9cd83b-6602-4b9a-816c-12aee7f8eccf.c000.snappy.parquet
├── salary=3000
│   └── part-00000-e0493ade-23d4-402e-bf43-0ad727b0754e.c000.snappy.parquet
├── salary=3500
│   └── part-00000-5bae7fa3-43c3-43a7-85a2-286848b5d589.c000.snappy.parquet
├── salary=4000
│   └── part-00000-090c7267-5a47-439a-8ca6-3ae670969ac3.c000.snappy.parquet
└── salary=4500
    └── part-00000-375f485f-d021-4625-b598-1f86b3ad8953.c000.snappy.parquet

$ cat /tmp/delta-table/employees/_delta_log/00000000000000000001.json
{"commitInfo":{"timestamp":1639161717370,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[\"salary\"]"},"readVersion":0,"isBlindAppend":true,"operationMetrics":{"numFiles":"1","numOutputBytes":"404","numOutputRows":"1"}}}
{"add":{"path":"salary=1000/part-00000-7c9cd83b-6602-4b9a-816c-12aee7f8eccf.c000.snappy.parquet","partitionValues":{"salary":"1000"},"size":404,"modificationTime":1639161717341,"dataChange":true}}
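
The same version information can also be read through the Delta Lake API instead of the raw log files. A minimal sketch using DeltaTable from Delta Lake 0.6.1:

import io.delta.tables.DeltaTable

// Version 0 is the initial write, version 1 the append above.
val table = DeltaTable.forPath(spark, "/tmp/delta-table/employees")
table.history().select("version", "timestamp", "operation").show(false)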

Time Travel

Hyperspace detects which version of the Delta Lake table is being read and uses the index data built for that earlier version accordingly, which guarantees correct query results.

scala> val oldEmployees = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table/employees")
oldEmployees: org.apache.spark.sql.DataFrame = [name: string, salary: bigint]

scala> oldEmployees.show()
+-------+------+
|   name|salary|
+-------+------+
|Michael|  3000|
| Justin|  3500|
|  Berta|  4000|
|   Andy|  4500|
+-------+------+

scala> oldEmployees.filter(oldEmployees("name") === "Andy").select("salary").show()
+------+
|salary|
+------+
|  4500|
+------+

scala> oldEmployees.filter(oldEmployees("name") === "Andy").select("salary").explain()
== Physical Plan ==
*(1) Project [salary#1180L]
+- *(1) Filter (isnotnull(name#1179) && (name#1179 = Andy))
   +- *(1) FileScan Hyperspace(Type: CI, Name: deltaIndex, LogVersion: 1) [name#1179,salary#1180L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/daichen/Software/spark-2.4.2-bin-hadoop2.7/spark-warehouse/indexes/..., PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Andy)], ReadSchema: struct<name:string,salary:bigint>
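
For completeness, Delta Lake can also time travel by timestamp rather than by version number. A minimal sketch (the timestamp string is a placeholder and must fall before the append above for this to resolve to version 0):

// Equivalent to versionAsOf = 0 as long as the timestamp predates the second commit.
val oldByTime = spark.read.format("delta")
  .option("timestampAsOf", "2021-12-09 17:00:00")  // placeholder timestamp
  .load("/tmp/delta-table/employees")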

5.2 Incremental Refresh

Hyperspace provides APIs for both full and incremental refresh of the index data. After an incremental refresh like the one below, the index can again be applied to the latest version of the Delta Lake table:

scala> hyperspace.refreshIndex("deltaIndex", "incremental")

scala> val query = employees.filter(employees("name") === "Andy").select("salary")
query: org.apache.spark.sql.DataFrame = [salary: bigint]

scala> query.explain
== Physical Plan ==
*(1) Project [salary#5066L]
+- *(1) Filter (isnotnull(name#5065) && (name#5065 = Andy))
   +- *(1) FileScan Hyperspace(Type: CI, Name: deltaIndex, LogVersion: 3) [name#5065,salary#5066L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/daichen/Software/spark-2.4.2-bin-hadoop2.7/spark-warehouse/indexes/..., PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Andy)], ReadSchema: struct<name:string,salary:bigint>
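
refreshIndex is only one of several index management calls. A brief, non-exhaustive sketch of the related APIs in Hyperspace 0.4.0 (the comments summarize their documented behavior and are approximate):

hyperspace.refreshIndex("deltaIndex", "full")  // full rebuild instead of an incremental refresh
hyperspace.optimizeIndex("deltaIndex")         // compact small files left behind by incremental refreshes
hyperspace.deleteIndex("deltaIndex")           // soft delete: the index is ignored, its files are kept
hyperspace.restoreIndex("deltaIndex")          // undo a soft delete
hyperspace.vacuumIndex("deltaIndex")           // permanently remove a soft-deleted index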

5.3 Hybrid Scan

Because Hybrid Scan does not currently support partitioned tables, we first create a new, non-partitioned Delta Lake table.

scala> val employees = spark.read.json("examples/src/main/resources/employees.json")
scala> employees.write.format("delta").save("/tmp/delta-table/employees3")
scala> val employees3 = spark.read.format("delta").load("/tmp/delta-table/employees3")

scala> hyperspace.createIndex(employees3, IndexConfig("deltaIndex3", indexedColumns = Seq("name"), includedColumns = Seq("salary")))

scala> newEmployees.write.mode("append").format("delta").save("/tmp/delta-table/employees3")

scala> val employees3 = spark.read.format("delta").load("/tmp/delta-table/employees3")

Hybrid Scan is disabled by default, so we have to enable it first and also raise the maximum ratio at which it applies. The default maximum is 0.3, meaning Hybrid Scan does not kick in when (number of changed files) / (total number of files) > 0.3; this guards against the performance hit of forcing Hybrid Scan when a large share of the indexed files is stale. Since our new non-partitioned table has only one data file and the appended row adds a second one, the ratio here is effectively 0.5, so we must raise the limit before Hybrid Scan can take effect.

scala> spark.conf.set("spark.hyperspace.index.hybridscan.enabled", "true")
scala> spark.conf.set("spark.hyperspace.index.hybridscan.maxAppendedRatio", "1.0")

Now everything is in place. Running the query again shows behavior different from section 5.1: even though we have not refreshed the index, it is still applied to the latest version of the Delta Lake table.

scala> val query = employees3.filter(employees3("name") === "Andy").select("salary")
query: org.apache.spark.sql.DataFrame = [salary: bigint]

scala> query.explain
== Physical Plan ==
*(1) Project [salary#5571L]
+- *(1) Filter (isnotnull(name#5570) && (name#5570 = Andy))
   +- *(1) FileScan Hyperspace(Type: CI, Name: deltaIndex3, LogVersion: 3) [name#5570,salary#5571L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/daichen/Software/spark-2.4.2-bin-hadoop2.7/spark-warehouse/indexes/..., PartitionFilters: [], PushedFilters: [IsNotNull(name), EqualTo(name,Andy)], ReadSchema: struct<name:string,salary:bigint>

References

  1. Spark

    1.1 Apache Spark previous version download: https://archive.apache.org/dist/spark/spark-2.4.2/
    1.2 Delta Lake and Spark version compatibility: https://github.com/delta-io/delta/releases
    1.3 Data Frame API: https://spark.apache.org/docs/2.3.1/sql-programming-guide.html#running-sql-queries-programmatically

  2. Delta Lake

    2.1 Paper: https://databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf
    2.2 Quick start: https://docs.delta.io/latest/quick-start.html#language-scala
    2.3 CRUD doc: https://docs.delta.io/latest/delta-batch.html, https://docs.delta.io/latest/delta-utility.html#-delta-detail&language-scala

  3. Hyperspace

    3.1 Paper: http://vldb.org/pvldb/vol14/p3043-potharaju.pdf
    3.2 Quick start: https://microsoft.github.io/hyperspace/docs/ug-quick-start-guide/
    3.3 Delta Lake integration: https://microsoft.github.io/hyperspace/docs/ug-supported-data-formats/
    3.4 User guide: https://docs.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-performance-hyperspace?pivots=programming-language-scala
    3.5 Hyperspace index settings: https://github.com/microsoft/hyperspace/blob/v0.4.0/src/main/scala/com/microsoft/hyperspace/index/IndexConstants.scala
