Delta Lake and Data Lakes: Getting Started

Data lakes are being adopted by more and more companies looking for efficient storage for their data assets. The theory behind them is quite simple, in contrast to the industry-standard data warehouse. This post explains the logical foundation behind data lakes and presents a practical use case with a tool called Delta Lake. Enjoy!

What is a data lake?

A centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics — from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Amazon Web Services

First, the rationale behind data lakes is quite similar to that of the widely used data warehouse. Although both fall into the same category, the logic behind them is quite different. A data warehouse stores information that has already been pre-processed. In other words, the reason for storing the data has to be known and the data model has to be well defined. A data lake takes a different approach: neither the reason for storing the data nor the data model has to be defined up front. The two variants compare as follows:

+-----------+----------------------+-------------------+
|           | Data Warehouse       | Data Lake         |
+-----------+----------------------+-------------------+
| Data      | Structured           | Unstructured data |
| Schema    | Schema on write      | Schema on read    |
| Storage   | High-cost storage    | Low-cost storage  |
| Users     | Business analysts    | Data scientists   |
| Analytics | BI and visualization | Data Science      |
+-----------+----------------------+-------------------+
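
The "schema on read" row is easier to see in code. The following is a minimal sketch, not taken from the original post: it assumes a local Spark session, a hypothetical directory `/data/raw/events` holding raw JSON files that were stored as-is, and hypothetical field names `user_id` and `action`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object SchemaOnReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-on-read-sketch")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    // Hypothetical location of raw JSON files that were stored as-is.
    val rawPath = "/data/raw/events"

    // Schema on read: nothing was declared when the files were written,
    // so Spark infers a structure only now, at query time.
    val inferred = spark.read.json(rawPath)
    inferred.printSchema()

    // A different reader can apply its own view to the same files.
    // The field names below are hypothetical.
    val eventSchema = StructType(Seq(
      StructField("user_id", LongType, nullable = true),
      StructField("action", StringType, nullable = true)
    ))
    val typed = spark.read.schema(eventSchema).json(rawPath)
    typed.printSchema()

    spark.stop()
  }
}
```

In a schema-on-write system, by contrast, the structure would have to be fixed before the first record is stored.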

Creating a data lake with Delta Lake OSS

Now let’s apply that theoretical knowledge using Delta Lake OSS. Delta Lake is an open-source framework built on Apache Spark, used to retrieve, manage, and transform data in a data lake. Getting started is quite simple: you will need an Apache Spark project. First, add Delta Lake as an SBT dependency:

libraryDependencies += "io.delta" %% "delta-core" % "0.5.0"
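
For context, a fuller `build.sbt` might look like the sketch below. The Scala and Spark versions are assumptions, not something the original post specifies; pick the ones that match your cluster and check the Delta Lake release notes for compatibility.

```scala
// build.sbt (sketch; the version numbers other than delta-core are assumptions)
name := "delta-lake-getting-started" // hypothetical project name

scalaVersion := "2.12.10"

libraryDependencies ++= Seq(
  // Spark SQL provides SparkSession and the DataFrame API
  "org.apache.spark" %% "spark-sql" % "2.4.4",
  // Delta Lake itself, same coordinates as in the line above
  "io.delta" %% "delta-core" % "0.5.0"
)
```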

Saving data to Delta

Next, let’s create a first table. For this, you will need a Spark DataFrame, which can hold an arbitrary dataset or data read from another format, such as JSON or Parquet.

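// Create a DataFrame with a single "id" column holding 0 to 49
val data = spark.range(0, 50)
// Write it out as a Delta table at the given path
data.write.format("delta").save("/data/delta-table")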
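
The snippet above assumes a `spark` session already exists, as it does in `spark-shell`. As a rough sketch of the other case mentioned earlier, data read from another format, a standalone application could look like this. The input and output paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object JsonToDelta {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("json-to-delta")
      .master("local[*]") // local mode, for experimentation only
      .getOrCreate()

    // Hypothetical input: a directory of JSON files
    val events = spark.read.json("/data/raw/events")

    // Hypothetical output path for the new Delta table
    events.write.format("delta").save("/data/delta-events")

    spark.stop()
  }
}
```

Note that `save` uses Spark's default save mode, so it is expected to fail if the target path already contains data.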

Reading data from Delta

Reading the data is as simple as writing it. Just specify the path and the correct format, the same as you would with CSV or JSON data.

val df = spark.read.format("delta").load("/data/delta-table")
df.show()

Updating the data in Delta

Delta Lake OSS supports a range of update options, thanks to its ACID model. Let’s use that to run a batch update that overwrites the existing data. We do this with the following code:

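// New contents for the table: ids 0 to 99
val data = spark.range(0, 100)
// "overwrite" mode replaces the existing data at this path
data.write.format("delta").mode("overwrite").save("/data/delta-table")
// df is the DataFrame defined in the read step
df.show()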
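
Overwriting is only one of the update options. As a sketch of a more selective one, the `DeltaTable` API described in the Delta Lake quick start allows conditional updates and deletes. The example below is not from the original post: it assumes the table at `/data/delta-table` created above and an existing `spark` session, and the conditions themselves are arbitrary.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions.expr

// Open the existing Delta table by path
val deltaTable = DeltaTable.forPath(spark, "/data/delta-table")

// Add 100 to every even id (arbitrary example condition)
deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = Map("id" -> expr("id + 100")))

// Delete every row whose id is below 10 (arbitrary example condition)
deltaTable.delete(condition = expr("id < 10"))

// Inspect the result
deltaTable.toDF.show()
```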

Summary

I hope you found this post useful. If so, don’t hesitate to like or share it. You can also follow me on social media if you like :)

Sources:
https://docs.delta.io/latest/quick-start.html
https://aws.amazon.com/big-data/datalakes-and-analytics/what-is-a-data-lake/

Translated from: https://medium.com/swlh/delta-lake-and-data-lakes-getting-started-41ce957ed0da
