It is incredible how fast data processing tools and technologies are evolving, and with them the nature of the data engineering discipline is changing as well. The tools I use today are very different from what I used ten or even five years ago; however, many of the lessons I learned are still relevant today.

I started working in the data space long before data engineering became a thing and data scientist became the sexiest job of the 21st century. I ‘officially’ became a big data engineer six years ago, and I know firsthand the challenges that developers with a background in “traditional” data development face on this journey. Of course, this transition is not easy for software engineers either; it is just different.

Even though technologies keep changing (and that’s the reality for anyone working in the tech industry), some of the skills I had to learn are still relevant but often overlooked by data developers who are just starting to make the transition to data engineering. These are usually the skills that software developers take for granted.

In this post, I will talk about the evolution of data engineering and the skills “traditional” data developers might need to learn today (hint: it is not Hadoop).

The birth of the data engineer.

Before the Big Data craze, data teams were composed of business intelligence and ETL developers. Typical BI / ETL developer activities involved moving data sets from location A to location B (ETL) and building web-hosted dashboards with that data (BI). Specialised technologies existed for each of those activities, with the knowledge concentrated within the IT department. Apart from that, however, BI and ETL development had very little to do with software engineering, a discipline that was maturing rapidly at the beginning of the century.

As data volumes grew and interest in data analytics increased over the past ten years, new technologies were invented. Some of them died and others became widely adopted, which in turn changed the demand for skills and the structure of teams. As modern BI tools allowed analysts and business people to create dashboards with minimal support from IT teams, data engineering became a new discipline, applying software engineering principles to ETL development using a new set of tools.

The challenges.

Photo by Alexander Dummer from Pexels

Creating a data pipeline may sound easy, but at big data scale it meant bringing together a dozen different technologies (or more!). A data engineer had to understand a myriad of technologies in depth, pick the right tool for the job and write code in Scala, Java or Python to create resilient and scalable solutions. A data engineer had to know their data to be able to create jobs that benefit from the power of distributed processing. A data engineer had to understand the infrastructure to be able to identify the reasons for failed jobs.

Conceptually, many of those data pipelines were typical ETL jobs: collecting data sets from a number of data sources, putting them in a centralised data store ready for analytics and transforming them for business intelligence or machine learning. However, “traditional” ETL developers didn’t have the necessary skills to perform these tasks in the Big Data world.

Is it still the case today?

I have reviewed many articles describing what skills data engineers should have. Most of them advise learning technologies like Hadoop, Spark, Kafka, Hive, HBase, Cassandra, MongoDB, Oozie, Flink and ZooKeeper, and the list goes on.

While I agree that it won’t hurt to know these technologies, I find that in many cases today, in 2020, it is enough to “know about them”: what particular use cases they are designed to solve, where they should or shouldn’t be used and what the alternatives are. Rapidly evolving cloud technology has given rise to a huge range of cloud-native applications and services in recent years. In the same way that modern BI tools made data analysis more accessible to the wider business several years ago, the modern cloud-native data stack simplifies data ingestion and transformation tasks.

I do not think that technologies like Apache Spark will become any less popular in the next few years, as they are great for complex data transformations.

Still, the high rate of adoption of cloud data warehouses such as Snowflake and Google BigQuery indicates that they provide certain advantages. One of them is that Spark requires highly specialised skills, whereas ETL solutions on top of cloud data platforms rely heavily on SQL skills even for big data, and such roles are much easier to fill.

What skills do data developers need to have?

Photo by Pixabay from Pexels

BI / ETL developers usually possess a strong understanding of database fundamentals, data modelling and SQL. These skills are still valuable today and mostly transferable to a modern data stack, which is leaner and easier to learn than the Hadoop ecosystem.

Below are three areas where I often observe “traditional” data developers having gaps in their knowledge because, for a long time, they didn’t have the tools and approaches that software engineers did. Understanding and fixing those gaps will not take a lot of time, but might make the transition to a new set of tools much smoother.

1. Version control (Git) and an understanding of CI/CD pipelines

SQL code is code, and as such, software engineering principles should be applied.

  • It is important to know who changed the code, when and why
  • Code should come with tests that can be run automatically
  • Code should be easily deployable to different environments
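
To make the second point concrete, here is a minimal sketch of an automatically runnable test for a SQL transformation. It is an illustration only: the `orders` table, the query and the function names are hypothetical, and an in-memory SQLite database stands in for whatever warehouse is actually used.

```python
import sqlite3

# Hypothetical transformation under test: revenue per customer per day.
DAILY_REVENUE_SQL = """
SELECT customer_id, order_date, SUM(amount) AS revenue
FROM orders
GROUP BY customer_id, order_date
ORDER BY customer_id, order_date
"""


def make_test_db() -> sqlite3.Connection:
    """Build an in-memory database holding a small, known fixture."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, "2020-01-01", 10.0),
            (1, "2020-01-01", 5.0),
            (2, "2020-01-02", 7.5),
        ],
    )
    return conn


def run_transformation(conn: sqlite3.Connection) -> list:
    """Run the SQL transformation and return its rows."""
    return conn.execute(DAILY_REVENUE_SQL).fetchall()


def test_daily_revenue() -> None:
    """The two orders of customer 1 on the same day collapse into one row."""
    rows = run_transformation(make_test_db())
    assert rows == [(1, "2020-01-01", 15.0), (2, "2020-01-02", 7.5)]


if __name__ == "__main__":
    test_daily_revenue()
    print("ok")
```

Because the query lives in version control next to its test, a CI job can run the test on every change before the code is deployed to another environment.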

I am a big fan of DBT, an open-source tool which brings software engineering best practices to the SQL world and simplifies all these steps. It does much more than that, so I strongly advise checking it out.

2. A good understanding of the modern cloud data analytics stack

We tend to stick with the tools we know because they often make us more productive. However, many of the challenges we face are not unique, and today they can often be solved more efficiently.

It might be intimidating to navigate the cloud ecosystem at first. One workaround is to learn from other companies’ experiences.

Many successful startups are very open about their data stack and the lessons they learnt on their journey. These days, it is common to adopt a cloud data warehouse along with several other components for data ingestion (such as Fivetran or Segment) and data visualisation. Seeing a few architectures is usually enough to get a 10,000-foot view and know what to research further when needed; for example, dealing with events or streaming data might be an entirely new concept.

3. Knowing a programming language in addition to SQL

As much as I love Scala, Python seems to be a safe bet to start with today. It is reasonably easy to pick up, loved by data scientists and supported by pretty much all components of cloud ecosystems. SQL is great for many data transformations, but sometimes it is easier to parse a complex data structure with Python before ingesting it into a table, or to use Python to automate specific steps in the data pipeline.
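
As a rough illustration of parsing a nested structure before ingestion, the sketch below flattens one JSON document into rows and loads them into a table. The payload shape, the `order_items` table and the function names are all made up for this example, and SQLite is used only so that the snippet runs on its own.

```python
import json
import sqlite3

# Hypothetical raw payload: one order with a nested customer and item list.
RAW_ORDER = """
{
  "order_id": "A-1",
  "customer": {"id": 42, "country": "NZ"},
  "items": [
    {"sku": "book", "qty": 2, "price": 12.5},
    {"sku": "pen", "qty": 10, "price": 0.9}
  ]
}
"""


def flatten_order(raw: str) -> list:
    """Turn one nested order document into flat rows, one per item."""
    order = json.loads(raw)
    return [
        (
            order["order_id"],
            order["customer"]["id"],
            item["sku"],
            item["qty"],
            item["price"],
        )
        for item in order["items"]
    ]


def load_rows(conn: sqlite3.Connection, rows: list) -> None:
    """Create the target table if needed and insert the flattened rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS order_items "
        "(order_id TEXT, customer_id INTEGER, sku TEXT, qty INTEGER, price REAL)"
    )
    conn.executemany("INSERT INTO order_items VALUES (?, ?, ?, ?, ?)", rows)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_rows(connection, flatten_order(RAW_ORDER))
    print(connection.execute("SELECT * FROM order_items").fetchall())
```

Once the data is in a flat table, the remaining transformations can go back to being plain SQL.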

This is not an exhaustive list, and different companies might require different skills, which brings me to my last point …

The role of a data engineer is changing.

Photo by Christina Morillo from Pexels

Data processing tools and technologies have evolved massively over the last few years. Many of them have evolved to the point where they can easily scale as the data volume grows while also working well with “small data”. That can significantly simplify both the data analytics stack and the skills required to use it.

Does it mean that the role of a data engineer is changing? I think so. It doesn’t mean it gets easier: business demands grow as technology advances. However, it seems that this role might become more specialised or split into a few different disciplines.

New tools allow data engineers to focus on core data infrastructure, performance optimisation, custom data ingestion pipelines and overall pipeline orchestration. At the same time, data transformation code in those pipelines can be owned by anyone who is comfortable with SQL. For example, analytics engineering is starting to become a thing. This role sits at the intersection of data engineering and data analytics and focuses on data transformation and data quality. Cloud data warehouse engineering is another one.

Regardless of whether the distinction in job titles becomes widely adopted, I believe that “traditional” data developers possess many of the fundamental skills needed to succeed in many data engineering activities today; strong SQL and data modelling are some of them. By understanding the modern cloud analytics data stack and how its components can be combined, learning a programming language and getting used to version control, they can make this transition reasonably seamless.

Original article: https://towardsdatascience.com/data-engineering-in-2020-e46910786eda

