The Maturity of Data Engineers

Data Science and Machine Learning

What does a data engineer do?

Let’s start with three big terms that we need to understand before we can understand what a data engineer does.

Data mining, big data, and data pipelines.

Data mining means pre-processing data and extracting knowledge from it. Big data refers to datasets with a very large number of records and variables. These datasets are so large that they usually have to be processed on multiple computers or on a cloud platform such as AWS¹, Azure², or Google Cloud³, which provide the machines and storage needed to hold them.

Normally, big data is not stored on one machine, because the dataset has grown too big. Keeping it in a database such as MySQL or Postgres becomes complicated when that database runs on a single machine. Technologies such as Hadoop⁴ and NoSQL⁵ databases were created to solve this problem.

A data pipeline is essentially a pipeline that a data engineer builds. Before you can extract information from data using data mining, the data has to be in a usable state. Data engineers therefore build a pipeline that lets data flow from large amounts of unknown, raw data into a more useful form.

Data engineers essentially create a data pipeline that gathers information from many sources: IoT devices, mobile applications, web apps, cameras, cars, and anything else that collects data and stores or logs it on servers or in the cloud.

Data engineers accumulate all this information in well-organized databases and storage engines so that different parts of the company can create visualizations, monitor the performance of their product, get business insights, make business decisions from the data, and even use the data in their apps, for example for user profiles.

Before a company looks for a data scientist, machine learning expert, business intelligence specialist, or data analyst, it needs to hire a data engineer to build the pipeline. Data engineers handle the data collection part and bring the information in, organized and ready for data modeling. Usually, a machine learning engineer or data scientist doesn’t have to be concerned with the data pipeline.

In practice, data engineers start with a process called data ingestion, which collects data from various sources and ingests it into what we call a data lake. A data lake is a collection of raw data. However, we don’t want the lake to overflow or dry up, so we perform something called data transformation: converting data from one format to another, usually on its way into what we call a data warehouse.
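
The ingestion and transformation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "lake" is an in-memory list of JSON strings, the source names (`mobile_app`, `iot_sensor`) and field names are made up for the example, and the transformation simply converts JSON records into CSV text. A real pipeline would read from and write to actual storage systems.

```python
import csv
import io
import json


def ingest(sources):
    """Collect raw records from every source into one 'lake' (a list of JSON strings)."""
    lake = []
    for source_name, records in sources.items():
        for record in records:
            # Keep the payload untouched; only tag where it came from.
            lake.append(json.dumps({"source": source_name, "payload": record}))
    return lake


def transform_to_csv(lake, fields):
    """Convert raw JSON records into CSV text with a fixed set of columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["source"] + fields, lineterminator="\n")
    writer.writeheader()
    for line in lake:
        entry = json.loads(line)
        row = {"source": entry["source"]}
        for field in fields:
            # Missing fields become empty cells instead of raising an error.
            row[field] = entry["payload"].get(field, "")
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical sources, invented for this example.
sources = {
    "mobile_app": [{"user_id": 1, "event": "login"}],
    "iot_sensor": [{"device_id": "t-7", "temperature": 21.5}],
}
lake = ingest(sources)
csv_text = transform_to_csv(lake, ["user_id", "event"])
print(csv_text)
```

The lake keeps every record exactly as it arrived, while the transformation decides which columns survive. The sensor record has neither `user_id` nor `event`, so it ends up as a row with empty cells.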

A data warehouse is a place that holds accessible data that is useful to the business. Before placing data in a data warehouse, data engineers look at the raw data, select the parts that are useful, and load those into the warehouse so that other parts of the business can use them. A data lake, by contrast, can be thought of as a pool of raw data, so data lakes are usually less organized and less filtered than data warehouses.

The question is, why would businesses want to do that?

It’s a lot easier to analyze data when it’s organized. Data lakes may hold data that we don’t need, but the data warehouse only has to store the structured data that is useful, not everything, so we save on storage space there. Building data warehouse infrastructure is expensive, so this kind of data management saves money.

To review, a data engineer uses data engineering practices to build a pipeline that takes data from where it is produced and captured to where it is stored, so that it can be analyzed by data scientists and data analysts.

What kind of tools do data engineers use?

You may have heard of Apache Kafka⁶, Hadoop⁴, Amazon S3¹, or Azure Data Lake². These are tools built by engineers to move or hold large amounts of data, as in a data lake. There are also tools like Google BigQuery³, Amazon Redshift¹, and Amazon Athena¹. These are data warehouses and query services that allow engineers to run queries on or analyze structured data.
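
To give a feel for what querying structured data looks like, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in for a warehouse. SQLite is not BigQuery, Redshift, or Athena, and the `page_views` table and its rows are invented for the example; the point is only that cleaned, structured data can be summarized with one short SQL query.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id INTEGER, page TEXT, seconds REAL)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [(1, "home", 12.0), (1, "pricing", 30.0), (2, "home", 8.0)],
)

# A typical analyst question: how often is each page viewed, and for how long?
rows = conn.execute(
    "SELECT page, COUNT(*) AS views, AVG(seconds) AS avg_seconds "
    "FROM page_views GROUP BY page ORDER BY page"
).fetchall()
conn.close()

for page, views, avg_seconds in rows:
    print(page, views, avg_seconds)
```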

So far, we’ve seen that the data engineer creates this entire system for the business. They use different tools and programs to ingest data and then put it into a data lake or a data warehouse. As a data scientist or machine learning expert, which data do you use? Most of the time, you would be working with a data lake, because if you’re doing machine learning, the more information you have, the better.

With machine learning, you can go into a data lake and grab a large amount of structured or unstructured data to use for your models, whether in CSV or any other format. Data warehouses are usually used by business intelligence people, business analysts, or data analysts to build visualizations or analyze data, because a data warehouse usually holds more structured data that has already been cleaned.

As a data scientist, you can also use the data from a data warehouse. There is no strict rule here; usually you use whatever data is useful to you. Data scientists use as much valuable data as they can. In contrast, somebody like a business intelligence person or a data analyst works with data that has already been cleaned and processed by a data engineer, and uses something like a data warehouse to analyze it.

Something like Google’s BigQuery³ does precisely that. It allows somebody without much engineering or programming experience to analyze data in a data warehouse. Typically, software engineers, software developers, app developers, and mobile developers build the programs and apps that users and customers use. A data engineer then builds the pipelines that ingest the data these programs produce and store it in services like Hadoop⁴ or Google BigQuery³, so that the rest of the business can access it.

We also have data scientists, who use the data lake and the data warehouse to extract information and deliver business value. Finally, we have data analysts and business intelligence people, who use the data warehouse or other structured data to derive business value.

Nowadays, the industry is evolving fast, and there’s some overlap between these roles. Job descriptions may differ from one company to another, but these are general, simplified rules that you can use to understand the part each role plays in a company.

Conclusion

Data engineers have three main tasks. First, they build an extract, transform, load pipeline, also known as ETL. Unlike data ingestion, which simply means moving data from one place to another, an ETL pipeline is the idea that a data engineer extracts the data generated by all of these systems, transforms it into a useful form, and loads it into a data warehouse so that it can be used by the rest of the company. Data engineers use programming languages like Python, Go, Scala, and Java to accomplish these ETL jobs.
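
The three ETL stages can be sketched as three small Python functions. This is a minimal illustration, not a production design: the raw events are an in-memory list, the field names (`user_id`, `amount`) and the `purchases` table are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3


def extract(raw_events):
    """Extract: yield events produced by upstream systems (here, an in-memory list)."""
    for event in raw_events:
        yield event


def transform(events):
    """Transform: drop incomplete events and normalize the fields we keep."""
    for event in events:
        if "user_id" not in event or "amount" not in event:
            continue  # incomplete records stay out of the warehouse
        yield (int(event["user_id"]), round(float(event["amount"]), 2))


def load(rows, conn):
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS purchases (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?)", rows)
    conn.commit()


# Hypothetical raw events; the second one is missing its amount.
raw_events = [
    {"user_id": "1", "amount": "19.990"},
    {"user_id": "2"},
    {"user_id": "1", "amount": "5"},
]

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(raw_events)), conn)

row_count = conn.execute("SELECT COUNT(*) FROM purchases").fetchone()[0]
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM purchases GROUP BY user_id"
).fetchall()
print(row_count, totals)
```

Keeping each stage as its own function means the transformation rules can change without touching the code that reads from the sources or writes to the warehouse.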

Next, data engineers also build analysis tools to understand how the company’s systems work. A data engineer needs to make sure that when any part of the system breaks, they are notified. Data engineers give data scientists, data analysts, and business intelligence people the tools to analyze the data, and they ensure that the system they’ve put in place is running correctly.
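
One simple way to get notified when part of a pipeline breaks is to wrap each step and report any exception. The sketch below assumes a hypothetical `notify` callback; in practice it might send an email or post to a chat channel, but here it only appends to a list. The step names and the simulated failure are invented for the example.

```python
def run_with_alerts(steps, notify):
    """Run each named step in order; call notify(name, error) when one fails."""
    failed = []
    for name, step in steps:
        try:
            step()
        except Exception as error:
            failed.append(name)
            notify(name, error)
    return failed


def ingest_step():
    pass  # placeholder for real ingestion work


def broken_transform():
    raise ValueError("unexpected schema")  # simulated failure


def load_step():
    pass  # placeholder for real loading work


alerts = []
failed = run_with_alerts(
    [("ingest", ingest_step), ("transform", broken_transform), ("load", load_step)],
    notify=lambda name, error: alerts.append(f"{name} failed: {error}"),
)
print(failed, alerts)
```

This version keeps running the remaining steps after a failure so that every problem is reported in one pass. A pipeline whose later steps depend on earlier ones would more likely stop at the first error.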

Finally, their third main task is to maintain the data warehouse and data lakes, making sure that everything in them is accessible for other parts of the company to use.

Now you have a high-level overview of what a data engineer does. However, this landscape is changing fast because new tools are always popping up. So my advice is: don’t treat the tools mentioned here as an absolute must-know for all data engineers; instead, be aware that they exist. Go and read some of their documentation, but only learn or use them once the need arises, because they’re regularly updated and the world of data engineering is fast-paced right now.

About the Author

Wie Kiang is a researcher who is responsible for collecting, organizing, and analyzing opinions and data to solve problems, explore issues, and predict trends.

He works in almost every area of Machine Learning and Deep Learning, and carries out experiments and investigations in a range of areas, including Convolutional Neural Networks, Natural Language Processing, and Recurrent Neural Networks.

#1 Amazon Web Services
#2 Microsoft Azure
#3 Google Cloud
#4 Apache Hadoop
#5 NoSQL
#6 Apache Kafka

Translated from: https://towardsdatascience.com/the-maturity-of-data-engineers-2c9e2bfcee09
