Partnering for data quality

Author Vlad Rișcuția is joined for this article by co-authors Wayne Yim and Ayyappan Balasubramanian.

Why data quality?

Data quality is a critical aspect of ensuring high quality business decisions. One estimate puts the cost of poor data quality at $3.1 trillion per year for the United States alone, equating to approximately 16.5 percent of GDP.¹ For a business such as Microsoft, where data-driven decisions are ingrained within the fabric of the company, ensuring high data quality is paramount. Not only is data used to drive, steer, and grow the Microsoft business from a tactical and strategic perspective, but there are also regulatory obligations to produce accurate data for quarterly financial reporting.

History of DataCop

In the Experiences and Devices (E+D) division at Microsoft, a central data team called IDEAs (Insights Data Engineering and Analytics) generates key business metrics that are used to grow and steer the business. As one of its first undertakings, the team created the Office 365 Commercial Monthly Active User (MAU) measure to track the usage and growth of Office 365. This was a complicated endeavor due to the sheer scale of data, the number of Office products and services involved, and the heterogeneous nature of the data pipelines across different products and services. In addition, many other business metrics, tracking the growth and usage of all Office products and services, also needed to be created.

In the process of creating these critical business metrics, it was clear that generating them at scale and in a reliable way with high data quality was of the utmost importance, as key tactical and strategic business decisions would be based on them. In addition, because of the team’s charge to generate key metrics for release with quarterly earnings, producing high quality data was also a regulatory requirement.

The IDEAs team formed a data quality team consisting of program management, engineering, and data science representatives, and set out to investigate internal and external data quality solutions. The team examined internal data quality systems and researched public whitepapers from other companies that work with huge amounts of data. Members of the team also spent a considerable amount of time with LinkedIn, learning about its data quality system, "Data Sentinel,"² to potentially leverage what LinkedIn had built, as LinkedIn had already invested considerable time in developing Data Sentinel and is also part of Microsoft.

The vision for a data quality platform in IDEAs was that it would be extensible, scalable, able to work with the multiple data fabrics involved, and be leveraged by the wider data science community at Microsoft. For example, data scientists and data analysts should be able to write data quality checks in languages familiar to them such as Python, R, and Scala, among others, and have these data quality checks operate reliably at scale.

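As an illustration of that vision, a check authored in Python might look like the following minimal sketch. The dataset, column name, and threshold here are hypothetical examples, not a DataCop API.

```python
# A minimal, hypothetical completeness check of the kind a data scientist
# might author in Python; the dataset and column names are illustrative.
import pandas as pd

def check_completeness(df: pd.DataFrame, column: str, max_null_rate: float = 0.01) -> bool:
    """Pass if the null rate of `column` stays at or below the threshold."""
    return df[column].isna().mean() <= max_null_rate

usage = pd.read_parquet("office365_usage.parquet")  # hypothetical dataset
assert check_completeness(usage, "tenant_id"), "tenant_id null rate above 1 percent"
```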

Another key requirement was to have the data quality platform function as a DaaS, or “Data as a Service,” resulting in the need to apply the same “service rigor” in engineering, operations, and processes that were used to create and operate Office 365, the largest SaaS in the world. This meant having very high engineering standards around change management, monitoring, security controls, and auditability, and tightly integrating with Microsoft incident management systems to ensure that systems operate with high availability, efficiency, and security.

In the end, the team decided to build its own extensible data quality system from scratch so that it could function with the scale and reliability of a DaaS and interface with other internal Microsoft data systems. The initial functional specification was written in late 2018, and by early 2019 DataCop was born. Today, DataCop is part of the DataHub platform, which also consists of Data Build and Data Catalog. Data Build generates the datasets required by the business in a compliant and scalable way, and Data Catalog is a search store for all assets that surfaces metadata such as data quality scores from DataCop, as well as access and privacy information. Future articles will describe how Data Catalog and Data Build are used to generate the metrics and insights that drive, steer, and grow the E+D business and serve as critical components of the data quality journey.

Architecture

DataCop is designed with the mindset that no one team can solve this challenge on its own. The data ecosystem at Microsoft consists of multiple data fabrics, with data arriving at latencies ranging from minutes to a month. The system must be flexible and simple enough for other developers across Microsoft to add plugins and workers for whatever data fabrics or quality checks they want to build on. As a result, DataCop was built as a distributed message broker based on Azure Service Bus, with quality check results stored in Cosmos DB.

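To make the design concrete, here is a sketch of what a self-contained check message and its corresponding Cosmos DB result record could look like. The field names and values are illustrative assumptions, not DataCop's actual schema.

```python
# Hypothetical shapes for a check message (sent over Service Bus) and a
# result record (stored in Cosmos DB); all field names are assumptions.
check_message = {
    "checkId": "office365-mau-availability",
    "fabric": "AzureDataLakeStorage",  # routes to a fabric-specific worker queue
    "dataset": "office365/mau/2020-06-01",
    "checkType": "availability",
    "parameters": {"minFileSizeBytes": 1_000_000},
}

check_result = {
    "id": "office365-mau-availability/2020-06-01",
    "checkId": "office365-mau-availability",
    "status": "Failed",
    "observed": {"fileSizeBytes": 12_345},
    "runTimestamp": "2020-06-01T07:00:00Z",
}
```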

Messages in the message broker must be self-contained, allowing workers to process them independently. Messages can originate from the Orchestrator, which runs scheduled checks, or from an Azure Data Factory (ADF) pipeline itself. Every time a new data check or data fabric needs to be added, a developer can simply implement an override and develop their own worker process without affecting the rest of the system. The Azure team leveraged this to build on DataCop quickly, as described below.

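A sketch of that override pattern follows, with hypothetical class and method names: a base worker owns the queue plumbing, and each data fabric or check type subclasses a single extension point.

```python
# Hypothetical plugin pattern: the base class handles message plumbing,
# and each fabric or check type overrides run_check.
from abc import ABC, abstractmethod

class DataCopWorker(ABC):
    """Consumes self-contained check messages from a worker-specific queue."""

    @abstractmethod
    def run_check(self, message: dict) -> dict:
        """Execute one quality check and return a result record."""

class AdlsAvailabilityWorker(DataCopWorker):
    """Checks that a dataset landed in Azure Data Lake Storage at the expected size."""

    def run_check(self, message: dict) -> dict:
        size = self._get_file_size(message["dataset"])  # fabric-specific lookup
        passed = size >= message["parameters"]["minFileSizeBytes"]
        return {"checkId": message["checkId"], "status": "Passed" if passed else "Failed"}

    def _get_file_size(self, path: str) -> int:
        raise NotImplementedError  # would call the ADLS SDK in a real worker
```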

High-level architectural diagram of DataCop

Workers run today as Azure Web Jobs. They typically leverage another compute service in Azure, such as Azure Databricks or Azure SQL, to execute quality checks against the actual data. Workers themselves are lightweight and serve only to determine whether the checks are successful, which makes Azure Web Jobs a perfect fit for running them. For consistency, the Orchestrator is hosted as a web job as well: a time-triggered job that generates the sets of quality checks that need to be executed and puts them in the respective worker-specific service bus queues.

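Sketched with the public azure-servicebus and azure-cosmos Python packages, the flow might look as follows. The queue naming scheme, database names, and connection strings are placeholders; the real components are Azure Web Jobs, so this is only an illustrative outline.

```python
# Illustrative orchestration flow using the public azure-servicebus and
# azure-cosmos packages; names and connection strings are placeholders.
import json
from azure.servicebus import ServiceBusClient, ServiceBusMessage
from azure.cosmos import CosmosClient

def enqueue_due_checks(checks: list, servicebus_conn_str: str) -> None:
    """Time-triggered step: route each check to its worker-specific queue."""
    with ServiceBusClient.from_connection_string(servicebus_conn_str) as client:
        for check in checks:
            queue = f"datacop-{check['fabric'].lower()}"  # hypothetical naming scheme
            with client.get_queue_sender(queue_name=queue) as sender:
                sender.send_messages(ServiceBusMessage(json.dumps(check)))

def store_result(result: dict, cosmos_url: str, cosmos_key: str) -> None:
    """Worker step: persist the outcome of one quality check."""
    container = (
        CosmosClient(cosmos_url, credential=cosmos_key)
        .get_database_client("datacop")
        .get_container_client("check-results")
    )
    container.upsert_item(result)
```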

The next important part of any data quality system is alerting. All Microsoft services use IcM, the company-wide incident management system. Data alerts are not like service alerts: data arrives at higher latency than in typical services and can in some situations be recovered. If bad data needs to be restated, an incident can potentially stay open longer, until the data is restated. Alert suppression is therefore configured to handle a very different set of cases: data unavailable for x days due to an upstream issue should result in a single alert, and downstream alerts caused by a common upstream issue should be suppressed.

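Those suppression rules can be summarized in a small sketch. The logic below is an illustrative simplification; the actual IcM integration is internal to Microsoft.

```python
# Simplified, hypothetical suppression logic: one alert per failing check
# regardless of how many days it fails, and no alert at all when the failure
# traces to an upstream issue that has already been alerted on.
def should_alert(failure: dict, open_alerts: set, alerted_upstream: set) -> bool:
    if failure["upstreamDataset"] in alerted_upstream:
        return False  # downstream symptom of a known upstream issue: suppress
    if failure["checkId"] in open_alerts:
        return False  # the incident is already open: don't re-alert daily
    open_alerts.add(failure["checkId"])
    return True
```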

This is a good place to touch upon another important topic in the data quality landscape: anomaly detection. Data volumes and metrics change often and are prone to seasonality. An anomaly detection system that can handle seasonality enables a move away from monitoring raw data volumes and daily trends toward a more sophisticated approach. DataCop leverages the Azure Anomaly Detector APIs to monitor completeness statistics such as file size, as well as a few key metrics along multiple dimensions. This is a work in progress, with further updates to come.

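As a rough sketch of this integration, a completeness series such as daily file sizes can be posted to the Anomaly Detector REST API. The endpoint version, request shape, and credentials below are assumptions based on the public v1.0 API, not DataCop's internal wiring.

```python
# Hypothetical call to the Azure Anomaly Detector REST API (public v1.0
# surface) on a daily file-size series; endpoint and key are placeholders.
import requests

def detect_file_size_anomalies(series: list, endpoint: str, key: str) -> list:
    """series items look like {"timestamp": "2020-06-01T00:00:00Z", "value": 123456}."""
    resp = requests.post(
        f"{endpoint}/anomalydetector/v1.0/timeseries/entire/detect",
        headers={"Ocp-Apim-Subscription-Key": key},
        json={"series": series, "granularity": "daily"},
    )
    resp.raise_for_status()
    return resp.json()["isAnomaly"]  # one boolean per point in the series
```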

Data quality score for data assets in the DataCop user interface

It was apparent that developers needed a way to quickly author and deploy data quality checks. As a result, we integrated with the Azure DevOps workflow to automatically deploy these data quality monitors. Today, the IDEAs team runs close to 2,000 tests on about 750 key datasets, including externally reported financial metrics.

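For a sense of what a declarative, source-controlled monitor could look like before a pipeline deploys it, here is a hypothetical definition; the schema is illustrative, not DataCop's actual format.

```python
# Hypothetical declarative test definition, of the kind that could live in
# source control and be deployed automatically by an Azure DevOps pipeline.
test_definition = {
    "name": "office365-mau-freshness",
    "fabric": "AzureDataLakeStorage",
    "dataset": "office365/mau",
    "schedule": "0 7 * * *",  # daily, after the data is expected to land
    "checks": [
        {"type": "availability", "maxLatencyHours": 24},
        {"type": "completeness", "minFileSizeBytes": 1_000_000},
    ],
    "alerting": {"icmSeverity": 3, "suppressDownstream": True},
}
```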

Partnership between M365 and Azure

The Customer Growth and Analytics team (CGA) is a centralized data science team in the Cloud+AI division at Microsoft. The team’s mission is to learn from customers and empower them to make the most of Azure services.³

Last year, as CGA’s scope was growing, an effort began to standardize technologies. Having a smaller number of technologies upon which CGA’s data platform is built makes it easier to move engineering resources as needed, share knowledge, and in general increase the reliability of the overall system. The use of Azure PaaS offerings reduced the need for writing custom code. The team standardized on Azure Data Factory for data movement and Azure Monitor for monitoring, among others. Unfortunately, at this writing, Azure doesn’t offer a PaaS data quality testing framework.

CGA realized the need for a reliable and scalable data quality solution, especially as the data platform evolved to support more and more production workloads where data issues can have large impacts, and so evaluated multiple options.

CGA tried out several data quality testing solutions with the code base, but quickly realized they were built for smaller projects, made some rigid assumptions, and would require significant investment to scale out to cover the entire platform.

Discussions with other data science organizations within the company about how they were handling this led to LinkedIn and an introduction to Data Sentinel. Its main limitation is that it runs exclusively on Spark, while CGA must support multiple data fabrics. In some cases, different compute scenarios require the best solution for the specific job, such as Azure Data Explorer for analytics or Azure Data Lake Storage and Azure Machine Learning for ML workloads. In other cases, data ingested from other teams comes from a variety of storage locations: Azure SQL, blob storage, and Azure Data Lake Storage gen1, among others.

Further outreach brought discussions with the M365 data science team and an introduction to DataCop, the solution described in this article. Its capabilities were compelling: test scheduling, integration with the standard Microsoft alerting platform, and a declarative way of describing tests. Its main limitation was that DataCop didn't support Azure Data Explorer.

Because Azure Data Explorer (ADX) is core to CGA’s platform, this could have been a showstopper, but in true One Microsoft spirit, the DataCop team was more than happy to work with CGA to light up the missing capability. The teams agreed to treat this as an “internal open source” project, with CGA contributing code to the DataCop solution from which both teams could benefit. Due to its flexible design, adding ADX capabilities was significantly easier than the alternative (investing in a home-grown solution).

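A sketch of what such an ADX test runner might do, using the public azure-kusto-data package: run a Kusto query that computes a quality statistic and compare it to a threshold. The cluster, database, query, and threshold are placeholders rather than CGA's actual checks.

```python
# Illustrative Azure Data Explorer (Kusto) check using the public
# azure-kusto-data package; cluster, database, and query are placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

def run_adx_check(cluster_url: str, database: str) -> bool:
    kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(cluster_url)
    client = KustoClient(kcsb)
    query = "Usage | where Timestamp > ago(1d) | count"
    row_count = client.execute(database, query).primary_results[0][0]["Count"]
    return row_count > 0  # a real check would compare against configured bounds
```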

DataCop extended with Azure Data Explorer support.

CGA deployed an instance of DataCop in its environment and over the following months made a big data quality push, including training the team on how to author tests and increasing test coverage to 100 percent of the datasets in CGA's platform. At the time of writing, CGA has around 400 tests covering close to 300 key datasets. Over the past 30 days, CGA ran more than 4,000 tests, identifying and quickly acting to mitigate multiple data issues that would have caused significant anomalies in CGA's systems. Onboarding DataCop saved significant engineering effort, which was refocused on test authoring.

Closing thoughts/summary

This article described DataCop, the data quality solution developed by the M365 data team in partnership with the Azure data team.

  • Data quality is a critical aspect of a business, both for informing decisions and for regulatory obligations.
  • The diverse data fabrics in use and their huge scale led to the development of DataCop, a data quality solution for supporting the Microsoft business.
  • DataCop is a cloud-native Azure solution, consisting of a set of web jobs that communicate via service bus.
  • The plug-in architecture allowed the CGA team to quickly develop an Azure Data Explorer test runner and expand the scope of DataCop from the M365 team to also cover the Azure business.
  • Today, DataCop runs hundreds of tests every day to ensure the quality of data throughout multiple systems on both teams.

Vlad Rișcuția is on LinkedIn.

[1] The Four V’s of Big Data, IBM, 2016.

[2] Data Sentinel: Automating Data Validation, LinkedIn, March 2020.

[3] Using Azure to Understand Azure, by Ron Sielinski, January 2020.

Source: https://medium.com/data-science-at-microsoft/partnering-for-data-quality-dc9123557f8b
