Why I built an open-source tool for big data testing and quality control

I’ve developed an open-source data testing and quality tool called data-flare. It aims to help data engineers and data scientists assure the quality of large datasets using Spark. In this post I’ll share why I wrote this tool, why the existing tools weren’t enough, and how it may be helpful to you.

Who spends their evenings writing a data quality tool?

In every data-driven organisation, we must recognise that data is useless unless we can be confident in its quality. Despite that, there are relatively few tools available to help us keep our data quality high.

What I was looking for was a tool that:

Helped me write high-performance checks on the key properties of my data, like the size of my datasets, the percentage of rows that comply with a condition, or the distinct values in my columns
Helped me track those key properties over time, so that I can see how my datasets are evolving and spot problem areas easily
Enabled me to write more complex checks on facets of my data that weren’t simple to express as a property, and to compare different datasets
Would scale to huge volumes of data

The tools that I found were more limited, constraining me to simpler checks defined in YAML or JSON, or only letting me check simpler properties on a single dataset. I wrote data-flare to fill in these gaps and provide a one-stop shop for our data quality needs.

Show me the code

data-flare is a Scala library built on top of Spark. It means you will need to write some Scala, but I’ve tried to keep the interface simple so that even a non-Scala developer could quickly pick it up.

Let’s look at a simple example. Imagine we have a dataset containing orders, with the following attributes:

CustomerId
OrderId
ItemId
OrderType
OrderValue

We can represent this as a Dataset[Order] in Spark, with Order defined as:

case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)

Checks on a single dataset

We want to check that our orders are all in order, including checking:

orderType is “Sale” for at least 90% of orders
Orders with an orderType of “Refund” have an order value of less than 0
There are 20 different items that we sell, and we expect orders for each of those
We have at least 100 orders

We can do this as follows (here orders represents our Dataset[Order]):

val ordersChecks = ChecksSuite(
  "orders",
  singleDsChecks = Map(
    DescribedDs(orders, "orders") -> Seq(
      // orderType is "Sale" for at least 90% of rows
      SingleMetricCheck.complianceCheck(
        AbsoluteThreshold(0.9, 1),
        ComplianceFn(col("orderType") === "Sale")
      ),
      // every "Refund" row has a negative orderValue
      SingleMetricCheck.complianceCheck(
        AbsoluteThreshold(1, 1),
        ComplianceFn(col("orderValue") < 0),
        MetricFilter(col("orderType") === "Refund")
      ),
      // at least 20 distinct itemIds
      SingleMetricCheck.distinctValuesCheck(
        AbsoluteThreshold(Some(20), None),
        List("itemId")
      ),
      // at least 100 rows
      SingleMetricCheck.sizeCheck(AbsoluteThreshold(Some(100), None))
    )
  )
)

As you can see from this code, everything starts with a ChecksSuite. You can then pass in all of your checks that operate on single datasets using the singleDsChecks. We’ve been able to do all of these checks using SingleMetricChecks — these are efficient and perform all checks in a single pass over the dataset.
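To make the "single pass" idea concrete, here is a minimal sketch in plain Scala that computes the same four quantities (size, sale compliance, refund compliance, distinct items) in one traversal of an in-memory collection. It is only an illustration of the technique: it does not use data-flare or Spark, and the names Metrics and SinglePassMetrics are made up for this example. The conventions for empty inputs (0.0 compliance for an empty dataset, 1.0 when there are no refunds) are assumptions of the sketch.

```scala
// Plain-Scala illustration only: data-flare computes its metrics on Spark,
// not on an in-memory Seq. All names other than Order are hypothetical.
final case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)

// Accumulates everything we want to know during a single traversal.
final case class Metrics(size: Long, sales: Long, refunds: Long, negativeRefunds: Long, itemIds: Set[String]) {
  // Fraction of all orders that are sales (0.0 for an empty dataset, by assumption).
  def saleCompliance: Double = if (size == 0) 0.0 else sales.toDouble / size
  // Fraction of refunds with a negative value (1.0 when there are no refunds, by assumption).
  def refundCompliance: Double = if (refunds == 0) 1.0 else negativeRefunds.toDouble / refunds
  def distinctItems: Int = itemIds.size
}

object SinglePassMetrics {
  // One foldLeft updates every counter, so the data is read exactly once.
  def compute(orders: Seq[Order]): Metrics =
    orders.foldLeft(Metrics(0, 0, 0, 0, Set.empty)) { (m, o) =>
      val isRefund = o.orderType == "Refund"
      m.copy(
        size = m.size + 1,
        sales = m.sales + (if (o.orderType == "Sale") 1 else 0),
        refunds = m.refunds + (if (isRefund) 1 else 0),
        negativeRefunds = m.negativeRefunds + (if (isRefund && o.orderValue < 0) 1 else 0),
        itemIds = m.itemIds + o.itemId
      )
    }
}
```

Each check then reduces to comparing one field of the result against a threshold, for example saleCompliance >= 0.9 or distinctItems >= 20.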

What if we wanted to do something that we couldn’t easily express with a metric check? Let’s say we wanted to check that no customer had more than 5 orders with an orderType of “Flash Sale”. We could express that with an Arbitrary Check like so:

ArbSingleDsCheck("no more than 5 flash sales per customer") { ds =>
  // Count the customers who have more than 5 "Flash Sale" orders
  val tooManyFlashSaleCustomerCount = ds
    .filter(col("orderType") === "Flash Sale")
    .groupBy("customerId")
    .agg(count("orderId").as("flashSaleCount"))
    .filter(col("flashSaleCount") > 5)
    .count
  if (tooManyFlashSaleCustomerCount > 0)
    RawCheckResult(CheckStatus.Error, s"$tooManyFlashSaleCustomerCount customers had too many flash sales")
  else
    RawCheckResult(CheckStatus.Success, "No customers had more than 5 flash sales :)")
}

The ability to define arbitrary checks in this way gives you the power to define any check you want. They won’t be as efficient as the metric based checks, but the flexibility you get can make it a worthwhile trade-off.
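Because an arbitrary check is just a function over the dataset, it can help to pin down its logic on a handful of in-memory rows before running it on Spark. The following is a rough plain-Scala sketch of the same flash-sale rule; FlashSaleCheck and offenders are hypothetical names used only here, and nothing in it is part of data-flare.

```scala
// Plain-Scala illustration only; not the data-flare or Spark API.
final case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)

object FlashSaleCheck {
  // Returns each customer who has more than `limit` "Flash Sale" orders,
  // together with how many such orders they have.
  def offenders(orders: Seq[Order], limit: Int = 5): Map[String, Int] =
    orders
      .filter(_.orderType == "Flash Sale")
      .groupBy(_.customerId)
      .map { case (customerId, flashSales) => customerId -> flashSales.size }
      .filter { case (_, flashSaleCount) => flashSaleCount > limit }
}
```

An empty result corresponds to the Success branch of the check, and a non-empty one to the Error branch.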

Checks on a pair of datasets

Let’s imagine we have a machine learning algorithm that predicts which item each customer will order next. It returns another Dataset[Order] containing the predicted orders.

We may want to compare metrics on our predicted orders with metrics on our original orders. Let’s say that we expect to have an entry in our predicted orders for every customer that has had a previous order. We could check this using Flare as follows:

val predictedOrdersChecks = ChecksSuite(
  "orders",
  dualDsChecks = Map(
    DescribedDsPair(DescribedDs(orders, "orders"), DescribedDs(predictedOrders, "predictedOrders")) ->
      Seq(
        DualMetricCheck(
          CountDistinctValuesMetric(List("customerId")),
          CountDistinctValuesMetric(List("customerId")),
          "predicted orders present for every customer",
          MetricComparator.metricsAreEqual
        )
      )
  )
)

We can pass in dualDsChecks to a ChecksSuite. Here we describe the datasets we want to compare, the metrics we want to calculate for each of those datasets, and a MetricComparator which describes how those metrics should be compared. In this case we want the number of distinct customerIds in each dataset to be equal.
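As a rough illustration of what this comparison computes, the plain-Scala sketch below calculates the distinct customerId count for each collection and compares the two numbers. It does not use data-flare or Spark; DualMetricSketch and its methods are hypothetical names, and the local metricsAreEqual only mirrors the idea of the comparator above. The sketch also shows one limitation worth keeping in mind: equal counts do not prove that the two sets of customers are the same, so a stricter variant (missingPredictions) is included for contrast.

```scala
// Plain-Scala illustration only; not the data-flare or Spark API.
final case class Order(customerId: String, orderId: String, itemId: String, orderType: String, orderValue: Int)

object DualMetricSketch {
  // The metric: number of distinct customer IDs in a dataset.
  def distinctCustomers(orders: Seq[Order]): Int =
    orders.map(_.customerId).distinct.size

  // The comparator: here, simple equality of the two metric values.
  def metricsAreEqual(a: Int, b: Int): Boolean = a == b

  // The dual check: compute the metric on each dataset, then compare.
  def check(orders: Seq[Order], predicted: Seq[Order]): Boolean =
    metricsAreEqual(distinctCustomers(orders), distinctCustomers(predicted))

  // A stricter alternative: customers with an order but no prediction.
  def missingPredictions(orders: Seq[Order], predicted: Seq[Order]): Set[String] =
    orders.map(_.customerId).toSet.diff(predicted.map(_.customerId).toSet)
}
```

With customers c1 and c2 in the original orders and c1 and c3 in the predictions, the count comparison passes even though c2 has no prediction. A set-based rule like missingPredictions is the kind of logic that would suit an arbitrary check on a pair of datasets.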

What happens when you run your checks?

When you run your checks, all metrics are calculated in a single pass over each dataset, and the check results are calculated and returned. You can then decide for yourself how to handle those results. For example, if one of your checks gives an error you could fail the Spark job or send a failure notification.
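One possible way to act on results is sketched below in plain Scala. The Status and Result types are hypothetical stand-ins defined only for this example; data-flare's actual result types are not shown in this post and may look different. The alert function is likewise an assumption, standing in for whatever notification channel you use.

```scala
// Hypothetical result model for illustration; not data-flare's own types.
sealed trait Status
object Status {
  case object Success extends Status
  case object Error extends Status
}
final case class Result(checkName: String, status: Status, message: String)

object ResultHandling {
  // Keep only the checks that ended in an error.
  def failures(results: Seq[Result]): Seq[Result] =
    results.filter(_.status == Status.Error)

  // Send one alert per failed check, then fail the job by throwing.
  def enforce(results: Seq[Result], alert: String => Unit): Unit = {
    val failed = failures(results)
    failed.foreach(r => alert(s"${r.checkName}: ${r.message}"))
    if (failed.nonEmpty)
      throw new IllegalStateException(s"${failed.size} data quality check(s) failed")
  }
}
```

Throwing from the driver is one simple way to make a Spark job exit with a failure. Whether to fail the job or only notify is a policy decision that depends on how critical each check is.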

What else can you do?

Store your metrics and check results by passing a metricsPersister and a qcResultsRepository to your ChecksSuite (Elasticsearch is supported out of the box, and it’s extendable to support any data store)
Graph metrics over time in Kibana so you can spot trends
Write arbitrary checks for pairs of datasets

For more information check out the documentation and the code!

Translated from: https://medium.com/swlh/why-i-built-an-opensource-tool-for-big-data-testing-and-quality-control-182a14701e8d
