前置交换机数据交换

The DNC Data Science team builds and manages dozens of models that support a broad range of campaign activities. Campaigns rely on these model scores to optimize contactability, volunteer recruitment, get-out-the-vote, and many other pieces of modern campaigning. One of our responsibilities is to deliver the best available model scores in an accessible, actionable form.

DNC数据科学团队构建和管理数十种模型，以支持广泛的竞选活动。竞选活动依靠这些模型评分来优化联系能力，志愿者招募，投票表决和现代竞选活动的许多其他方面。我们的职责之一是以可访问，可操作的形式提供最佳的可用模型评分。

As part of Phoenix, the DNC’s data warehouse, we developed infrastructure that keeps our focus on delivering products to win elections instead of on ever-growing technical complexity. In this post, we’ll walk through the infrastructure that manages over 70 billion (and counting!) model scores for the country’s 200+ million registered voters.

作为DNC数据仓库Phoenix的一部分，我们开发了基础架构，使我们始终专注于交付赢得选举的产品，而不是不断增长的技术复杂性。在这篇文章中，我们将介绍为该国200亿以上注册选民管理的700亿(甚至更多)模型评分的基础架构。

挑战 (The Challenge)

At this point in the 2020 cycle, we’re managing about 20 different models. These models come from a mix of sources (our internal modeling infrastructure and multiple vendor syncs), and are a mix of regression, binary classification, and multi-class classification model types. Across those 20 models, we have around 80 distinct model versions that comprise more than 70 billion point estimates.

在2020年周期的这一点上，我们正在管理约20种不同的模型。这些模型来自多种来源(我们的内部建模基础结构和多个供应商同步)，并且是回归，二进制分类和多分类分类模型类型的混合。在这20个模型中，我们有大约80个不同的模型版本，其中包括超过700亿个点估计。

So, how do we get from the complexity of mixing so many model sources, model types, and versions-per-model to the clean, accessible set of scores our users can seamlessly integrate into their campaign programs?

那么，如何从混合这么多模型源，模型类型和每个模型版本的复杂性，到用户可以无缝地集成到他们的广告系列计划中的干净，可访问的分数集，如何变得复杂呢？

模型分数交换所 (A Clearinghouse for Model Scores)

Our solution is a pair of carefully designed tables for model versions and model scores, and an accompanying code base to cleanly manage model and score life-cycles.

我们的解决方案是为模型版本和模型评分精心设计的一对表格，以及用于干净地管理模型和评分生命周期的随附代码库。

Together, these tables are a clearinghouse for model score publication. Just as a financial clearinghouse ensures a clean exchange between parties in a transaction, our model score clearinghouse sits between a model’s source data and its downstream pipelines to ensure a clean hand-off from one to the other.

这些表格一起构成了模型评分发布的交换所 。就像金融票据交换所确保交易双方之间的干净交换一样，我们的模型分数票据交换所也位于模型的源数据及其下游管道之间，以确保从一个人到另一个人的彻底交接。

模型版本分类帐 (Model Version Ledger)

The first table of our clearinghouse, model_versions, keeps metadata on model versions. Vendor-sourced model version metadata is merged to this table as part of loading processes. Models we score in-house have their versions checked against and merged into this table as part of every scoring job. With many models and versions spread across our small data science team, we are thrilled to have this bookkeeping maintained programmatically.

我们的票据交换所的第一个表model_versions保留了模型版本的元数据。供应商来源的模型版本元数据在加载过程中会合并到此表中。我们内部评分的模型会对照其版本进行检查，并作为每次评分工作的一部分合并到此表中。我们的小型数据科学团队拥有许多模型和版本，我们很高兴以编程方式维护此簿记。

权威的模型预测表 (An Authoritative Table of Model Predictions)

The second table, scores, holds, well, scores. As we load incoming model scores, we tag both the model version and scoring job that generated them. For multi-class classification, we store scores in a normalized form and note the predicted class label in the score_name column.

第二张表， scores ，保持得分。加载传入的模型评分时，我们会同时标记模型版本和生成评分的评分工作。对于多类别分类，我们以标准化形式存储分数，并在score_name列中注明预测的类别标签。

A key part of this table’s architecture is that it’s partitioned by the model version’s date. This keeps all of the scores for a given model version on the same partition, allowing us to query just the few gigabytes of data for the scores we need instead of scanning the entire multi-terabyte table.

该表的体系结构的关键部分是按模型版本的日期进行分区。这样可以将给定模型版本的所有分数保留在同一分区上，从而使我们可以仅查询几GB数据以获得所需的分数，而不用扫描整个多TB的表。

Second table of our clearinghouse, scores.

Once the new scores have passed automated checks, their current_score_flag is flipped to TRUE and their datetime_approved field is set to the current timestamp. In the same step, previous scores of the same model_version that overlap with the new scores by external_id are flipped to FALSE and have their datetime_deprecated set to the current timestamp.

新分数通过自动检查后，它们的current_score_flag会翻转为TRUE并且datetime_approved字段会设置为当前时间戳。在相同的步骤，相同的分数以前model_version与新成绩通过重叠external_id翻转到FALSE ，并有自己的datetime_deprecated设置为当前的时间戳。

This operation makes the current_score_flag an authoritative marker for which scores are current within the model_run_id, a key assumption for when it’s time to materialize the scores for downstream use.

此操作使current_score_flag成为权威性标记，其分数在model_run_id内为当前分数，这是何时将分数具体化以供下游使用的关键假设。

最后-简化模型版本的操作 (Finally — Simplified Operations on Model Versions)

All of the bookkeeping in the tables above pays huge dividends when it comes time to query our model scores. This short script extracts the current scores for the current production model version of “My Model” with only the name of the model as an input!

上表中的所有簿记都是在查询我们的模型分数时要付出的巨大努力。这个简短的脚本仅使用模型名称作为输入来提取当前生产模型版本“My Model”的当前分数！

DECLARE CURRENT_MODEL_VERSION_DATE DATE;
DECLARE CURRENT_MODEL_RUN_ID STRING;-- check the model version ledger for the current model metadata
SET (CURRENT_MODEL_VERSION_DATE, CURRENT_MODEL_RUN_ID) = (SELECT AS STRUCT model_version_date, model_run_idFROM `modeling.model_versions`WHERE model_name = "My Model"AND current_model_flag is TRUE
);-- pull the model's current scores
SELECT external_id, score_name, score_value, datetime_approvedFROM `modeling.scores`WHERE model_version_date = CURRENT_MODEL_VERSION_DATEAND model_run_id = CURRENT_MODEL_RUN_IDAND current_score_flag is TRUE
;

我们如何使用它 (How We Use It)

Having a strong, consolidated schema for model versions and scores makes downstream use cases much cleaner and simpler. Here are a few examples of how this infrastructure is used in our work in the 2020 cycle:

具有用于模型版本和评分的强大，统一的架构，可以使下游用例更加简洁。以下是在2020年周期的工作中如何使用此基础架构的一些示例：

采购发布管道 (Sourcing Publishing Pipelines)

The most mission-critical use of this infrastructure is to serve the most current scores of each model to downstream pipelines in a consistent location and format. Our scoring and loading pipelines run a “materialize model” task that writes a query similar to the one above to a dataset holding one “current” table per model.

此基础架构最关键的用途是以一致的位置和格式将每个模型的最新分数提供给下游管道。我们的计分和加载流水线运行“物化模型”任务，该任务将与上面类似的查询写入到每个模型包含一个“当前”表的数据集中。

With our model and score version bookkeeping managed upstream, our downstream code can always find the model’s current scores in the same place. But more importantly, this approach insulates downstream processes from modeling issues: If a scoring job fails for some reason, or if newly loaded scores do not pass automated quality checks, the model version will not be re-materialized with the problematic scores.

通过在上游管理我们的模型和分数版本簿记，我们的下游代码始终可以在同一位置找到模型的当前分数。但更重要的是，这种方法将下游流程与建模问题隔离开来：如果计分工作由于某种原因而失败，或者如果新加载的分数未通过自动质量检查，则模型版本将不会与有问题的分数重新实现。

模型生命周期的面包屑 (Breadcrumbs for a Model’s Life Cycle)

With Election Day right around the corner, we have to respond quickly and confidently to potential problems with our models. We maintain a complete picture of a model and its scores by linking the model_run_id and scoring_run_id fields to the metadata and artifacts we store in an MLflow tracking instance.

即将到来的选举日，我们必须对我们模型的潜在问题做出Swift而自信的回应。通过将model_run_id和scoring_run_id字段链接到我们存储在MLflow跟踪实例中的元数据和工件，我们可以维护模型及其分数的完整图片。

This allows us to trace a published score back to its scoring job, its training job, and, through our other metadata systems, the exact training data that built the underlying model. This piece is critical for diagnosing issues and anomalies users encounter in the field.

这使我们可以将已发布的分数追溯到评分工作，培训工作，以及通过我们的其他元数据系统，来构建基础模型的确切培训数据。这对于诊断用户在现场遇到的问题和异常现象至关重要。

It’s also a gift to our future selves: when it’s time to revisit our models and improve their future versions, we’ll be working with a complete understanding of their inputs and outputs.

这也是我们未来自我的礼物：当需要重新审视我们的模型并改进其未来版本时，我们将全面了解其输入和输出。

模型还原 (Model Reversions)

Sometimes, we detect a problem with a model after it’s been published. When this happens, we can adjust the current_model_flag in the model_versions table to toggle a model version out of production, or simply suppress scores from a problematic scoring job. After adjusting the metadata in the clearinghouse, we can then re-materialize the model for downstream pipelines and address the lingering issues.

有时，我们会在模型发布后检测出问题。发生这种情况时，我们可以调整model_versions表中的current_model_flag来切换模型版本的停产状态，或仅抑制评分工作有问题的分数。在票据交换所中调整了元数据之后，我们可以为下游管道重新实现模型并解决长期存在的问题。

时间点快照 (Point-In-Time Snapshots)

Political data folks are fanatics for historical analysis, and model estimates can be a key part of that. With the datetime_approved and datetime_deprecated fields, we can reconstruct a model version’s ‘current’ scores from any point in time just as we materialize ‘current’ tables. This can give us a quick view into a model’s predictions over time, or even a snapshot of predictions as of a past Election Day.

政治数据人员是历史分析的狂热者，模型估计可能是其中的关键部分。使用datetime_approved和datetime_deprecated字段，我们可以在实现“当前”表的同时从任何时间点重建模型版本的“当前”分数。这可以使我们快速查看模型随时间变化的预测，甚至可以追溯到过去选举日的预测快照。

结论 (Conclusion)

Our investment in this infrastructure has allowed us to develop and deliver data science products that are more transparent and sustainable than ever before, and we’ll continue building on this infrastructure for the 2022 election cycle and beyond.

我们在基础设施方面的投资使我们能够开发和交付比以往任何时候都更加透明和可持续的数据科学产品，并且我们将在2022年选举周期及以后的基础设施上继续建设。

Interested in making a difference? Join our team.

有兴趣改变吗？加入我们的团队。

翻译自: https://medium.com/democratictech/our-data-science-clearinghouse-e9f12fd4a86

前置交换机数据交换

查看全文

http://www.taodudu.cc/news/show-997463.html

量子相干与量子纠缠_量子分类
知识力量_网络分析的力量
marlin 三角洲_带火花的三角洲湖：什么和为什么？
eda分析_EDA理论指南
简·雅各布斯指数第二部分：测试
抑郁症损伤神经细胞吗_使用神经网络探索COVID-19与抑郁症之间的联系
如何开始使用任何类型的数据？ - 第1部分
机器学习图像源代码_使用带有代码的机器学习进行快速房地产图像分类
COVID-19和世界幸福报告数据告诉我们什么？
lisp语言是最好的语言_Lisp可能不是数据科学的最佳语言，但是我们仍然可以从中学到什么呢？...
python pca主成分_超越“经典” PCA：功能主成分分析（FPCA）应用于使用Python的时间序列...
大数据平台构建_如何像产品一样构建数据平台
时间序列预测时间因果建模_时间序列建模以预测投资基金的回报
贝塞尔修正_贝塞尔修正背后的推理：n-1
php amazon-s3_推荐亚马逊电影-一种协作方法
简述yolo1-yolo3_使用YOLO框架进行对象检测的综合指南-第一部分
数据库:存储过程_数据科学过程：摘要
cnn对网络数据预处理_CNN中的数据预处理和网络构建
消解原理推理_什么是推理统计中的Z检验及其工作原理？
大学生信息安全_给大学生的信息
特斯拉最安全的车_特斯拉现在是最受欢迎的租车选择
ml dl el学习_DeepChem —在生命科学和化学信息学中使用ML和DL的框架
用户参与度与活跃度的区别_用户参与度突然下降
数据草拟：使您的团队热爱数据的研讨会
c++ 时间序列工具包_我的时间序列工具包
adobe 书签怎么设置_让我们设置一些规则…没有Adobe Analytics处理规则
分类预测回归预测_我们应该如何汇总分类预测？
神经网络推理_分析神经网络推理性能的新工具
27个机器学习图表翻译_使用机器学习的信息图表信息组织
面向Tableau开发人员的Python简要介绍（第4部分）

前置交换机数据交换_我们的数据科学交换所相关推荐

深度学习数据自动编码器_如何学习数据科学编码
深度学习数据自动编码器意见 (Opinion) When I first wanted to learn programming, I coded along to a 4 hour long Yo ...
excel导入数据校验_使用Excel数据验证限制日期范围
excel导入数据校验 Yesterday, one of my clients emailed to let me know that she was having trouble entering ...
深度学习数据更换背景_开始学习数据科学的最佳方法是了解其背景
深度学习数据更换背景数据科学教育 (DATA SCIENCE EDUCATION) 目录 (Table of Contents) The Importance of Context Knowledg ...
查询数据库中有多少个数据表_您的数据中有多少汁？
查询数据库中有多少个数据表 97%. That's the percentage of data that sits unused by organizations according to Gart ...
用forall的save exceptions机制高效处理数据交换中的异常数据
在一个数据交换场景中,对方提供一个远程数据库,我方根据时间戳提取增量数据,经转换处理后存到我方数据库的表中. 对方数据有以下特征: 1.增量数据数量较大: 2.数据不规范,部分数据无法直接写入我方数据 ...
matlab与quartus的联合数据交换(NCO与文件数据的混频处理)
文章目录背景再次认识关于DDS的来源实际案例官方资料阅读(NCO IP core) 参数原理通常的步骤工程实例 MATLAB生成波形txt文件 IP配置文件命令语法(官方提示) mode ...
rdd数据存内存数据量_「大数据」（七十七） Spark之IO机制
[导读:数据是二十一世纪的石油,蕴含巨大价值,这是·情报通·大数据技术系列第[77]篇文章,欢迎阅读和收藏] 1 基本概念与传统的 IO 相比, Spark IO 有很大区别.传统的数据存在单个计算 ...
python查看数据大小_科多大数据带你看Python可以列为最值得学习的编程语言
原标题:科多大数据带你看Python可以列为最值得学习的编程语言不知道从什么时候开始,这句话开始流行.不过也从侧面反映出 Python 语言的特点:简单.高效. 从近期代表技术趋势的业界报告以及编程 ...
python爬取淘宝数据魔方_淘宝数据魔方技术架构解析
淘宝网拥有国内最具商业价值的海量数据.截至当前,每天有超过30亿的店铺.商品浏览记录,10亿在线商品数,上千万的成交.收藏和评价数据.如何从这些数据中挖掘出真正的商业价值,进而帮助淘宝.商家进行企业 ...

前置交换机数据交换_我们的数据科学交换所