Computational Causal Inference at Netflix

Jeffrey Wong, Colin McFarland

Every Netflix data scientist, whether their background is in biology, psychology, physics, economics, math, statistics, or biostatistics, has made meaningful contributions to the way Netflix analyzes causal effects. Scientists from these fields have made many advances in causal effects research over the past few decades, spanning instrumental variables, forest methods, heterogeneous effects, time-dynamic effects, quantile effects, and much more. These methods can provide rich information for decision making, such as in experimentation platforms (“XP”) or in algorithmic policy engines.

We want to amplify the effectiveness of our researchers by providing them with software that can estimate causal effects models efficiently and can integrate causal effects into large engineering systems. This can be challenging, because an algorithm for causal effects needs to fit a model, condition on context and on the possible actions to take, score the response variable, and compute differences between counterfactuals. Computation can explode and become overwhelming when this is done with large datasets, with high-dimensional features, with many possible actions to choose from, and with many responses. To gain broad software integration of causal effects models, a significant investment in software engineering, especially in computation, is needed. To address these challenges, Netflix has been building an interdisciplinary field across causal inference, algorithm design, and numerical computing, which we now want to share with the rest of the industry as computational causal inference (CompCI). A whitepaper detailing the field can be found here.
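The four steps named above (fit, condition, score, difference) can be made concrete with a small sketch. This is only an illustration under stated assumptions, not Netflix's implementation: the data are simulated, the model is a plain additive linear regression, and the names `design`, `scores`, and `effects` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_actions = 5_000, 3

# Simulated data (assumption): one context feature and a randomly assigned action.
x = rng.normal(size=n)
action = rng.integers(0, n_actions, size=n)
true_effect = np.array([0.0, 1.0, 2.5])  # effect of each action relative to action 0
y = 0.5 * x + true_effect[action] + rng.normal(size=n)

def design(x, action):
    """Intercept, context feature, and one dummy per non-baseline action."""
    dummies = (action[:, None] == np.arange(1, n_actions)).astype(float)
    return np.column_stack([np.ones(len(x)), x, dummies])

# Step 1: fit the model on the observed (context, action, response) data.
beta, *_ = np.linalg.lstsq(design(x, action), y, rcond=None)

# Steps 2-3: condition on each possible action and score the response for every unit.
scores = np.column_stack(
    [design(x, np.full(n, a)) @ beta for a in range(n_actions)]
)

# Step 4: difference the counterfactual scores against the baseline action.
effects = (scores - scores[:, [0]]).mean(axis=0)
print(effects)
```

Even in this toy case the scoring step produces an `n x n_actions` array, which hints at why the work grows quickly as observations, features, actions, and responses multiply.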

Computational causal inference brings a software implementation focus to causal inference, especially with regard to high performance numerical computing. We are implementing several algorithms to be highly performant, with a low memory footprint. As an example, our XP is pivoting away from two-sample t-tests to models that estimate average effects, heterogeneous effects, and time-dynamic treatment effects. These effects help the business understand the user base, different segments in the user base, and whether there are trends in segments over time. We also take advantage of user covariates throughout these models in order to increase statistical power. While this rich analysis helps to inform business strategy and increase member joy, the volume of the data demands large amounts of memory, and estimating causal effects on such a volume of data is computationally heavy.
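To illustrate why covariates increase statistical power, the sketch below compares a regression on the treatment flag alone (equivalent to a pooled-variance two-sample t-test) with a regression that also adjusts for a pre-treatment covariate. The simulated data, the effect sizes, and the helper `ols` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Simulated experiment (assumption): random assignment and one pre-treatment covariate.
treat = rng.integers(0, 2, size=n).astype(float)
x = rng.normal(size=n)
y = 0.1 * treat + 2.0 * x + rng.normal(size=n)

def ols(X, y):
    """Return OLS coefficients and homoskedastic standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return beta, np.sqrt(sigma2 * np.diag(XtX_inv))

ones = np.ones(n)
# Treatment flag only: same estimate and standard error as a pooled two-sample t-test.
b_raw, se_raw = ols(np.column_stack([ones, treat]), y)
# Adding the covariate absorbs outcome variance that is unrelated to treatment.
b_adj, se_adj = ols(np.column_stack([ones, treat, x]), y)

print("unadjusted:", b_raw[1], se_raw[1])
print("adjusted:  ", b_adj[1], se_adj[1])
```

Both estimates target the same average effect, but the adjusted standard error is smaller because the covariate explains much of the outcome variance in this simulation.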

In the past, the computations for covariate-adjusted heterogeneous effects and time-dynamic effects were slow, memory heavy, hard to debug, a large source of engineering risk, and ultimately could not scale to many large experiments. Using optimizations from CompCI, we can estimate hundreds of conditional average effects and their variances on a dataset with 10 million observations in 10 seconds, on a single machine. In the extreme, we can also analyze conditional time-dynamic treatment effects for hundreds of millions of observations on a single machine in less than one hour. To achieve this, we leverage a software stack that is completely optimized for sparse linear algebra, a lossless data compression strategy that can reduce data volume, and mathematical formulas that are optimized specifically for estimating causal effects. We also optimize for memory and data alignment.
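The post does not spell out the compression strategy. One well-known way to compress data losslessly for linear models, shown here purely as an assumption about what such a strategy could look like, is to keep one row per distinct feature combination together with the count, the sum of the response, and the sum of its squares. Ordinary least squares coefficients and standard errors can then be recovered from the compressed rows.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

# Simulated data (assumption): discrete features, so many rows share the same values.
treat = rng.integers(0, 2, size=n)
segment = rng.integers(0, 5, size=n)
y = 1.0 + 0.3 * treat + 0.2 * segment + rng.normal(size=n)
X = np.column_stack([np.ones(n), treat, segment]).astype(float)

# Compress: one row per distinct feature combination plus sufficient statistics.
Xc, inverse = np.unique(X, axis=0, return_inverse=True)
inverse = inverse.ravel()
count = np.bincount(inverse)
sum_y = np.bincount(inverse, weights=y)
sum_y2 = np.bincount(inverse, weights=y**2)

# OLS from the compressed rows only.
XtX = Xc.T @ (Xc * count[:, None])
Xty = Xc.T @ sum_y
beta_c = np.linalg.solve(XtX, Xty)
# Residual sum of squares: y'y - 2 b'X'y + b'X'Xb.
rss_c = sum_y2.sum() - 2 * beta_c @ Xty + beta_c @ XtX @ beta_c
se_c = np.sqrt(rss_c / (n - X.shape[1]) * np.diag(np.linalg.inv(XtX)))

# Reference fit on the full data.
beta_f, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_f
se_f = np.sqrt(resid @ resid / (n - X.shape[1]) * np.diag(np.linalg.inv(X.T @ X)))

print(Xc.shape[0], "compressed rows instead of", n)
```

In this simulation the 100,000 observations collapse to 10 rows while the estimates match the full-data fit up to floating-point error. The saving depends on how many distinct feature combinations the data contain, so continuous features would compress far less.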

This level of computing affords us a lot of luxury. First, the ability to scale complex models means we can deliver rich insights for the business. Second, being able to analyze large datasets for causal effects in seconds increases research agility. Third, analyzing data on a single machine makes debugging easy. Finally, the scalability makes computation for large engineering systems tractable, reducing engineering risk.

Computational causal inference is a new, interdisciplinary field that we are announcing because we want to build it collectively with the broader community of experimenters, researchers, and software engineers. The integration of causal inference into engineering systems can lead to a great deal of innovation. Because the field is interdisciplinary, it truly requires the community of local, domain experts to unite. We have released a whitepaper to begin the discussion. There, we describe the rising demand for scalable causal inference in research and in software engineering systems. Then, we describe the state of common causal effects models. Afterwards, we describe what we believe can be a good software framework for estimating and optimizing for causal effects.

Finally, we close the CompCI whitepaper with a series of open challenges that we believe require interdisciplinary collaboration and that the community can unite around. For example:

  1. Time-dynamic treatment effects are notoriously hard to scale. They require a panel of repeated observations, which generates large datasets. The observations are also autocorrelated, which complicates estimating the variance of the causal effect. How can we make the computation for the time-dynamic treatment effect, and its distribution, more scalable?
  2. In machine learning, specifying a loss function and optimizing it using numerical methods allows a developer to interact with a single, umbrella framework that can span several models. Can such an umbrella framework exist to specify different causal effects models in a unified way? For example, could it be done through the generalized method of moments? Can it be computationally tractable?
  3. How should we develop software that understands if a causal parameter is identified? A solution to this helps to create software that is safe to use, and can provide safe, programmatic access to the analysis of causal effects. We believe there are many edge cases in identification that require an interdisciplinary group to solve.
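To make the first challenge more tangible, the sketch below shows how repeated observations per user distort a naive variance estimate, and how a standard cluster-robust (sandwich) estimator accounts for the within-user correlation. This is a textbook technique used for illustration, not necessarily the approach taken in the whitepaper; the simulated panel and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n_users, n_periods = 2_000, 10

# Simulated panel (assumption): treatment is assigned per user, and a persistent
# user-level effect makes each user's repeated observations correlated.
treat_user = rng.integers(0, 2, size=n_users).astype(float)
user_effect = rng.normal(size=n_users)
noise = rng.normal(size=(n_users, n_periods))
y = (0.2 * treat_user + user_effect)[:, None] + noise  # shape (users, periods)

X_user = np.column_stack([np.ones(n_users), treat_user])
X = np.repeat(X_user, n_periods, axis=0)  # long format, rows ordered user by user
y_long = y.ravel()                        # same user-by-user ordering

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y_long
resid = (y_long - X @ beta).reshape(n_users, n_periods)

# Naive standard errors treat all rows as independent.
sigma2 = (resid**2).sum() / (len(y_long) - X.shape[1])
se_naive = np.sqrt(sigma2 * np.diag(XtX_inv))

# Cluster-robust standard errors: sum the score contributions within each user first.
# Features are constant within a user here, so each score is x_user * sum of residuals.
user_scores = X_user * resid.sum(axis=1)[:, None]
meat = user_scores.T @ user_scores
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("naive SE:  ", se_naive[1])
print("cluster SE:", se_cluster[1])
```

In this simulation the naive standard error is less than half the cluster-robust one, so ignoring the correlation would overstate precision. The per-user aggregation is cheap here only because the panel is balanced and the features do not vary over time, which is part of what makes the general case hard to scale.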

We hope this begins the discussion, and over the coming months we will be sharing more on the research we have done to make estimation of causal effects performant. There are still many more challenges in the field that are not listed here. We want to form a community spanning experimenters, researchers, and software engineers to learn about problems and solutions together. If you are interested in being part of this community, please reach us at compci-public@netflix.com.

Source: https://netflixtechblog.com/computational-causal-inference-at-netflix-293591691c62
