Experimentation Analysis at Lime

Inaccurate experimentation analysis leads to suboptimal business decisions. As Lime grows, the importance of standardized experiment analysis increases commensurately: with hundreds of thousands more riders on our platform today than two years ago, the wrong business decision can now impact thousands more riders than it could before. In order to stay agile with our testing while maintaining rigor in our experiment analysis, we built an experimentation handbook for teams to reference. The Handbook not only provides guidance on how to best set up and analyze tests, but also establishes a standard set of methods for the Data Science and Analytics team to employ. This blog post includes some sections from our Handbook, which is organized into Pre-Test (test setup), During Test (checking in), and After Test (results) sections.

Pre-Test (Set Up)

Picking an Experimentation Setup

When determining which experiment setup to use at Lime, there are a few key factors to consider.

1) What kind of test are we running (i.e. what are we randomizing on)?

  • Rider Tests: Tests affecting the Rider experience such as UI changes or new promotions are Rider Tests with KPI metrics like Rider Retention. These tests are normally randomized on Riders.

  • Deployment Tests: Tests that affect deployment strategies, such as changes to Scooter Hotspots (if permitted to do so), are deployment tests with KPI metrics like trip start conversion. These tests are generally randomized on Scooters.

  • Hardware Tests: Tests on the Scooter hardware or firmware itself, such as an improvement to scooter parts, are generally randomized on scooters.

2) Do we anticipate the treatment to have network effects?

After determining whether our treatment should be tested via a Rider, Deployment, or Hardware test, we then need to consider whether our test treatment will have network effects, also known as “interference bias”. (As a quick summary, “network effects” occur when the change in behavior for users in the treatment group also ends up affecting the users in the control group, thereby biasing the results. As an example, if we made Lime scooters free for treatment group users, then scooter rides would likely increase enough that users in the control group would have a harder time than normal finding an available scooter to ride, which would mean the control group is no longer acting at the “normal” baseline.)

If we don’t anticipate having network effects, then we can often use an A/B test setup. However, if we do anticipate network effects, then we usually end up using a Pre-Post experiment method or a Switchback test (both of which are defined in section 3b).

3) Which specific test setup should we use?

3a) When there are no anticipated network effects, an A/B test is generally applicable. A/B tests are controlled tests in which some portion of users receive the test treatment and the rest receive the control treatment. The next step is to determine the sizes of the test and control groups. Here is a rough guide for how we decide those sizes at Lime:
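
The group sizes ultimately come out of a standard power analysis (the same calculation referenced in the peeking section below). The following is a minimal sketch assuming a hypothetical baseline conversion rate, minimum detectable lift, and an even split; the numbers are placeholders, not Lime's actual sizing guidance.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical inputs -- placeholders, not Lime's real numbers.
baseline_rate = 0.20         # e.g. baseline trip-start conversion
min_detectable_lift = 0.02   # smallest absolute lift we care to detect
alpha, power = 0.05, 0.80    # significance level and desired statistical power

# Cohen's h effect size for comparing two proportions.
effect_size = proportion_effectsize(baseline_rate + min_detectable_lift, baseline_rate)

# Required sample size per group for a two-sided test with a 50/50 split.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, ratio=1.0, alternative="two-sided"
)
print(f"Need roughly {n_per_group:,.0f} riders per group")
```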

3b) When you do anticipate network effects, there are two main test options we use at Lime: a Pre-Post experiment method or a Switchback test.
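
As a rough illustration of the switchback idea, the sketch below randomizes the treatment at the (region, time window) level rather than at the rider level; the region names, window length, and dates are hypothetical, and this is an illustration rather than Lime's production assignment logic.

```python
import random
from datetime import datetime, timedelta

# Hypothetical switchback schedule: every rider in a given region and time
# window sees the same variant, limiting interference within a market.
regions = ["paris", "san_francisco", "seattle"]  # placeholder region names
window_hours = 6                                  # placeholder switchback window
start = datetime(2020, 7, 1)                      # placeholder start date
num_windows = 8

random.seed(42)  # reproducible assignment
schedule = []
for region in regions:
    for i in range(num_windows):
        schedule.append({
            "region": region,
            "window_start": start + timedelta(hours=i * window_hours),
            "variant": random.choice(["treatment", "control"]),
        })

for row in schedule[:4]:
    print(row)
```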

The decision tree below summarizes the methodology explained above.

Dealing with Overlapping Tests

As Lime scales, we have more tests running simultaneously that could affect each other, thereby introducing bias. If separate tests’ treatments impact the same users and/or the same metrics at the same time, then we can’t be sure what the individual impact of each test truly is (e.g. running two tests that both aim to increase adoption of the Lime Wallet feature). This table outlines Lime’s 4 solutions for running more than one test at a time with similar impacts:
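
As one generic illustration of keeping two tests off the same users (a sketch of hash-based bucketing, not necessarily one of the four solutions in the table), riders can be deterministically assigned to mutually exclusive buckets with a shared salt; the salt, bucket split, and test names below are hypothetical.

```python
import hashlib

def bucket(rider_id: str, salt: str, num_buckets: int = 100) -> int:
    """Deterministically map a rider to a bucket in [0, num_buckets)."""
    digest = hashlib.md5(f"{salt}:{rider_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def assign_exclusive(rider_id: str) -> str:
    """Split riders across two mutually exclusive tests plus a holdout."""
    b = bucket(rider_id, salt="wallet_tests_2020")  # shared salt => exclusive split
    if b < 40:
        return "wallet_promo_test"       # hypothetical test name
    elif b < 80:
        return "wallet_onboarding_test"  # hypothetical test name
    return "holdout"

print(assign_exclusive("rider_123"))
```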

During Test (Checking In)

Checking for Sampling Bias

Whenever running a test, at Lime or elsewhere, we need to ensure that there aren’t inherent biases in our samples in order to trust any insights we draw by comparing the treatment and control groups. The most basic way of ensuring there’s no bias is to run the test long enough to reach a sample size large enough to reduce random bias. In addition to this:

  1. If the test is pre-assigned (users are all exposed at one static point in time), then we do a quick A/A test, which entails splitting our test and control populations to compare basic metrics between the two groups without yet having the test users exposed to any treatment conditions. Once we’ve confirmed that there are no differences in metrics such as trips per user or app sessions per day, we can rule out the potential for sampling bias.

  2. If the test has live assignment (more users are exposed each day), then an A/A test is not a viable method to rule out sampling bias. Instead, once the test has reached completion, we check the similarity of pre-exposure metrics between the test and control groups (similar to a post hoc A/A test).

When comparing groups to ensure similarity, we also do a sample ratio mismatch check (i.e. we run a significance test on the volume of exposed users) to ensure that a difference in the exposure volumes themselves doesn’t introduce bias.
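
Here is a minimal sketch of both checks, using scipy and simulated data; the 50/50 expected split, the thresholds, and the simulated metrics are illustrative assumptions rather than values prescribed by the Handbook.

```python
import numpy as np
from scipy import stats

def aa_check(control_metric: np.ndarray, treatment_metric: np.ndarray, alpha: float = 0.05) -> bool:
    """Compare a pre-exposure metric (e.g. trips per rider) between groups.

    A significant difference before any treatment is applied points to
    sampling bias rather than a treatment effect.
    """
    _, p_value = stats.ttest_ind(control_metric, treatment_metric, equal_var=False)
    return p_value >= alpha  # True => no evidence of imbalance

def srm_check(n_control: int, n_treatment: int, expected_split=(0.5, 0.5), alpha: float = 0.001) -> bool:
    """Sample ratio mismatch check: chi-square test on exposure counts.

    The 0.001 threshold is a common industry convention, assumed here.
    """
    total = n_control + n_treatment
    expected = [total * expected_split[0], total * expected_split[1]]
    _, p_value = stats.chisquare([n_control, n_treatment], f_exp=expected)
    return p_value >= alpha  # True => exposure volumes match the intended split

# Purely illustrative simulated data.
rng = np.random.default_rng(0)
control = rng.poisson(2.0, size=5000)    # pre-exposure trips per rider
treatment = rng.poisson(2.0, size=5050)
print("A/A balanced:", aa_check(control, treatment))
print("No SRM:", srm_check(len(control), len(treatment)))
```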

Dealing with the Peeking Problem

Peeking (looking at test results before reaching the needed sample size (n)) increases the probability of a Type 1 error (rejecting a true null hypothesis). The most common way to deal with this is to calculate the required n before starting the test and then only make a decision once you have reached that sample size. However, the drawbacks to that solution are that:

  1. We want to peek to stop bad things from happening or to catch flawed experimental design >> Fix: we can still do this as long as we don’t report the interim results officially; we can just use it as a safeguard in case something negative occurs.
  2. We become too reliant on our power analysis calculations from before the test starts (i.e. an a priori effect size), e.g. we overestimate the effect we expect to see or need to see from a business perspective >> Fix: we can always err on the side of a low effect size, like 2% instead of 5%. (This means a higher required n, though, so it is not very sustainable.)
  3. The p = .08 problem: it’s clear that we are almost at significance and moving directionally that way, but we need to let the experiment run a bit longer and collect more n to get to the 5% threshold >> Fix: we can always err on the side of requiring a larger n than we think we need. But again, this doesn’t seem like the optimal solution.

At Lime, our solution is to run a normal power analysis before the experiment starts to calculate the day on which we should have a sufficient sample size to check results. We check the final results only once, on that planned day, and require a 95% confidence level to claim the results are statistically significant. If we want to check results before that time (or are requested to do so), then we can only claim results are stat sig if they meet a 99.9% confidence level as opposed to the initial 95%. This reduces the likelihood that we falsely reject the null hypothesis by recalculating the impact multiple times. We decided that this solution is better suited to us than using Bayes Factors (another common solution in the industry) since it is a simpler calculation and we are still building out our internal experimentation analysis tools.
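
A minimal sketch of this policy: the planned check day falls out of the required per-group n from the power analysis and an estimate of daily exposures, and a 99.9% confidence level corresponds to alpha = 0.001 for any earlier peek. The specific numbers below are hypothetical.

```python
import math

def planned_check_day(required_n_per_group: float, daily_exposures_per_group: float) -> int:
    """Day on which each group should reach the sample size from the power analysis."""
    return math.ceil(required_n_per_group / daily_exposures_per_group)

def significance_threshold(day: int, check_day: int) -> float:
    """Alpha to apply: 0.05 on (or after) the planned check day, 0.001 for any earlier peek."""
    return 0.05 if day >= check_day else 0.001

# Hypothetical example: the power analysis asked for 25,000 riders per group
# and we expect to expose roughly 2,000 riders per group per day.
check_day = planned_check_day(25_000, 2_000)            # -> day 13
print(check_day, significance_threshold(5, check_day))  # early peek uses alpha = 0.001
```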

After Test (Results and Roll Out)

Picking a Statistical Test to Use

It’s important for our teams to be aligned on which statistical test to use when, so that we are reading impact in a consistent way. Here are the guidelines that we use at Lime:

  • For continuous metrics (e.g. revenue per trip), we are comparing means and can generally use a two-tailed t-test. If the sample size is small, we use a bootstrap t-test. (A rough sketch of these choices follows this list.)
  • For ratios (e.g. WoW retention), we also recommend a t-test. We use a t-test instead of a z-test because z-tests require more statistical power and are generally only more efficient and powerful when the metric being analyzed is binary.
  • The above two bullet points assume the metrics have a normal distribution, but at Lime we often deal with skewed data (e.g. trips per user is left-skewed), in which case we use the Wilcoxon rank-sum test, which does not require an assumption about the distribution.
  • For categorical variables, we would propose using a chi-square test. However, categorical metrics are currently rarely used at Lime, so we haven’t built this into our experiment analysis pipelines.
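
Here is a rough sketch of how the t-test and Wilcoxon branches above might map to code using scipy; the simulated metrics and the skewness flag are illustrative, the bootstrap t-test and chi-square branches are omitted for brevity, and this is not Lime's actual analysis pipeline.

```python
import numpy as np
from scipy import stats

def compare_groups(control: np.ndarray, treatment: np.ndarray, skewed: bool = False):
    """Pick the comparison roughly along the lines described above.

    - Wilcoxon rank-sum for skewed metrics (no normality assumption).
    - Two-tailed t-test (Welch's variant, an assumption here) otherwise.
    """
    if skewed:
        _, p_value = stats.ranksums(control, treatment)
        return "wilcoxon_rank_sum", p_value
    _, p_value = stats.ttest_ind(control, treatment, equal_var=False)
    return "t_test", p_value

# Illustrative usage with simulated metrics.
rng = np.random.default_rng(1)
revenue_per_trip = (rng.normal(3.2, 0.8, 4000), rng.normal(3.25, 0.8, 4000))
trips_per_rider = (rng.poisson(2, 4000), rng.poisson(2, 4000))  # skewed count metric
print(compare_groups(*revenue_per_trip))
print(compare_groups(*trips_per_rider, skewed=True))
```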

How to Accurately Run Multiple Comparisons

The issue with running multiple comparisons (e.g. if you compare a metric for each country exposed to the treatment) is that the statistical probability of incorrectly rejecting a true null hypothesis increases significantly as the number of simultaneously tested hypotheses increases. There are many different ways to help correct this issue; each correction method introduces a trade-off between correcting for false positives and reporting too many false negatives. Given that each experiment is nuanced, this trade-off must be made to best suit the needs of the business. At Lime, we use the Benjamini–Hochberg (BH) procedure to adjust the p-value, which is not only straightforward to implement but also less strict than the Bonferroni correction method (another method widely used in industry), which assumes that each test is independent of the others.
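
For example, the BH adjustment can be applied to a set of per-country p-values with statsmodels; the p-values below are placeholders.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical per-country p-values from one experiment (placeholder numbers).
p_values = [0.002, 0.011, 0.028, 0.047, 0.120, 0.450]

# Benjamini-Hochberg adjustment controls the false discovery rate at 5%.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for raw, adj, sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={raw:.3f}  BH-adjusted p={adj:.3f}  significant={sig}")
```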

This is a simplified version of how we choose a hypothesis test but shows the main framework. The most common additional test we use is ANOVA for comparing metric impact across multiple variants.

Extrapolating to Topline

Very often people look at the impact seen in a test (e.g. +5% retention) and then assume that’s the lift on the topline; however, this is inaccurate if the test is only exposed to a certain population of riders (e.g. only new riders or only Paris riders). As an example, in San Francisco, Lime Riders need to lock their scooters to a bike rack after ending a ride. If we make a change to the locking process, test the improvement only in San Francisco, and see a +10% increase in trips taken, can we claim that we’ll see +10% more trips globally? No, because only San Francisco was exposed to this test treatment, which affects a feature (lock-to) that is not present in all global markets.

Accurately calculating the topline impact is pretty straightforward: it’s generally solved by multiplying the impact we see by the percentage of users who are eligible for the treatment. Therefore, our proposed solution is to determine this percentage when sizing the test (this is often the same as the trigger point, e.g. the proportion of people who open the wallet out of all exposed users who open their app) and then to create a table like this for stakeholders to easily understand the global topline impact.
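
The calculation itself is a one-liner. In the lock-to example, if we assume (hypothetically) that only 15% of global trips happen in markets with the lock-to feature, a +10% lift in the test translates to roughly +1.5% on the topline.

```python
def topline_lift(observed_lift: float, eligible_share: float) -> float:
    """Scale the lift seen in the test by the share of riders eligible for the treatment."""
    return observed_lift * eligible_share

# Hypothetical example: +10% trips in the San Francisco lock-to test,
# with an assumed ~15% of global trips happening in lock-to markets.
print(f"{topline_lift(0.10, 0.15):.1%} expected global lift")  # -> 1.5%
```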

Summary

By taking the time to standardize experiment analysis methodologies within the engineering team at Lime, we are not only able to maintain consistent rigor in our analysis across the board, but we are also now able to build our first experimentation analysis platform and incorporate some of these principles into it. Our analysis platform already includes principles such as using BH corrections when measuring significance and will soon include other automated checks such as A/A tests. As a team, we have many more topics to discuss and standardize (e.g. the best way to check for network effects via counter-metrics), so stay tuned for future blog posts!

Acknowledgments

Many members of Lime’s Data Science and Analytics team contributed to our handbook including Tristan Taru, Dounan Tang, Jeh Lokhande, Siyi Luo, and Ben Laufer.

Originally published at: https://medium.com/lime-eng/experimentation-analysis-at-lime-bee846d62dd
