A/B Testing

The idea of A/B testing is to present different content to different variants (user groups), gather their reactions and behaviour, and use the results to shape future product or marketing strategies.

A/B testing is a methodology for comparing multiple versions of something (a feature, page, button, headline, page structure, form, landing page, navigation, pricing, etc.) by showing the different versions to customers or prospective customers and assessing the quality of interaction with some metric (click-through rate, purchases, follow-through on a call to action, etc.).

This is becoming increasingly important in a data-driven world where business decisions need to be backed by facts and numbers.

How to conduct a standard A/B test

1. Formulate your hypothesis
2. Decide on splitting and evaluation metrics
3. Create your control group and test group
4. Length of the A/B test
5. Conduct the test
6. Draw conclusions

1. Formulate your hypothesis

Before conducting an A/B test, state your null hypothesis and alternative hypothesis:

The null hypothesis states that there is no difference between the control and variant groups.
The alternative hypothesis states that there is a difference between the control and variant groups.

Imagine a software company that is looking for ways to increase the number of people who pay for its software. As the software is currently set up, users can download and use it free of charge for a 7-day trial. The company wants to change the layout of the homepage to emphasise, with a red logo instead of a blue logo, that a 7-day trial of the company’s software is available.

Here is an example of a hypothesis test:

Default action: Approve blue logo.
Alternative action: Approve red logo.
Null hypothesis: Red logo does not cause at least 10% more license purchases than blue logo.
Alternative hypothesis: Red logo does cause at least 10% more license purchases than blue logo.

It’s important to note that all other variables need to be held constant when performing an A/B test.

2. Deciding on Splitting and Evaluation Metrics

We should consider two things: where and how to split users into experiment groups when they enter the website, and what metrics we will use to track the success or failure of the experimental manipulation. The choice of unit of diversion (the point at which we divide observations into groups) may affect which evaluation metrics we can use.

The control, or ‘A’ group, will see the old homepage, while the experimental, or ‘B’ group, will see the new homepage that emphasises the 7-day trial.

There are three different diversion (splitting) techniques:

a) Event-based diversion
b) Cookie-based diversion
c) Account-based diversion

An event-based diversion (like a pageview) can provide many observations to draw conclusions from, but if the condition changes on each pageview, then a visitor might get a different experience on each homepage visit. Event-based diversion is much better when the changes aren’t as easily visible to users, to avoid disruption of experience.

In addition, event-based diversion would let us know how many times the download page was accessed from each condition, but can’t go any further in tracking how many actual downloads were generated from each condition.

Account-based diversion is stable, but it is not suitable in this case. Since visitors only register after getting to the download page, that is too late to show the new homepage to people who should be assigned to the experimental condition.

That leaves cookie-based diversion, which feels like the right choice. Cookies also allow tracking of each visitor hitting each page. The downside of cookie-based diversion is that the counts become somewhat inconsistent if users enter the site via an incognito window or a different browser, or if their cookies expire or get deleted before they make a download. As a simplification, however, we’ll assume that this kind of assignment dilution will be small, and ignore its potential effects.
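
Cookie-based diversion can be sketched as a deterministic hash of the cookie ID. This is a minimal illustration, not the setup used in the experiment: the function name `assign_group`, the salt string, and the 50/50 split are all assumptions. Hashing keeps the assignment stable, so a returning visitor with the same cookie sees the same homepage.

```python
import hashlib

def assign_group(cookie_id: str, salt: str = "homepage-test") -> str:
    """Deterministically map a cookie ID to 'control' or 'experiment'."""
    # Hash the salted cookie ID so the same cookie always lands in the same group.
    digest = hashlib.sha256(f"{salt}:{cookie_id}".encode("utf-8")).hexdigest()
    # Reduce the hash to a bucket in [0, 100); buckets below 50 go to control.
    bucket = int(digest, 16) % 100
    return "control" if bucket < 50 else "experiment"

if __name__ == "__main__":
    # Hypothetical cookie IDs, for illustration only.
    for cid in ["cookie-001", "cookie-002", "cookie-003"]:
        print(cid, assign_group(cid))
```

Changing the salt per experiment would reshuffle visitors, which helps keep separate experiments from sharing the same split.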

For evaluation metrics, we should prefer rates measured relative to the number of cookies: the download rate (# downloads / # cookies) and the purchase rate (# licenses / # cookies).
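
As a small sketch, both metrics are simple ratios per cookie. The helper name `rate` and the counts below are hypothetical and only illustrate the arithmetic.

```python
def rate(events: int, cookies: int) -> float:
    """Return events per cookie, or 0.0 if no cookies were recorded."""
    return events / cookies if cookies else 0.0

# Hypothetical counts for a single group, for illustration only.
cookies, downloads, licenses = 1000, 160, 20

download_rate = rate(downloads, cookies)   # downloads / cookies
purchase_rate = rate(licenses, cookies)    # licenses / cookies
print(f"download rate: {download_rate:.3f}, purchase rate: {purchase_rate:.3f}")
```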

Product usage statistics like the average time the software was used in the trial period are potentially interesting features, but aren’t directly related to our experiment. Certainly, these statistics might help us dig deeper into the reasons for observed effects after an experiment is complete. But in terms of experiment success, product usage shouldn’t be considered as an evaluation metric.

3. Create your control group and test group

Once you determine your null and alternative hypotheses, the next step is to create your control and test (variant) groups. There are two important concepts to consider in this step: sampling and sample size.

Sampling: Random sampling is one of the most common sampling techniques. Each member of a population has an equal chance of being chosen. Random sampling is important in hypothesis testing because it guards against sampling bias, and that matters because you want the results of your A/B test to be representative of the entire population rather than of the sample itself.

A problem with A/B tests is that if you haven’t defined your target group properly, or you’re in the early stages of your product, you may not know a lot about your customers. If you’re not sure who they are (try creating some user personas to get started!), you might end up with misleading results. It is important to understand which sampling method suits your use case.

Sample size: It’s essential to determine the minimum sample size for your A/B test before conducting it, so that you can avoid undercoverage bias, the bias that comes from sampling too few observations.
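
One common textbook formula for the minimum sample size per group when comparing two proportions is shown below. It is offered as a sketch under stated assumptions (a one-sided test at level α, power 1 − β, equal group sizes); the symbols p_1 (baseline rate) and p_2 (rate you want to be able to detect) are introduced here for illustration and are not from the original text.

```latex
n = \frac{\left( z_{1-\alpha}\,\sqrt{2\,\bar{p}\,(1-\bar{p})}
        + z_{1-\beta}\,\sqrt{p_1(1-p_1) + p_2(1-p_2)} \right)^{2}}
       {(p_2 - p_1)^{2}},
\qquad \bar{p} = \frac{p_1 + p_2}{2}
```

The smaller the difference p_2 − p_1 you want to detect, the larger n becomes, which is why small effects need long experiments.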

4. Length of the A/B test

A calculator like this one can help you determine the length of time you need to get any real significance from your A/B tests.

Historical data shows that there are about 3250 unique visitors per day. There are about 520 software downloads per day (a .16 rate) and about 65 licenses purchased each day (a .02 rate). In an ideal case, both the download rate and the license purchase rate should increase with the new homepage; a statistically significant negative change should be a sign not to deploy the homepage change. However, if only one of our metrics shows a statistically significant positive change, we should be happy enough to deploy the new homepage.

Performing both individual tests at a .05 error rate carries the risk of making too many Type I errors. As such, we’ll apply the Bonferroni correction and run each test at a .025 error rate, which keeps the overall Type I error rate at 5%. With 80% power, we should require 6 days to reliably detect an increase of 50 downloads per day and 21 days to detect an increase of 10 license purchases per day.

Use the link above for the test-days calculations.

For the download rate:
Estimated existing conversion rate (%): 16%
Minimum improvement in conversion rate you want to detect (%): 50/520 × 100% (about 9.6%)
Number of variations/combinations (including control): 2
Average number of daily visitors: 3250
Percent visitors included in test: 100%
Total number of days to run the test: 6 days

For the license purchase rate:
Estimated existing conversion rate (%): 2%
Minimum improvement in conversion rate you want to detect (%): 10/65 × 100% (about 15.4%)
Number of variations/combinations (including control): 2
Average number of daily visitors: 3250
Percent visitors included in test: 100%
Total number of days to run the test: 21 days
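
The duration estimate can be approximated without an online calculator. The sketch below assumes a one-sided two-proportion z-test with the Bonferroni-corrected .025 error rate and 80% power; the function names are hypothetical, and the formula is a common textbook one that may differ from whatever the calculator uses internally. With the figures quoted above it lands close to the 6 and 21 days reported, and a difference of a day or so is expected from rounding and formula choices.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p1: float, p2: float, alpha: float, power: float) -> int:
    """Minimum observations per group for a one-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)   # critical value for the error rate
    z_beta = NormalDist().inv_cdf(power)        # critical value for the power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)

def days_needed(p1: float, p2: float, daily_visitors: int,
                alpha: float = 0.025, power: float = 0.80, groups: int = 2) -> int:
    """Days required when all daily visitors are split evenly across the groups."""
    n = sample_size_per_group(p1, p2, alpha, power)
    return math.ceil(groups * n / daily_visitors)

visitors = 3250
# Detect 50 extra downloads per day on a base of 520.
download_days = days_needed(520 / visitors, 570 / visitors, visitors)
# Detect 10 extra license purchases per day on a base of 65.
license_days = days_needed(65 / visitors, 75 / visitors, visitors)
print(f"download rate test: {download_days} days, license test: {license_days} days")
```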

One thing that isn’t accounted for in the base experiment length calculations is that there is going to be a delay between when users download the software and when they actually purchase a license. That is, when we start the experiment, there could be about seven days before a user account associated with a cookie actually comes back to make their purchase. Any purchases observed within the first week might not be attributable to either experimental condition. As a way of accounting for this, we’ll run the experiment for about one week longer to allow those users who come in during the third week a chance to come back and be counted in the license purchases tally.

As for biases, we don’t expect users to come back to the homepage regularly. Downloading and license purchasing are actions we expect to occur only once per user, so there’s no real ‘return rate’ to worry about. One possibility, however, is that if more people download the software under the new homepage, the expanded user base is qualitatively different from the people who came to the page under the original homepage. This might cause more homepage hits from people looking for the support pages on the site, causing the number of unique cookies under each condition to differ. If we do see something wrong or out of place in the invariant metric (number of cookies), then this might be an area to explore in further investigations.
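
A check on the invariant metric can be sketched as a test of whether the cookies split evenly between the two conditions. The function name `split_check`, the 50/50 expectation, and the cookie counts are assumptions for illustration; the original gives no per-group counts.

```python
import math
from statistics import NormalDist

def split_check(control_cookies: int, experiment_cookies: int,
                expected_share: float = 0.5) -> tuple[float, float]:
    """Two-sided z-test of whether control's share of cookies matches the expected share."""
    total = control_cookies + experiment_cookies
    observed = control_cookies / total
    # Standard error of a proportion under the expected share.
    se = math.sqrt(expected_share * (1 - expected_share) / total)
    z = (observed - expected_share) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical cookie counts per condition.
z, p = split_check(34_000, 34_300)
print(f"z = {z:.2f}, p = {p:.3f}")
```

A small p-value here would suggest the groups are not balanced as intended, which is worth investigating before trusting the evaluation metrics.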

5. Conduct the test

Once you conduct your experiment and collect your data, you want to determine if the difference between your control group and variant group is statistically significant. There are a few steps in determining this:

First, set your alpha, the probability of making a Type I error. Typically alpha is set at 5%, or 0.05.
Second, determine the probability value (p-value) by first calculating a test statistic, either a t-statistic or a z-score.
Lastly, compare the p-value to alpha. If the p-value is greater than alpha, do not reject the null!
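
The three steps above can be sketched with a pooled two-proportion z-test. This is one standard way to obtain the test statistic, not necessarily the one the author used; the function name, the one-sided alternative (B better than A), and the counts are assumptions for illustration.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> tuple[float, float]:
    """One-sided z-test of H1: rate in B > rate in A, using a pooled standard error."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value

alpha = 0.05  # step 1: choose the Type I error rate
# Step 2: compute the statistic and p-value (hypothetical counts).
z, p_value = two_proportion_z_test(160, 1000, 200, 1000)
# Step 3: compare the p-value with alpha.
decision = "reject the null" if p_value < alpha else "do not reject the null"
print(f"z = {z:.2f}, p = {p_value:.4f}: {decision}")
```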

5.1 Use actual statistics to compare the results

Do not rely on simple one-to-one comparisons of metrics to dictate what works and what does not. “Version A yields a 20 percent conversion rate and Version B yields a 22 percent conversion rate, therefore we should switch to Version B!” Please do not do this. Use actual confidence intervals, z-scores, and statistically significant data.
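
To see why the raw comparison is not enough, here is a sketch of a normal-approximation (Wald) confidence interval for the difference between the 20% and 22% conversion rates. The sample sizes are hypothetical, since the quote gives none, and the function name is illustrative. With 1,000 visitors per version the interval includes zero, while with 20,000 per version it does not, so the same two percentages can support opposite conclusions.

```python
import math
from statistics import NormalDist

def diff_confidence_interval(conv_a: int, n_a: int, conv_b: int, n_b: int,
                             confidence: float = 0.95) -> tuple[float, float]:
    """Wald confidence interval for (rate of B - rate of A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# 20% vs 22% conversion at two hypothetical sample sizes per version.
samples = [(200, 1_000, 220, 1_000), (4_000, 20_000, 4_400, 20_000)]
for conv_a, n_a, conv_b, n_b in samples:
    low, high = diff_confidence_interval(conv_a, n_a, conv_b, n_b)
    print(f"n = {n_a} per version: 95% CI for the difference = ({low:.4f}, {high:.4f})")
```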

5.2 Product Growth

Changing colours and layout may have a marginal impact on your key performance metrics. However, these results seem to be very short-lived. Product growth does not result from changing a button from red to blue; it comes from building a product that people want to use.

Instead of choosing a feature that you think might work, you can use an A/B test to find out what works.

5.3 Analyse Data

For the first evaluation metric, download rate, there was an extremely convincing effect. An absolute increase from 0.1612 to 0.1805 results in a z-score of 7.87 (z-score = (0.1805 − 0.1612) / SE, with a standard error of about 0.0025) and a p-value < .00001, well beyond any standard significance bound. However, the second evaluation metric, license purchasing rate, shows only a small increase from 0.0210 to 0.0213 (following the assumption that only the first 21 days of cookies account for all purchases). This results in a p-value of 0.398 (z = 0.26).
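
The reported p-values can be recovered from the z-scores. The sketch below assumes the p-values are one-sided, which is consistent with the 0.398 reported for z = 0.26, and compares each against the Bonferroni-corrected .025 error rate; the function name is illustrative.

```python
from statistics import NormalDist

def one_sided_p_value(z: float) -> float:
    """Upper-tail probability of a standard normal at z."""
    return 1 - NormalDist().cdf(z)

bonferroni_alpha = 0.05 / 2  # two metrics tested, each at .025

for name, z in [("download rate", 7.87), ("license purchase rate", 0.26)]:
    p = one_sided_p_value(z)
    verdict = "significant" if p < bonferroni_alpha else "not significant"
    print(f"{name}: z = {z}, p = {p:.5f} ({verdict})")
```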

6. Draw Conclusions

Despite the fact that statistical significance wasn’t obtained for the number of licenses purchased, the new homepage appeared to have a strong effect on the number of downloads made. Based on our goals, this seems enough to suggest replacing the old homepage with the new homepage. Establishing whether there was a significant increase in the number of license purchases, either through the rate or the increase in the number of homepage visits, will need to wait for further experiments or data collection.

One inference we might like to make is that the new homepage attracted new users who would not normally try out the program, but that these new users didn’t convert to purchases at the same rate as the existing user base. This is a nice story to tell, but we can’t actually say that with the data as given. In order to make this inference, we would need more detailed information about individual visitors that isn’t available. However, if the software did have the capability of reporting usage statistics, that might be a way of seeing if certain profiles are more likely to purchase a license. This might then open additional ideas for improving revenue.

Translated from: https://towardsdatascience.com/how-to-conduct-a-b-testing-3076074a8458
