Beyond A/B Testing: Multi-armed Bandit Experiments

Author: Shaw Lu

Compiled and translated (Chinese edition): ronghuaiyang

Overview

Learn the statistics behind Google Analytics experiments: running a k-armed bandit experiment with Thompson sampling and Monte Carlo simulation.

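A/B Testing Recap

An A/B test is a classical statistical test that relies on statistical significance. When we come up with a new product feature, we may want to test whether it is useful before releasing it to the entire user base. The test involves two groups: a treatment group (which has access to the new feature) and a control group. We then measure a key metric for both groups: average time on site (social networks), average checkout time (e-commerce), or click-through rate (online advertising). Finally, we test whether the difference between the groups is statistically significant.

Classical statistical tests (z-test, t-test) guarantee that the false positive rate is no greater than α, which is usually set to 5%. This means that when there is no difference between the treatment and control groups, the test will still report a statistically significant difference 5% of the time, purely by chance.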
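To make the 5% guarantee concrete, here is a minimal sketch of an A/A simulation: both groups share the same true click-through rate, so every "significant" result is a false positive. The pooled two-proportion z-test, the 5% click-through rate, the group size, and all function names are illustrative assumptions, not taken from the original article.

```python
import math
import random
from statistics import NormalDist


def two_proportion_z_test(clicks_a, clicks_b, n, alpha=0.05):
    """Return True if a pooled two-sided z-test rejects 'no difference'."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:  # no clicks (or only clicks) in both groups: nothing to test
        return False
    z = (clicks_a / n - clicks_b / n) / se
    return abs(z) > NormalDist().inv_cdf(1 - alpha / 2)


def false_positive_rate(ctr=0.05, n=500, trials=1000, alpha=0.05, seed=0):
    """Run repeated A/A tests and return the share flagged as significant."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # both groups are drawn from the same true click-through rate
        clicks_a = sum(rng.random() < ctr for _ in range(n))
        clicks_b = sum(rng.random() < ctr for _ in range(n))
        rejections += two_proportion_z_test(clicks_a, clicks_b, n, alpha)
    return rejections / trials


rate = false_positive_rate()
print(f"empirical false positive rate: {rate:.3f}")
```

The printed rate should hover around α; it will not match exactly because the number of trials is finite and the normal approximation is only approximate for click counts.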

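A balanced A/B test allocates equal traffic to each group until a sufficient sample size is reached. However, we cannot adjust the traffic allocation during the test based on what we observe. So the drawback of A/B testing is clear: even if the treatment group is clearly better than the control group, we still have to spend a lot of traffic on the control group before we reach statistical significance.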

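Multi-armed Bandit Experiments

While A/B testing is a frequentist approach, we can also run the test the Bayesian way. The motivation is easy to understand: once we see that one treatment is clearly better, we want to send more users to that treatment right away. Multi-armed bandit experiments make this possible.

The foundation of a multi-armed bandit experiment is Bayesian updating. Each treatment (called an "arm"; see the class definition below) has a probability of success, and each trial is modeled as a Bernoulli process. The probability of success is unknown and is modeled by a Beta distribution. As the experiment continues, each arm receives user traffic and its Beta distribution is updated accordingly.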

    import numpy as np


    class Arm(object):
        """Each arm's true click through rate is modeled by a beta distribution."""

        def __init__(self, idx, a=1, b=1):
            """Init with uniform prior."""
            self.idx = idx
            self.a = a
            self.b = b

        def record_success(self):
            self.a += 1

        def record_failure(self):
            self.b += 1

        def draw_ctr(self):
            # one random draw from the arm's current Beta(a, b) posterior
            return np.random.beta(self.a, self.b, 1)[0]

        def mean(self):
            # posterior mean of the click through rate
            return self.a / (self.a + self.b)
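The two `record_*` methods above carry out the conjugate Beta-Bernoulli update. As a short worked form, with $\theta$ denoting an arm's unknown click-through rate and $s$, $f$ the numbers of observed successes and failures (symbols introduced here for illustration only):

$$
\theta \sim \mathrm{Beta}(a, b), \qquad \theta \mid s, f \sim \mathrm{Beta}(a + s,\; b + f), \qquad \mathbb{E}[\theta \mid s, f] = \frac{a + s}{a + b + s + f}.
$$

For example, starting from the uniform prior $a = b = 1$, an arm with 3 clicks out of 10 impressions has posterior $\mathrm{Beta}(4, 8)$ and posterior mean $4/12 \approx 0.33$.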

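In this post I use the Google Analytics example of online advertising. Suppose there are K arms. Each arm is an ad whose click-through rate (ctr) follows a Beta distribution. The goal of the experiment is to find the ad with the highest click-through rate.

Thompson Sampling

In a nutshell, Thompson sampling is a greedy method applied to random posterior draws: it always picks the arm whose sampled reward is highest. In each iteration of the bandit experiment, Thompson sampling simply draws a sample ctr from each arm's Beta distribution and assigns the user to the arm with the highest sampled ctr.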

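Figure: Bayesian updating applied to each arm's Beta distribution.

The neat part of the bandit experiment is that Thompson sampling and Bayesian updating work hand in hand. If one arm performs well, its Beta distribution parameters are updated to remember this, and Thompson sampling becomes more likely to draw a high ctr from that arm. Throughout the experiment, well-performing arms are rewarded with more traffic, while poorly performing arms are penalized with less.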

    def thompson_sampling(arms):
        """Stochastic sampling: take one draw for each arm,
        divert traffic to best draw.

        @param arms list[Arm]: list of Arm objects
        @return idx int: index of winning arm from sample
        """
        sample_p = [arm.draw_ctr() for arm in arms]
        idx = np.argmax(sample_p)
        return idx
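To see how this rule shifts traffic over time, here is a minimal standalone sketch that uses only the standard library. The two click-through rates (5% and 15%), the number of users, and the function name `simulate_thompson` are hypothetical values chosen for illustration; they are deliberately far apart so the effect shows up quickly.

```python
import random


def simulate_thompson(true_ctrs, n_users=5000, seed=0):
    """Route users with Thompson sampling; return impressions per arm."""
    rng = random.Random(seed)
    # Beta(1, 1) uniform prior for every arm, stored as [a, b]
    params = [[1, 1] for _ in true_ctrs]
    traffic = [0 for _ in true_ctrs]
    for _ in range(n_users):
        # one posterior draw per arm; the user goes to the highest draw
        draws = [rng.betavariate(a, b) for a, b in params]
        idx = draws.index(max(draws))
        traffic[idx] += 1
        # simulate the click and apply the Bayesian update
        if rng.random() < true_ctrs[idx]:
            params[idx][0] += 1
        else:
            params[idx][1] += 1
    return traffic


traffic = simulate_thompson([0.05, 0.15])
print(traffic)
```

With a gap this large, the better arm should end up with the bulk of the impressions; with rates as close as those in the article, the shift takes far longer to appear.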

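Monte Carlo Simulation

Although the Beta distributions estimate the ctrs, we also need to know how confident we are in each estimate. If we are confident enough in the arm that currently has the highest ctr, we can end the experiment.

Figure: Monte Carlo simulation.

Monte Carlo simulation works by drawing random samples from the K arms many times and then computing empirically how often each arm wins (has the highest ctr). If the winning arm beats the others by a large enough margin, the experiment terminates.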

![1_il_pxHy3TBbTO2eB8saZ5g](Beyond AB Testing Multi-armed Bandit Experiments.assets/1_il_pxHy3TBbTO2eB8saZ5g.gif)

    def monte_carlo_simulation(arms, draw=100):
        """Monte Carlo simulation of thetas. Each arm's click through
        rate follows a beta distribution.

        Parameters
        ----------
        arms list[Arm]: list of Arm objects.
        draw int: number of draws in Monte Carlo simulation.

        Returns
        -------
        mc np.matrix: Monte Carlo matrix of dimension (draw, n_arms).
        p_winner list[float]: probability of each arm being the winner.
        """
        # Monte Carlo sampling: column j holds `draw` samples for arm j
        alphas = [arm.a for arm in arms]
        betas = [arm.b for arm in arms]
        mc = np.matrix(np.random.beta(alphas, betas, size=[draw, len(arms)]))

        # count frequency of each arm being winner
        counts = [0 for _ in arms]
        winner_idxs = np.asarray(mc.argmax(axis=1)).reshape(draw,)
        for idx in winner_idxs:
            counts[idx] += 1

        # divide by draw to approximate probability distribution
        p_winner = [count / draw for count in counts]
        return mc, p_winner

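Termination

Google Analytics introduces the concept of "value remaining in experiment". The value remaining is computed in each Monte Carlo draw. With α = 5%, the experiment terminates when the value remaining is less than 1% of the winning arm's value in 95% of the Monte Carlo samples.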
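Written out under notation assumed here (not taken from the original article): let $\theta_k^{(m)}$ be the ctr drawn for arm $k$ in Monte Carlo draw $m$, and let $k^*$ be the arm with the highest estimated probability of winning. One way to state the value remaining in draw $m$ is

$$
V^{(m)} = \frac{\max_k \theta_k^{(m)} - \theta_{k^*}^{(m)}}{\theta_{k^*}^{(m)}},
$$

and the stopping rule reads: terminate once the $(1 - \alpha)$ quantile of $V^{(1)}, \dots, V^{(M)}$ falls below $0.01$. Note that $V^{(m)} = 0$ whenever the current winner also has the highest draw, so the rule fires once the winner loses only rarely or only by a small margin.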

    def should_terminate(p_winner, est_ctrs, mc, alpha=0.05):
        """Decide whether experiment should terminate. When value remaining in
        experiment is less than 1% of the winning arm's click through rate.

        Parameters
        ----------
        p_winner list[float]: probability of each arm being the winner.
        est_ctrs list[float]: estimated click through rates.
        mc np.matrix: Monte Carlo matrix of dimension (draw, n_arms).
        alpha: controlling for type I error

        @returns bool: True if experiment should terminate.
        """
        winner_idx = np.argmax(p_winner)
        # per draw: how much better the best arm is than the current winner,
        # as a fraction of the winner's drawn ctr
        values_remaining = (mc.max(axis=1) - mc[:, winner_idx]) / mc[:, winner_idx]
        pctile = np.percentile(values_remaining, q=100 * (1 - alpha))
        # NOTE: values_remaining is already relative to the winner's ctr, so
        # scaling the 1% threshold by est_ctrs[winner_idx] makes this cutoff
        # stricter than a plain `pctile < 0.01`; kept as in the original article.
        return pctile < 0.01 * est_ctrs[winner_idx]

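Simulation

With the helper functions above defined, putting them together is straightforward. In each iteration, a new user arrives. We apply Thompson sampling to choose an arm and observe whether the user clicks. We then update the arm's Beta parameters and check whether we are confident enough in the winning arm to end the experiment.

Note that I introduce a burn-in parameter. This is the minimum number of iterations that must run before a winner can be declared. The beginning of the experiment is the most volatile period, when any losing arm may take the lead by chance. The burn-in period helps prevent the experiment from ending prematurely, before the noise settles down.

In practice, this also helps control for the novelty effect, cold start, and other confounders related to user psychology. Google Analytics forces all bandit experiments to run for at least two weeks.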

    def k_arm_bandit(ctrs, alpha=0.05, burn_in=1000, max_iter=100000, draw=100, silent=False):
        """Perform stochastic k-arm bandit test. Experiment is terminated when
        value remained in experiment drops below certain threshold.

        Parameters
        ----------
        ctrs list[float]: true click through rates for each arms.
        alpha float: terminate experiment when the (1 - alpha)th percentile
            of the remaining value is less than 1% of the winner's click through rate.
        burn_in int: minimum number of iterations.
        max_iter int: maximum number of iterations.
        draw int: number of rows in Monte Carlo simulation.
        silent bool: if True, do not print status at the end of experiment.

        Returns
        -------
        idx int: winner's index.
        est_ctrs list[float]: estimated click through rates.
        history_p list[list[float]]: history of p_winner for each arm.
        traffic list[int]: number of traffic in each arm.
        """
        n_arms = len(ctrs)
        arms = [Arm(idx=i) for i in range(n_arms)]
        history_p = [[] for _ in range(n_arms)]

        for i in range(max_iter):
            idx = thompson_sampling(arms)
            arm, ctr = arms[idx], ctrs[idx]

            # update arm's beta parameters
            if np.random.rand() < ctr:
                arm.record_success()
            else:
                arm.record_failure()

            # record current estimates of each arm being winner
            mc, p_winner = monte_carlo_simulation(arms, draw)
            for j, p in enumerate(p_winner):
                history_p[j].append(p)

            # record current estimates of each arm's ctr
            est_ctrs = [arm.mean() for arm in arms]

            # terminate when value remaining is negligible
            if i >= burn_in and should_terminate(p_winner, est_ctrs, mc, alpha):
                if not silent:
                    print("Terminated at iteration %i" % (i + 1))
                break

        # report the arm most likely to be best, not the last arm sampled
        winner_idx = int(np.argmax(p_winner))
        traffic = [arm.a + arm.b - 2 for arm in arms]
        return winner_idx, est_ctrs, history_p, traffic

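Advantages

The main advantage of a bandit experiment is that it terminates earlier than an A/B test because it needs a much smaller sample. In a two-armed experiment with click-through rates of 4% and 5%, a traditional A/B test at the 95% confidence level (α = 5%) needs 11,165 samples in each treatment group. With only 100 users a day, the experiment would take 223 days. In the bandit experiment, by contrast, the simulation ended after 31 days under the termination criterion described above.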
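The article does not say how the 11,165 figure was derived. As a rough cross-check, here is a minimal sketch of the textbook sample size formula for a two-sided two-proportion z-test. The 95% power setting is an assumption, chosen because it lands near the quoted number, and `samples_per_group` is a hypothetical helper name.

```python
import math
from statistics import NormalDist


def samples_per_group(p1, p2, alpha=0.05, power=0.95):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


n = samples_per_group(0.04, 0.05)
days = math.ceil(2 * n / 100)  # two groups sharing 100 users per day
print(n, days)
```

Under these assumptions the result comes out close to the article's figure; a different power setting or a pooled-variance version of the formula shifts it somewhat.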

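Figure: share of traffic sent to the losing arm each day (the "mistakes").

The second advantage of a bandit experiment is that it makes fewer mistakes than an A/B test. A balanced A/B test always sends 50% of the traffic to each group. The figure above shows that as the experiment progresses, less and less traffic is sent to the losing arm.

Here is a simulated experiment with a 5-armed bandit. In the first 150 iterations, the red arm (ctr 4.4%) is mistaken for the winner, and up to 80% of the traffic is diverted to this losing arm. But the true best arm, the blue one (ctr 4.8%), catches up and emerges as the real winner.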

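Trade-offs

There is no free lunch: the convenience of a smaller sample size comes at the cost of a higher false positive rate. Although I used an empirical α as the false positive rate to terminate the experiment, the observed false positive rate was still higher than α after repeating the simulation many times.

Figure: α vs. sample size (traffic) and the probability of finding the correct winner.

Empirically, an α of 5% finds the winning arm about 91% of the time rather than 95%. The smaller we set α, the larger the sample size we need (shown in red), which is consistent with how A/B testing behaves.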

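Conclusion

As it turns out, there is no clear winner. It is important for product managers, data scientists, and practitioners to understand the strengths and weaknesses of both approaches before choosing one. A multi-armed bandit test is preferable in the following situations: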

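- The cost of sending users to a losing arm is high. In the example here, matching a user with a bad ad simply results in less revenue, and that loss is affordable. In other cases, such as testing two different account recovery methods, every failure of an arm means permanently losing a user, and a multi-armed bandit experiment is clearly the better choice.
- For early-stage startups without enough user traffic, a multi-armed bandit experiment works better because it needs a smaller sample, terminates earlier, and is more agile than an A/B test.
- When there are more than two variants to test, a bandit experiment can help us find the winner quickly. A bandit experiment typically tests 4 to 8 variants at once, whereas an A/B test can only compare two groups at a time.

One obvious limitation of the multi-armed bandit test is that each arm must be modeled by a Beta distribution, which means every trial of an arm results in either a success or a failure. This works well for modeling click-through rates and conversion rates, but if you want to test which checkout process is faster, you have to run a t-test on the difference in means.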

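On the other hand, A/B testing is the better choice when the company has a large enough user base, when controlling false positives is important, and when there are only a few variants, so that each can be tested against the control group one at a time.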


Original English article: https://towardsdatascience.com/beyond-a-b-testing-multi-armed-bandit-experiments-1493f709f804
