Classic Machine Learning Series (14): PAC Learning

Contents

The PAC learning model
- Definition: generalization error
- Definition: empirical error
Learning axis-aligned rectangles
PAC learning
Guarantees for finite hypothesis sets — consistent case
Guarantees for finite hypothesis sets — inconsistent case
Generalities
Hoeffding's inequality
References

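When designing a learning algorithm, several questions arise:

How can learning be made more efficient?
Which problems are inherently hard to learn?
How many examples are needed to learn well?
Does the learned model generalize well?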

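The PAC learning model

These are exactly the questions the PAC framework sets out to answer. PAC stands for Probably Approximately Correct. Before introducing the framework, we need some notation:

$\mathcal{X}$: the set of all examples (or instances), sometimes also called the input space.
$\mathcal{Y}$: the set of all labels (or target values).
$c:\mathcal{X} \rightarrow \mathcal{Y}$: a concept. Because we consider binary classification, $c$ can be identified with the subset of $\mathcal{X}$ on which it takes the value 1.
$\mathcal{C}$: a concept class, i.e. the set of concepts we may wish to learn.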

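We first consider a binary classification problem and extend to more general settings later. Assume that the examples are independently and identically distributed (i.i.d.) according to a fixed but unknown distribution $\mathcal{D}$.

The learner receives a sample $S=(x_{1},\cdots,x_{m})$ drawn i.i.d. from $\mathcal{D}$ together with the labels $(c(x_{1}),\cdots,c(x_{m}))$. Based on this labeled sample, it selects a hypothesis $h_{S}$ from a hypothesis set $\mathcal{H}$ whose generalization error is as small as possible.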

Definition: generalization error

Given a hypothesis $h \in \mathcal{H}$, a target concept $c \in \mathcal{C}$, and an underlying distribution $\mathcal{D}$, the generalization error (or risk) of $h$ is defined as:

$$R(h)=\underset{x \sim \mathcal{D}}{\mathbb{P}}[h(x) \neq c(x)]=\underset{x \sim \mathcal{D}}{\mathbb{E}}\left[1_{h(x) \neq c(x)}\right]$$

The generalization error of a hypothesis therefore does not depend on the learner alone: it also depends on the distribution $\mathcal{D}$ and on the target concept $c$.

Definition: empirical error

Given a hypothesis $h \in \mathcal{H}$, a target concept $c \in \mathcal{C}$, and a sample $S=(x_{1},\cdots,x_{m})$, the empirical error (or empirical risk) of $h$ is defined as:

$$\widehat{R}_{S}(h) = \frac{1}{m} \sum_{i=1}^{m} 1_{h(x_{i}) \neq c(x_{i})}$$

The empirical error is thus the average error over the sample $S$, whereas the generalization error is the expected error under the distribution $\mathcal{D}$.
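As a minimal sketch of the definition, the empirical error is just the fraction of sample points on which the hypothesis and the target concept disagree. The function name `empirical_error` and the one-dimensional threshold example are illustrative assumptions, not part of the original text.

```python
from typing import Callable, Sequence, TypeVar

X = TypeVar("X")


def empirical_error(h: Callable[[X], int], c: Callable[[X], int],
                    sample: Sequence[X]) -> float:
    """Fraction of sample points on which hypothesis h disagrees with concept c."""
    if not sample:
        raise ValueError("sample must contain at least one point")
    mistakes = sum(1 for x in sample if h(x) != c(x))
    return mistakes / len(sample)


# Hypothetical 1-D example: the target labels x >= 0.5 as 1,
# while the hypothesis uses the threshold 0.7 instead.
target = lambda x: int(x >= 0.5)
hypothesis = lambda x: int(x >= 0.7)
sample = [0.1, 0.4, 0.55, 0.6, 0.9]

print(empirical_error(hypothesis, target, sample))  # 2 mistakes out of 5 -> 0.4
```

The generalization error cannot be computed this way, because it requires the unknown distribution $\mathcal{D}$ rather than a finite sample.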

For a fixed $h \in \mathcal{H}$, the expectation of the empirical error over i.i.d. samples equals the generalization error:

$$\underset{S \sim \mathcal{D}^{m}}{\mathbb{E}}\left[\widehat{R}_{S}(h)\right]=R(h)$$
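The identity follows from linearity of expectation and the i.i.d. assumption. The short derivation below spells out the steps using only the definitions above. It requires $h$ to be fixed in advance, that is, not chosen as a function of $S$.

```latex
\begin{aligned}
\underset{S \sim \mathcal{D}^{m}}{\mathbb{E}}\left[\widehat{R}_{S}(h)\right]
&= \frac{1}{m} \sum_{i=1}^{m} \underset{S \sim \mathcal{D}^{m}}{\mathbb{E}}\left[1_{h(x_{i}) \neq c(x_{i})}\right]
   && \text{(linearity of expectation)} \\
&= \frac{1}{m} \sum_{i=1}^{m} \underset{x \sim \mathcal{D}}{\mathbb{E}}\left[1_{h(x) \neq c(x)}\right]
   && \text{(each } x_{i} \text{ is distributed according to } \mathcal{D}\text{)} \\
&= \frac{1}{m} \cdot m \cdot R(h) = R(h)
\end{aligned}
```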

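Learning axis-aligned rectangles

Consider the case where the examples are points in the plane, $\mathcal{X} = \mathbb{R}^{2}$, and the concept class is the set of all axis-aligned rectangles in $\mathbb{R}^{2}$. Each concept $c$ is thus the set of points inside a particular axis-aligned rectangle: points inside are labeled 1 and points outside are labeled 0. The learning problem is to use the labeled training data to determine an axis-aligned rectangle with small error.

In the accompanying figure, $R$ denotes the target axis-aligned rectangle and $R^{\prime}$ is a hypothesis. The error of $R^{\prime}$ comes from two regions:

false negatives: points inside $R$ but not inside $R^{\prime}$, i.e. points whose true label is 1 but which $R^{\prime}$ labels 0.
false positives: points inside $R^{\prime}$ but not inside $R$, i.e. points whose true label is 0 but which $R^{\prime}$ labels 1.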

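Consider the following algorithm $\mathcal{A}$: given a labeled sample $S$, it returns the tightest axis-aligned rectangle $R^{\prime}=R_{S}$ that contains all the points labeled 1, as illustrated in Figure 2.2.

By construction, $R_{S}$ produces no false positives, since every point it contains lies inside $R$. The region on which $R_{S}$ errs is therefore contained in $R$.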
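The tightest-rectangle algorithm can be sketched in a few lines: take the minimum and maximum coordinates of the positively labeled points. The names `tightest_rectangle` and `predict`, the `(left, right, bottom, top)` representation, and the sample points are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (left, right, bottom, top)


def tightest_rectangle(points: List[Point], labels: List[int]) -> Optional[Rect]:
    """Return the smallest axis-aligned rectangle containing every point labeled 1.

    Returns None when the sample has no positive point (the empty hypothesis).
    """
    positives = [p for p, y in zip(points, labels) if y == 1]
    if not positives:
        return None
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    return (min(xs), max(xs), min(ys), max(ys))


def predict(rect: Optional[Rect], point: Point) -> int:
    """Label a point 1 if it lies inside the closed rectangle, else 0."""
    if rect is None:
        return 0
    left, right, bottom, top = rect
    x, y = point
    return int(left <= x <= right and bottom <= y <= top)


# Hypothetical sample labeled by the target rectangle [0, 2] x [0, 1].
points = [(0.5, 0.2), (1.5, 0.8), (1.0, 0.5), (3.0, 0.5), (-1.0, 2.0)]
labels = [1, 1, 1, 0, 0]

r_s = tightest_rectangle(points, labels)
print(r_s)  # (0.5, 1.5, 0.2, 0.8)
```

The returned rectangle is consistent with the sample: it contains every positive point, and since it lies inside the target rectangle it contains no negative point.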

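Let $R \in \mathcal{C}$ be the target concept and fix $\varepsilon >0$. Let $\mathbb{P}[R]$ denote the probability that a point drawn at random from $\mathcal{D}$ falls inside $R$. Since the errors made by the algorithm can only come from points that fall inside $R$, we have $R(R_{S}) \leq \mathbb{P}[R]$. The case $\mathbb{P}[R] \leq \varepsilon$ is then trivial, so we assume $\mathbb{P}[R] > \varepsilon$.

Given $\mathbb{P}[R] > \varepsilon$, define four rectangular regions $r_{1},r_{2},r_{3},r_{4}$ along the sides of $R$, each with probability at least $\varepsilon /4$. Each region is constructed by starting from the whole rectangle $R$ and moving one side inward as far as possible while keeping the distribution mass at least $\varepsilon /4$, i.e. $\mathbb{P}[r_{i}] \geq \varepsilon /4$ (see the corresponding figure).

Let $l$, $r$, $b$ and $t$ be the four real numbers defining $R$: $R=[l,r] \times [b,t]$. The left region $r_{4}$, for example, is defined as $r_{4}=[l,s_{4}] \times [b,t]$, where $s_{4}=\inf \{ s: \mathbb{P}[[l,s]\times[b,t]] \geq \varepsilon/4 \}$. By this choice of $s_{4}$, the region $\bar{r}_{4}=[l,s_{4}) \times [b,t]$, obtained from $r_{4}$ by excluding its rightmost side, has probability at most $\varepsilon/4$. The same holds for the other three regions.

If $R_{S}$ meets all four regions $r_{i}$, then, because it is a rectangle, it has one side in each of these regions. Its error region, the part of $R$ that it does not cover, is then contained in the union of the regions $\bar{r}_{i}$ and has probability at most $\varepsilon$, i.e. $R(R_{S})\leq \varepsilon$. By contraposition, if $R(R_{S}) > \varepsilon$, then $R_{S}$ must miss at least one region $r_{i}$. This can be written as: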

$$\begin{aligned}
\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[R(R_{S})>\varepsilon\right]
& \leq \underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[\cup_{i=1}^{4}\left\{R_{S} \cap r_{i}=\emptyset\right\}\right] \\
& \leq \sum_{i=1}^{4} \underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[\left\{R_{S} \cap r_{i}=\emptyset\right\}\right] \quad \text{(by the union bound)} \\
& \leq 4(1-\varepsilon / 4)^{m} \quad \left(\text{since } \mathbb{P}\left[r_{i}\right] \geq \varepsilon / 4\right) \\
& \leq 4 \exp (-m \varepsilon / 4)
\end{aligned}$$

The event $R_{S} \cap r_{i}=\emptyset$ means that no sample point falls in the region $r_{i}$, whether in its interior or on its boundary, which happens with probability $(1-\mathbb{P}[r_{i}])^{m} \leq (1-\varepsilon/4)^{m}$. The last step of the derivation uses the general inequality $1-x \leq e^{-x}$. For any $\delta >0$, to ensure $\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[R(R_{S})>\varepsilon\right] \leq \delta$, it suffices to have $4 \exp (-m \varepsilon / 4) \leq \delta$, that is, $m \geq \frac{4}{\varepsilon}\log\frac{4}{\delta}$.

Thus there exists an algorithm such that, for any $\varepsilon >0$ and $\delta >0$, if the sample size satisfies $m \geq \frac{4}{\varepsilon}\log\frac{4}{\delta}$, then $\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[R(R_{S})>\varepsilon\right] \leq \delta$. The class of axis-aligned rectangles is therefore PAC-learnable, with sample complexity $O(\frac{1}{\varepsilon}\log\frac{1}{\delta})$.

Equivalently, setting $\delta$ equal to the upper bound $4 \exp (-m \varepsilon / 4)$ and solving for $\varepsilon$ gives $\varepsilon=\frac{4}{m}\log\frac{4}{\delta}$. Hence, with probability at least $1-\delta$, the error satisfies $R(R_{S}) \leq \frac{4}{m}\log\frac{4}{\delta}$.
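To get a feel for the numbers, both directions of the bound can be evaluated directly. This is a minimal sketch: the function names are illustrative assumptions, and $\log$ is taken to be the natural logarithm, as the inequality $1-x \leq e^{-x}$ requires.

```python
import math


def rectangle_sample_size(epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (4 / epsilon) * ln(4 / delta)."""
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(4.0 / epsilon * math.log(4.0 / delta))


def rectangle_error_bound(m: int, delta: float) -> float:
    """Error bound (4 / m) * ln(4 / delta), holding with probability >= 1 - delta."""
    if m <= 0 or not 0 < delta < 1:
        raise ValueError("need m > 0 and delta in (0, 1)")
    return 4.0 / m * math.log(4.0 / delta)


m = rectangle_sample_size(epsilon=0.1, delta=0.05)
print(m)                               # 176
print(rectangle_error_bound(m, 0.05))  # slightly below 0.1
```

Halving $\varepsilon$ roughly doubles the required sample size, whereas shrinking $\delta$ only adds a logarithmic term.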

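Note that the hypothesis set $\mathcal{H}$ here is infinite. The proof of PAC-learnability above relies crucially on the geometry of rectangles, so the argument does not readily carry over to other concept classes. The following sections derive more general guarantees for finite hypothesis sets.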

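PAC learning

Before turning to those proofs, we first state the definition of PAC learning:

A concept class $\mathcal{C}$ is said to be PAC-learnable if there exist an algorithm $\mathcal{A}$ and a polynomial function $poly(\cdot,\cdot,\cdot,\cdot)$ such that, for any $\varepsilon >0$ and $\delta >0$, for all distributions $\mathcal{D}$ on $\mathcal{X}$ and for any target concept $c \in \mathcal{C}$, the following holds for any sample size $m \geq poly(1/\varepsilon,1/\delta,n,size(c))$:

$$\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[R(h_{S}) \leq \varepsilon\right] \geq 1-\delta$$

Here $h_{S}$ is the hypothesis returned by $\mathcal{A}$ on the sample $S$, $n$ reflects the cost of representing an example $x \in \mathcal{X}$, and $size(c)$ reflects the cost of representing the concept $c$. In other words, with high probability (at least $1-\delta$) the returned hypothesis is approximately correct (its error is at most $\varepsilon$).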

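Some key points:

The PAC framework is distribution-free: it makes no assumption about the distribution $\mathcal{D}$.
The training sample and the examples used to define the error are drawn from the same distribution $\mathcal{D}$.
The PAC framework deals with the learnability of a concept class $\mathcal{C}$, which is known to the algorithm, rather than of a particular target concept $c \in \mathcal{C}$, which is unknown.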

Guarantees for finite hypothesis sets — consistent case

In the axis-aligned rectangles example, the hypothesis $h_{S}$ returned by the algorithm is always consistent, meaning that it makes no error on the training sample $S$. This section proves a generalization bound for consistent hypotheses, under the assumption that the target concept $c$ belongs to $\mathcal{H}$.

Theorem (learning bound, finite $\mathcal{H}$, consistent case): Let $\mathcal{H}$ be a finite set of functions mapping $\mathcal{X}$ to $\mathcal{Y}$. Let $\mathcal{A}$ be an algorithm that, for any target concept $c \in \mathcal{H}$ and any i.i.d. sample $S$, returns a consistent hypothesis $h_{S}$: $\widehat{R}_{S}(h_{S})=0$. Then, for any $\varepsilon , \delta >0$, the inequality $\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[R(h_{S}) \leq \varepsilon\right] \geq 1-\delta$ holds if:

$$m \geq \frac{1}{\varepsilon} \left(\log|\mathcal{H}|+\log\frac{1}{\delta}\right)$$

Stated as a generalization bound: for any $\varepsilon,\delta >0$, with probability at least $1-\delta$:

$$R(h_{S}) \leq \frac{1}{m} \left(\log|\mathcal{H}|+\log\frac{1}{\delta}\right)$$
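The theorem can be turned into a small calculator. The sketch below assumes natural logarithms. The function names and the values $|\mathcal{H}|=1000$, $\varepsilon=0.1$, $\delta=0.05$ are hypothetical and chosen only for illustration.

```python
import math


def consistent_sample_size(h_size: int, epsilon: float, delta: float) -> int:
    """Smallest integer m with m >= (ln|H| + ln(1/delta)) / epsilon."""
    if h_size < 1 or not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("need |H| >= 1 and epsilon, delta in (0, 1)")
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)


def consistent_error_bound(h_size: int, m: int, delta: float) -> float:
    """Bound (ln|H| + ln(1/delta)) / m on R(h_S), holding w.p. >= 1 - delta."""
    if h_size < 1 or m <= 0 or not 0 < delta < 1:
        raise ValueError("need |H| >= 1, m > 0 and delta in (0, 1)")
    return (math.log(h_size) + math.log(1.0 / delta)) / m


# Hypothetical numbers: |H| = 1000, epsilon = 0.1, delta = 0.05.
m_needed = consistent_sample_size(1000, 0.1, 0.05)
print(m_needed)                                # 100
print(consistent_sample_size(10**6, 0.1, 0.05))  # 169
```

Squaring the size of the hypothesis set (from $10^{3}$ to $10^{6}$) raises the required sample size from 100 to only 169, which reflects the logarithmic dependence on $|\mathcal{H}|$.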

Proof: Fix $\varepsilon >0$ and define $\mathcal{H}_{\varepsilon}=\{h \in \mathcal{H}: R(h) \geq \varepsilon\}$. Recall that the generalization error is the probability of misclassifying a randomly drawn example, so a hypothesis $h \in \mathcal{H}_{\varepsilon}$ classifies such an example correctly with probability at most $1-\varepsilon$. The probability that $h$ is consistent with an i.i.d. training sample $S$ of size $m$, i.e. that it errs on none of the sample points, is therefore bounded by:

$$\mathbb{P}\left[\widehat{R}_{S}(h)=0\right] \leq (1-\varepsilon)^{m}$$

By the union bound:

$$\begin{aligned}
\mathbb{P}\left[\exists h \in \mathcal{H}_{\varepsilon}: \widehat{R}_{S}(h)=0\right]
& =\mathbb{P}\left[\widehat{R}_{S}\left(h_{1}\right)=0 \vee \cdots \vee \widehat{R}_{S}\left(h_{\left|\mathcal{H}_{\varepsilon}\right|}\right)=0\right] \\
& \leq \sum_{h \in \mathcal{H}_{\varepsilon}} \mathbb{P}\left[\widehat{R}_{S}(h)=0\right] \quad \text{(union bound)} \\
& \leq \sum_{h \in \mathcal{H}_{\varepsilon}}(1-\varepsilon)^{m} \leq|\mathcal{H}|(1-\varepsilon)^{m} \leq|\mathcal{H}| e^{-m \varepsilon}
\end{aligned}$$

Moreover, since $h_{S}$ is consistent, the event $R(h_{S})>\varepsilon$ implies that some $h \in \mathcal{H}_{\varepsilon}$ is consistent with $S$ (whenever the event on the left occurs, so does the one on the right), which gives:

$$\mathbb{P}\left[R\left(h_{S}\right)>\varepsilon\right] \leq \mathbb{P}\left[\exists h \in \mathcal{H}_{\varepsilon}: \widehat{R}_{S}(h)=0\right]$$

Setting the resulting upper bound $|\mathcal{H}| e^{-m \varepsilon}$ equal to $\delta$ and solving completes the proof.
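For completeness, the final algebraic step can be written out as follows. This only rearranges the bound $|\mathcal{H}| e^{-m \varepsilon}$ and introduces no new assumption.

```latex
\begin{aligned}
|\mathcal{H}|\, e^{-m \varepsilon} \leq \delta
&\iff -m \varepsilon \leq \log \delta - \log |\mathcal{H}| \\
&\iff m \geq \frac{1}{\varepsilon}\left(\log |\mathcal{H}| + \log \frac{1}{\delta}\right), \\[4pt]
|\mathcal{H}|\, e^{-m \varepsilon} = \delta
&\iff \varepsilon = \frac{1}{m}\left(\log |\mathcal{H}| + \log \frac{1}{\delta}\right).
\end{aligned}
```

The first equivalence gives the sample-size condition of the theorem, and the second gives the generalization bound.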

Guarantees for finite hypothesis sets — inconsistent case

The previous section assumed that a consistent hypothesis can be found, but in most practical cases this is not possible. In this more general setting, Hoeffding's inequality can be used to relate the generalization error to the empirical error.

Corollary: Fix $\varepsilon >0$. Then, for any hypothesis $h: \mathcal{X} \rightarrow \{0,1\}$, the following inequalities hold:

$$\begin{array}{l}
\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[\widehat{R}_{S}(h)-R(h) \geq \varepsilon\right] \leq \exp \left(-2 m \varepsilon^{2}\right) \\
\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[\widehat{R}_{S}(h)-R(h) \leq-\varepsilon\right] \leq \exp \left(-2 m \varepsilon^{2}\right)
\end{array}$$

Combining the two by the union bound gives:

$$\underset{S \sim \mathcal{D}^{m}}{\mathbb{P}}\left[\left|\widehat{R}_{S}(h)-R(h)\right| \geq \varepsilon\right] \leq 2 \exp \left(-2 m \varepsilon^{2}\right)$$
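The two-sided bound can be checked empirically with a small simulation. This is a sketch under stated assumptions: a fixed hypothesis with a hypothetical true error of 0.3 is modeled by drawing independent Bernoulli mistake indicators, and the function names, the seed and the trial count are illustrative choices.

```python
import math
import random


def deviation_frequency(true_error: float, m: int, epsilon: float,
                        trials: int, seed: int = 0) -> float:
    """Estimate P[|empirical error - true error| >= epsilon] by simulation.

    Each trial draws m independent Bernoulli(true_error) mistake indicators,
    which is how a fixed hypothesis behaves on an i.i.d. sample of size m.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mistakes = sum(rng.random() < true_error for _ in range(m))
        if abs(mistakes / m - true_error) >= epsilon:
            hits += 1
    return hits / trials


def hoeffding_two_sided(m: int, epsilon: float) -> float:
    """Two-sided bound 2 * exp(-2 * m * epsilon^2)."""
    return 2.0 * math.exp(-2.0 * m * epsilon ** 2)


# Hypothetical setting: R(h) = 0.3, m = 100, epsilon = 0.1.
observed = deviation_frequency(true_error=0.3, m=100, epsilon=0.1, trials=2000)
bound = hoeffding_two_sided(m=100, epsilon=0.1)
print(observed, bound)  # the observed frequency should stay below the bound
```

The bound does not depend on the true error or on the distribution, so it is typically loose for any particular setting. That is the price of being distribution-free.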

Theorem (learning bound, finite $\mathcal{H}$, inconsistent case): Let $\mathcal{H}$ be a finite hypothesis set. Then, for any $\delta >0$, with probability at least $1-\delta$, the following inequality holds:

$$\forall h \in \mathcal{H}, \quad R(h) \leq \widehat{R}_{S}(h)+\sqrt{\frac{\log |\mathcal{H}|+\log \frac{2}{\delta}}{2 m}}$$

Proof: Let $h_{1},\cdots,h_{|\mathcal{H}|}$ be the elements of $\mathcal{H}$. Applying the corollary to each of them and combining via the union bound gives:

$$\begin{aligned}
&\mathbb{P}\left[\exists h \in \mathcal{H}: \left|\widehat{R}_{S}(h)-R(h)\right|>\varepsilon\right] \\
&=\mathbb{P}\left[\left(\left|\widehat{R}_{S}\left(h_{1}\right)-R\left(h_{1}\right)\right|>\varepsilon\right) \vee \ldots \vee\left(\left|\widehat{R}_{S}\left(h_{|\mathcal{H}|}\right)-R\left(h_{|\mathcal{H}|}\right)\right|>\varepsilon\right)\right] \\
& \leq \sum_{h \in \mathcal{H}} \mathbb{P}\left[\left|\widehat{R}_{S}(h)-R(h)\right|>\varepsilon\right] \\
& \leq 2|\mathcal{H}| \exp \left(-2 m \varepsilon^{2}\right)
\end{aligned}$$

Setting the right-hand side equal to $\delta$ and solving for $\varepsilon$ completes the proof. The bound also holds in the consistent case, where it is looser than the previous one: it decreases as $O(1/\sqrt{m})$ rather than $O(1/m)$. It makes explicit the relationship between the size of the hypothesis set, the sample size, and the error.
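The difference between the two rates is easy to see numerically. The sketch below evaluates the square-root term of the inconsistent-case theorem alongside the consistent-case bound. The function names and the values $|\mathcal{H}|=1000$, $\delta=0.05$ are hypothetical, and logarithms are natural.

```python
import math


def inconsistent_gap_bound(h_size: int, m: int, delta: float) -> float:
    """sqrt((ln|H| + ln(2/delta)) / (2m)): bound on R(h) minus empirical error."""
    if h_size < 1 or m <= 0 or not 0 < delta < 1:
        raise ValueError("need |H| >= 1, m > 0 and delta in (0, 1)")
    return math.sqrt((math.log(h_size) + math.log(2.0 / delta)) / (2.0 * m))


def consistent_bound(h_size: int, m: int, delta: float) -> float:
    """(ln|H| + ln(1/delta)) / m: bound on R(h_S) for a consistent h_S."""
    if h_size < 1 or m <= 0 or not 0 < delta < 1:
        raise ValueError("need |H| >= 1, m > 0 and delta in (0, 1)")
    return (math.log(h_size) + math.log(1.0 / delta)) / m


# Hypothetical numbers: |H| = 1000, delta = 0.05, growing sample size.
for m in (100, 1000, 10000):
    print(m, inconsistent_gap_bound(1000, m, 0.05), consistent_bound(1000, m, 0.05))

gap = inconsistent_gap_bound(1000, 1000, 0.05)
```

Multiplying the sample size by four halves the square-root term but divides the consistent-case bound by four.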

Generalities

In the most general scenario, the label is a probabilistic function of the input rather than a deterministic one, so $\mathcal{D}$ is a distribution over $\mathcal{X} \times \mathcal{Y}$. For example, when predicting a person's gender from height and weight, the same input can correspond to either label. In this stochastic scenario, the label of a given input follows a probability distribution.

Definition: agnostic PAC learning

If the label is uniquely determined by some function $f: \mathcal{X} \rightarrow \mathcal{Y}$, the scenario is called deterministic, and there exists a target function with zero generalization error, $R(h)=0$. In the stochastic scenario, there may be no hypothesis with zero error:

Definition (Bayes error): Given a distribution $\mathcal{D}$, the Bayes error $R^{*}$ is defined as the infimum of the errors achieved by measurable functions $h:\mathcal{X} \rightarrow \mathcal{Y}$:

$$R^{*}=\inf_{\substack{h \\ h \text{ measurable}}} R(h)$$

A hypothesis $h$ with $R(h)=R^{*}$ is called a Bayes hypothesis or Bayes classifier. By definition, $R^{*}=0$ in the deterministic case, whereas in the stochastic case $R^{*} \neq 0$.

The Bayes classifier $h_{Bayes}$ can be defined in terms of the conditional probabilities:

$$\forall x \in \mathcal{X}, \quad h_{\text{Bayes}}(x)=\underset{y \in\{0,1\}}{\operatorname{argmax}}\ \mathbb{P}[y \mid x]$$

The average error made by $h_{Bayes}$ on $x \in \mathcal{X}$ is thus $\min \{\mathbb{P}[0 \mid x],\mathbb{P}[1 \mid x]\}$, and this is the minimum possible error. This quantity is known as the noise at $x$.
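For a finite input space, the Bayes classifier and its error can be computed directly from the conditional probabilities. The sketch below assumes that $\mathbb{P}[1 \mid x]$ and the marginal probability of each $x$ are known, which is never the case in practice. The function names and the two-point distribution are hypothetical. It uses the standard fact that the Bayes error is the expected value of $\min \{\mathbb{P}[0 \mid x],\mathbb{P}[1 \mid x]\}$.

```python
from typing import Dict, Hashable


def bayes_classifier(p_one_given_x: float) -> int:
    """Return argmax over y in {0, 1} of P[y | x], given P[1 | x] (ties go to 0)."""
    if not 0.0 <= p_one_given_x <= 1.0:
        raise ValueError("P[1 | x] must lie in [0, 1]")
    return int(p_one_given_x > 0.5)


def noise(p_one_given_x: float) -> float:
    """min{P[0 | x], P[1 | x]}: the error of the Bayes classifier at x."""
    if not 0.0 <= p_one_given_x <= 1.0:
        raise ValueError("P[1 | x] must lie in [0, 1]")
    return min(p_one_given_x, 1.0 - p_one_given_x)


def bayes_error(marginal: Dict[Hashable, float],
                p_one: Dict[Hashable, float]) -> float:
    """Expected noise over a finite input space with marginal P[x]."""
    return sum(marginal[x] * noise(p_one[x]) for x in marginal)


# Hypothetical two-point input space.
marginal = {"a": 0.5, "b": 0.5}
p_one = {"a": 0.9, "b": 0.2}

print({x: bayes_classifier(p) for x, p in p_one.items()})  # {'a': 1, 'b': 0}
print(bayes_error(marginal, p_one))                        # approximately 0.15
```

In the deterministic case every $\mathbb{P}[1 \mid x]$ is 0 or 1, so the noise vanishes everywhere and the Bayes error is 0.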

Hoeffding's inequality

Hoeffding's inequality applies to bounded random variables. Let $X_{1},\dots ,X_{n}$ be independent random variables, and assume that each $X_{i}$, $1\leq i\leq n$, is almost surely bounded, i.e. $\mathbb{P}(X_{i}\in [a_{i},b_{i}])=1$.

Then the empirical mean of these $n$ random variables,

$$\overline{X}=\frac{X_{1}+\cdots +X_{n}}{n}$$

satisfies the following inequalities:

$$\mathbb{P}(\overline{X}-\mathbb{E}[\overline{X}]\geq t)\leq \exp \left(-\frac{2t^{2}n^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\right)$$

$$\mathbb{P}(|\overline{X}-\mathbb{E}[\overline{X}]|\geq t)\leq 2\exp \left(-\frac{2t^{2}n^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\right)$$
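The general form of the bound is straightforward to evaluate. In the sketch below, `hoeffding_bound` is an illustrative name, and the `ranges` argument holds the pairs $(a_{i}, b_{i})$, each assumed to satisfy $b_{i} > a_{i}$.

```python
import math
from typing import Sequence, Tuple


def hoeffding_bound(t: float, ranges: Sequence[Tuple[float, float]],
                    two_sided: bool = False) -> float:
    """exp(-2 t^2 n^2 / sum_i (b_i - a_i)^2), doubled for the two-sided version."""
    if t <= 0 or not ranges:
        raise ValueError("need t > 0 and at least one variable")
    if any(b <= a for a, b in ranges):
        raise ValueError("each range must satisfy b > a")
    n = len(ranges)
    width_sq = sum((b - a) ** 2 for a, b in ranges)
    bound = math.exp(-2.0 * t ** 2 * n ** 2 / width_sq)
    return 2.0 * bound if two_sided else bound


# 100 variables bounded in [0, 1], deviation t = 0.1: the exponent is -2.
one_sided = hoeffding_bound(0.1, [(0.0, 1.0)] * 100)
print(one_sided)  # exp(-2), about 0.135
```

When every $X_{i}$ takes values in $[0,1]$, the sum in the denominator equals $n$ and the bound reduces to $\exp(-2nt^{2})$. With $X_{i}$ the error indicator of a fixed hypothesis, $n=m$ and $t=\varepsilon$, this is the corollary used in the inconsistent case.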

A proof can be found at the following link:
http://web.eecs.umich.edu/~cscott/past_courses/eecs598w14/notes/03_hoeffding.pdf

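Interpretation

Hoeffding's inequality is often applied in the important special case of independent, identically distributed Bernoulli random variables, which is why it appears so frequently in computer science and combinatorics. Consider a coin that lands on side $A$ with probability $p$ and on side $B$ with probability $1-p$. If we toss the coin $n$ times, the expected number of times side $A$ comes up is $np$. Moreover, the probability that side $A$ comes up at most $k$ times is given exactly by:

$$\mathbb{P}(H(n) \leq k)=\sum_{i=0}^{k}\binom{n}{i} p^{i}(1-p)^{n-i}$$

where $H(n)$ is the number of times side $A$ comes up in $n$ tosses.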
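The exact binomial tail can be compared with what Hoeffding's inequality gives for the same event. The sketch below applies the one-sided inequality with $a_{i}=0$, $b_{i}=1$ and $t = p - k/n$, which requires $k \leq np$. The function names are illustrative assumptions, and `math.comb` needs Python 3.8 or later.

```python
import math


def binomial_cdf(n: int, p: float, k: int) -> float:
    """Exact P(H(n) <= k) for n tosses with P(side A) = p."""
    if not (0 <= k <= n and 0.0 <= p <= 1.0):
        raise ValueError("need 0 <= k <= n and p in [0, 1]")
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))


def hoeffding_lower_tail(n: int, p: float, k: int) -> float:
    """Hoeffding bound exp(-2 n (p - k/n)^2) on P(H(n) <= k), for k <= n * p."""
    if k > n * p:
        raise ValueError("the bound applies only for k <= n * p")
    gap = p - k / n
    return math.exp(-2.0 * n * gap ** 2)


# Fair coin, 10 tosses, at most 2 occurrences of side A.
exact = binomial_cdf(10, 0.5, 2)
bound = hoeffding_lower_tail(10, 0.5, 2)
print(exact)  # 0.0546875
print(bound)  # exp(-1.8), about 0.165
```

The exact sum needs $k+1$ terms, while the Hoeffding bound is a closed-form expression that decays exponentially in $n$ for a fixed gap $p - k/n$.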

https://en.wikipedia.org/wiki/Hoeffding%27s_inequality

References

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2018). Foundations of Machine Learning. MIT Press.
https://zhuanlan.zhihu.com/p/66799567