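While reading papers recently I came across post hoc analysis and the Bonferroni correction and did not know what they were, so I studied them and wrote up these notes for future reference.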

Planned Contrasts and Post hoc Tests & Multiple Testing Correction

  • Analysis of Variance
  • Planned Contrasts and Post hoc Tests
    • Planned contrasts
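      • Concept
      • Calculation steps
      • multiple planned contrasts
    • Post hoc tests
      • Concept
      • multiple post hoc tests
      • Calculation steps
  • Multiple Testing Correction
    • Hypothesis Testing
    • Multiple Hypothesis Testing, FWER, and FDR
    • Multiple testing correction
    • FWER and FDR correction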
      • Family-wise error rate(FWER)——Bonferroni correction
      • False discovery rate(FDR)——Benjamini–Hochberg procedure
        • False discovery rate
        • Benjamini–Hochberg procedure
  • References

Analysis of Variance

Some notes on Analysis of Variance are also recorded here: Normal Distribution & Chi-squared Distribution & t distribution & F-distribution


A one-way ANOVA includes one factor, whereas a two-way ANOVA includes two factors.

The term factor is used to designate a nominal variable, or in the case of an experimental design, the independent variable, that designates the groups being compared. If we have a drug trial in which we are comparing the mean pain scores of patients after receiving placebo, a low dose of the drug, or a high dose of the drug, the factor would be “drug dose.”

The term levels refers to the individual conditions or values that make up a factor. In our drug trial example, we have three levels of drug dose: placebo, low dose, and high dose.

So how is this ANOVA thing different from the t-tests we already learned? Well, in fact, you can think of it as an extension of the t-test to more than 2 groups. If you run an ANOVA on just 2 groups, the results are equivalent to the t-test. The only difference is that you get an F-value instead of a t-value.
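To make the equivalence concrete, here is a minimal sketch that computes both statistics by hand for two groups and checks that F equals t squared. The pain scores are made up for illustration, and `sum_sq` is a hypothetical helper name.

```python
from statistics import mean


def sum_sq(xs):
    """Sum of squared deviations of xs from their own mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


# Hypothetical pain scores for two groups (illustrative numbers only).
placebo = [6.1, 5.8, 7.0, 6.4, 5.9, 6.7]
low_dose = [5.2, 4.9, 5.8, 5.5, 4.7, 5.1]

n1, n2 = len(placebo), len(low_dose)
df_within = n1 + n2 - 2

# Pooled variance of the t-test is the same quantity as the ANOVA variance within.
ms_within = (sum_sq(placebo) + sum_sq(low_dose)) / df_within

# Independent-means t statistic (pooled variance).
t_stat = (mean(placebo) - mean(low_dose)) / (ms_within * (1 / n1 + 1 / n2)) ** 0.5

# One-way ANOVA F statistic with two groups (degrees of freedom between = 1).
grand_mean = mean(placebo + low_dose)
ss_between = n1 * (mean(placebo) - grand_mean) ** 2 + n2 * (mean(low_dose) - grand_mean) ** 2
f_stat = (ss_between / 1) / ms_within

print(f"t = {t_stat:.4f}, t squared = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

The two printed values agree because, with two groups, the variance between reduces algebraically to the squared mean difference scaled by the group sizes.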


Planned Contrasts and Post hoc Tests

Planned contrasts and post-hoc tests are commonly performed following Analysis of Variance.

This is necessary in many instances, because ANOVA compares all individual mean differences simultaneously, in one test (referred to as an omnibus test).

If we run an ANOVA hypothesis test, and the F-test comes out significant, this indicates that at least one among the mean differences is statistically significant.

However, when the factor has more than two levels, it does not indicate which means differ significantly from each other.

In this example, a significant F-test result from a one-way ANOVA with the three drug dose conditions does not tell us where the significant difference lies.
Is it between 0 and 100 mg? Or between 100 and 200 mg? Or is it only the biggest difference that is significant – 0 vs. 200 mg?

Planned contrasts and post hoc tests are additional tests to determine exactly which mean differences are significant, and which are not.

Why is it that we cannot just do 3 independent-means t-tests here? Each time we conduct a t-test we have a certain risk of a Type I error. If we do 3, we have roughly triple the risk of making at least one.
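The inflation can be quantified with a short sketch. It assumes the tests are independent, which pairwise t-tests on shared groups are not exactly, so the numbers are a guide rather than an exact figure. `familywise_error_rate` is a hypothetical helper name.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 3, 10):
    rate = familywise_error_rate(0.05, n_tests)
    # n_tests * alpha is the simple upper bound on the same probability.
    print(f"{n_tests:2d} tests: P(at least one Type I error) = {rate:.4f}, "
          f"upper bound = {n_tests * 0.05:.2f}")
```

With three tests at a significance level of 0.05 the risk is about 0.14, a little under the simple bound of 0.15.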

So first we test for omnibus significance using the overall ANOVA as detailed in the first part of this chapter.
Then, if a statistically significant difference exists among the means, we do the pairwise comparisons with an adjustment to be more conservative.

These follow-up tests are designed specifically to avoid inflating risk of Type I error.

Now, this is very important. We are only allowed to conduct these tests if the F-test result was significant.

Planned contrasts


Planned contrasts are used when researchers know in advance which groups they expect to differ.

For example, suppose from our worksheet example, we expect the pop group to differ from the classical group on our measure of working memory. We can then conduct a single comparison between these means without worrying about Type I error.

Because we hypothesized this difference before we saw the data, perhaps based on prior research studies or a strong intuitive hunch, and because there is only one comparison to be analyzed, we need not be concerned about inflated experimentwise alpha.
If multiple comparisons are planned, then we will need to adjust the significance level.


Let us take a look at how to conduct a single planned contrast. The process is quite simple, as it is just a modified ANOVA analysis.

First we calculate SSB with just those two groups involved in the planned contrast. We figure out the degrees of freedom between using just the two groups.

Then, we calculate the variance between using the new SSB and degrees of freedom, and we calculate an F-test for the comparison using the new variance between and the original overall variance within.

To find out if the F-test result is significant, we can use the new degrees of freedom but the original significance level for the cutoff. (Because there is just one pairwise comparison, we can use original significance level.)
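The steps above can be sketched in Python. This is an illustration under stated assumptions, not the worksheet's own calculation: the three groups and their scores are made up, `sum_sq` and `planned_contrast_f` are hypothetical helper names, and the cutoff 4.75 is the value from a standard F table for a significance level of 0.05 with 1 and 12 degrees of freedom.

```python
from statistics import mean


def sum_sq(xs):
    """Sum of squared deviations of xs from their own mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def planned_contrast_f(group_a, group_b, all_groups):
    """F for one planned contrast: new variance between / original variance within."""
    # Original variance within, taken from the overall ANOVA on every group.
    n_total = sum(len(g) for g in all_groups)
    df_within = n_total - len(all_groups)
    ms_within = sum(sum_sq(g) for g in all_groups) / df_within

    # SSB using only the two contrasted groups, around their combined mean.
    grand_mean = mean(list(group_a) + list(group_b))
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in (group_a, group_b))
    df_between = 2 - 1  # two groups in the contrast
    ms_between = ss_between / df_between
    return ms_between / ms_within, df_between, df_within


# Hypothetical working-memory scores (illustrative numbers only).
pop = [4, 5, 6, 5, 5]
classical = [8, 7, 9, 8, 8]
silence = [6, 7, 6, 7, 6]

F_CUTOFF = 4.75  # assumed: tabled F, alpha = 0.05, df = (1, 12)

f_value, df_b, df_w = planned_contrast_f(pop, classical, [pop, classical, silence])
print(f"F({df_b}, {df_w}) = {f_value:.2f}, significant: {f_value > F_CUTOFF}")
```

Note that the third group still matters even though it is not part of the contrast, because it contributes to the variance within.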

multiple planned contrasts

If we were to perform multiple planned contrasts, things change a little.

Suppose we had hypothesized in this experiment that each group would differ from the others?

The Bonferroni correction involves adjusting the significance level to protect from the inflation of risk of Type I error.

The procedure for each comparison is the same as for a single planned contrast. The difference is that the cutoff score to determine statistical significance will use a more conservative significance level.

When we do multiple pairwise comparisons, the Bonferroni correction is to use the original significance level divided by number of planned contrasts.

Post hoc tests


What about post hoc tests?

As the name suggests, these tests come into the picture when we are doing pairwise comparisons (usually all possible combinations) after the fact to find out where the significant differences were.

These are tests that do not require that we had an a priori hypothesis ahead of data collection.

Essentially, these are an allowable and acceptable form of data-snooping.

multiple post hoc tests

This is where we must be cautious about doing so many tests – we could end up with huge risk of Type I error.

If we use the Bonferroni correction that we saw for multiple planned comparisons on more than 3 tests, the significance level would be vanishingly small.
This would make it nearly impossible to detect significant differences.
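A quick sketch shows how fast the Bonferroni-adjusted level shrinks when every pair of groups is compared. `bonferroni_alpha` is a hypothetical helper name, and the original significance level of 0.05 is an assumption for the example.

```python
from math import comb


def bonferroni_alpha(alpha: float, n_comparisons: int) -> float:
    """Original significance level divided by the number of comparisons."""
    return alpha / n_comparisons


for n_groups in (3, 4, 5, 6):
    n_pairs = comb(n_groups, 2)  # number of all possible pairwise comparisons
    print(f"{n_groups} groups -> {n_pairs:2d} comparisons -> "
          f"adjusted level {bonferroni_alpha(0.05, n_pairs):.4f}")
```

Going from three groups to six raises the number of pairwise comparisons from 3 to 15, so the per-comparison level drops by a factor of five.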

For this reason, slightly more forgiving tests like Scheffé’s correction, Dunn’s or Tukey’s post-hoc tests are more popular.

There are many different post-hoc tests out there, and the choice of which one researchers use is often a matter of convention in their area of research.


Now we shall take a look at how to conduct post hoc tests using Scheffé’s correction.

In this example, we will test all pairwise comparisons.

The Scheffé technique involves adjusting the F-test result, rather than adjusting the significance level.

The way it works is the same as the planned contrast procedure, except for the very end.

Before we compare the F-test result to the cutoff score, we divide the F value by the overall degrees of freedom between, or the number of groups minus one.

Thus, we keep the significance level at the original level, but divide the calculated F by overall degrees of freedom between from the overall ANOVA.
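A minimal sketch of this procedure for all pairwise comparisons follows. The data are made up, and `sum_sq` and `scheffe_pairwise` are hypothetical helper names. The text does not say which degrees of freedom the cutoff uses; the sketch assumes the usual Scheffé convention of the overall ANOVA cutoff with (k − 1, N − k) degrees of freedom, here 3.89 from a standard F table for a significance level of 0.05 with 2 and 12 degrees of freedom.

```python
from itertools import combinations
from statistics import mean


def sum_sq(xs):
    """Sum of squared deviations of xs from their own mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def scheffe_pairwise(groups):
    """Return {(name_a, name_b): pairwise F divided by overall df between}."""
    k = len(groups)
    n_total = sum(len(g) for g in groups.values())
    ms_within = sum(sum_sq(g) for g in groups.values()) / (n_total - k)
    df_between_overall = k - 1

    adjusted = {}
    for name_a, name_b in combinations(groups, 2):
        a, b = groups[name_a], groups[name_b]
        grand_mean = mean(list(a) + list(b))
        ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in (a, b))
        f_pair = (ss_between / 1) / ms_within  # same as a planned contrast
        adjusted[(name_a, name_b)] = f_pair / df_between_overall  # Scheffé step
    return adjusted


# Hypothetical scores (illustrative numbers only).
groups = {
    "pop": [4, 5, 6, 5, 5],
    "classical": [8, 7, 9, 8, 8],
    "silence": [6, 7, 6, 7, 6],
}
F_CUTOFF = 3.89  # assumed: tabled F, alpha = 0.05, df = (2, 12)

adjusted = scheffe_pairwise(groups)
for pair, f_adj in adjusted.items():
    print(pair, f"adjusted F = {f_adj:.2f}", "significant" if f_adj > F_CUTOFF else "n.s.")
```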



Multiple Testing Correction

Hypothesis Testing

Related notes on hypothesis testing are also recorded here: Normal Distribution & Chi-squared Distribution & t distribution & F-distribution


The basic approach in hypothesis testing is to state a null hypothesis, denoted $H_0$, and then the alternative hypothesis of interest, denoted $H_1$ or $H_A$.



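$H_0$: the person is not ill (not what we are interested in); $H_1$: the person is ill (what we are interested in).
Compare this person's physiological measurements with sample data from people already known to be healthy or ill, and compute the $p$-value. The significance level is usually set at $\alpha = 0.05$; if the $p$-value is below $0.05$, the observed result would be a low-probability event under $H_0$. By the small-probability principle, rather than believe that such an unlikely event occurred, the more reasonable choice is to reject the null hypothesis and conclude that the person is ill. Otherwise we fail to reject the null hypothesis, meaning there is not enough evidence to conclude that the person is ill.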


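  1. Statistical significance: the level of risk we accept of rejecting the null hypothesis when it is in fact true, also called the probability level.
  2. $p$-value: the probability, assuming the null hypothesis is true, of obtaining the observed sample result or a more extreme one. It is a key measure of statistical significance.
  3. Significance level $\alpha$: the probability of wrongly rejecting the null hypothesis when it is true. It can also be read as the risk attached to the decision made in a hypothesis test.
  4. Instead of computing the $p$-value, one can compute the test statistic and check, for the chosen significance level, whether it falls in the rejection region, and so decide whether to reject the null hypothesis. The test statistic is less intuitive than the $p$-value, so results are usually stated in terms of the $p$-value.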
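The two decision routes in the list above, $p$-value against $\alpha$ and test statistic against the rejection region, can be compared in a small sketch. It assumes a two-sided z-test with known population standard deviation, which is a choice made for the example and not something the notes specify. The numbers and the name `z_test_two_sided` are hypothetical.

```python
from statistics import NormalDist


def z_test_two_sided(sample_mean, mu0, sigma, n, alpha=0.05):
    """Return (z statistic, two-sided p-value, critical z) for a z-test."""
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    z = (sample_mean - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    return z, p_value, z_crit


# Hypothetical measurement: 64 people, sample mean 103, null mean 100, sigma 10.
z, p_value, z_crit = z_test_two_sided(sample_mean=103, mu0=100, sigma=10, n=64)

reject_by_p = p_value < 0.05        # route 1: compare the p-value with alpha
reject_by_stat = abs(z) > z_crit    # route 2: statistic in the rejection region
print(f"z = {z:.2f}, p = {p_value:.4f}, critical z = {z_crit:.2f}")
print("reject H0:", reject_by_p, reject_by_stat)
```

Both routes always give the same decision; they are two views of the same comparison.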


  • Type I error, also called an $\alpha$ error
  • Type II error, also called a $\beta$ error
  • FP: false positive, a Type I error
  • FN: false negative, a Type II error
  • TP: true positive
  • TN: true negative




Multiple Hypothesis Testing, FWER, and FDR


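  • $m$: the number of hypothesis tests
  • $m_0$: the number of true null hypotheses
  • $m - m_0$: the number of true alternative hypotheses
  • $V$: the number of false positives
  • $S$: the number of true positives
  • $U$: the number of true negatives
  • $T$: the number of false negatives
  • $R = V + S$: the number of rejected null hypotheses


So $R = V + S$ is the number of discoveries, $V$ is the number of false discoveries, and $S$ is the number of true discoveries.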




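Because $V$, $S$, $U$, and $T$ are all random variables across the $m$ tests, the FDR has to be expressed as an expectation. In addition, when $R = 0$ we set $Q = 0$, where $Q = V/R$ is the proportion of false discoveries among the rejections. To cover this case, $\mathrm{FDR} = E[V/R \mid R > 0] \cdot P\{R > 0\}$. Informally, the FDR can be thought of as $Q = V/R = V/(V+S)$.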
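Since $V$ and $R$ are random, one way to get a feel for these quantities is to simulate them. The sketch below estimates the probability of at least one false positive (the FWER) and the expected false discovery proportion (the FDR) when no correction is applied. Every setting is an assumption chosen for illustration: 20 one-sided z-tests, 15 true nulls, a shift of 3 standard errors for the true alternatives, and the function name `simulate_error_rates`.

```python
import random
from statistics import NormalDist


def simulate_error_rates(m=20, m0=15, effect=3.0, alpha=0.05, n_sims=2000, seed=0):
    """Estimate FWER = P(V >= 1) and FDR = E[Q] for uncorrected tests."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-sided rejection cutoff
    runs_with_false_positive = 0
    q_total = 0.0
    for _ in range(n_sims):
        # True nulls: z ~ N(0, 1). A rejection here is a false positive (V).
        v = sum(rng.gauss(0.0, 1.0) > z_crit for _ in range(m0))
        # True alternatives: z ~ N(effect, 1). A rejection is a true positive (S).
        s = sum(rng.gauss(effect, 1.0) > z_crit for _ in range(m - m0))
        r = v + s
        runs_with_false_positive += v >= 1
        q_total += v / r if r > 0 else 0.0  # Q is defined as 0 when R = 0
    return runs_with_false_positive / n_sims, q_total / n_sims


fwer_hat, fdr_hat = simulate_error_rates()
print(f"estimated FWER = {fwer_hat:.3f}, estimated FDR = {fdr_hat:.3f}")
```

The estimated FDR can never exceed the estimated FWER, because $Q$ is at most 1 and is nonzero only in runs where $V \geq 1$.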





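There are several ways to control the FWER, the most classic being the Bonferroni correction. There are likewise several ways to control the FDR, the most classic being the Benjamini–Hochberg procedure.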




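Take a practical example. Suppose there is a reagent for diagnosing AIDS whose accuracy has been validated at 99% (one false positive in every 100 diagnoses). For one person being tested (a single test), this accuracy is sufficient. For a hospital (multiple tests), however, it is far from enough, because for every 10,000 people tested, about 100 people without AIDS would be misdiagnosed as having it. This is clearly unacceptable. So in multiple testing, without any control, the probability of committing a Type I error grows rapidly with the number of hypothesis tests.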
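The arithmetic behind this example can be written out, under the simplifying assumptions that all 10,000 people tested do not have the disease and that the tests are independent. Here $V$ is the number of false positives and $m_0 = 10000$ is the number of true nulls.

```latex
E[V] = m_0 \cdot \alpha = 10000 \times 0.01 = 100,
\qquad
P(V \geq 1) = 1 - (1 - 0.01)^{10000} \approx 1
```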



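FWER control is rather conservative: it works mainly by reducing the number of false positives (Type I errors), but in doing so it also lowers the TDR (true discovery rate).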




Setup: across $m$ hypothesis tests, denote the null hypotheses by $H_1, H_2, \ldots, H_m$ and the corresponding $p$-values by $p_1, p_2, \ldots, p_m$, and fix a significance level $\alpha$.

Family-wise error rate(FWER)——Bonferroni correction

The Bonferroni correction is a correction method used in statistics for multiple comparisons, named after the Italian mathematician Carlo Emilio Bonferroni.

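Let $H_1, \ldots, H_m$ be a family of hypotheses and $p_1, \ldots, p_m$ the corresponding $p$-values. Let $m$ be the total number of null hypotheses and $m_0$ the number of null hypotheses that are actually true. The family-wise error rate (FWER) is the probability of rejecting at least one true null hypothesis, that is, of making at least one Type I error. The Bonferroni correction rejects every null hypothesis with $p_i \leq \frac{\alpha}{m}$. After applying it, the FWER satisfies $\text{FWER} \leq \alpha$. This follows from Boole's inequality, with the union and the sum running over the $m_0$ true null hypotheses:

$$\text{FWER} = P\left\{\bigcup_{i=1}^{m_0}\left(p_i \leq \frac{\alpha}{m}\right)\right\} \leq \sum_{i=1}^{m_0} P\left(p_i \leq \frac{\alpha}{m}\right) \leq m_0 \frac{\alpha}{m} \leq \alpha$$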
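The rejection rule itself is a one-line comparison. Here is a minimal sketch; the $p$-values are made up and `bonferroni_reject` is a hypothetical helper name.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H_i whenever p_i <= alpha / m, where m is the number of tests."""
    m = len(p_values)
    threshold = alpha / m
    return [p <= threshold for p in p_values]


# Hypothetical p-values from five tests; the threshold is 0.05 / 5 = 0.01.
p_values = [0.001, 0.008, 0.012, 0.041, 0.2]
decisions = bonferroni_reject(p_values)
print(decisions)  # [True, True, False, False, False]
```

Three of these five $p$-values fall below the unadjusted level of 0.05, but only two survive the correction.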




False discovery rate(FDR)——Benjamini–Hochberg procedure

False discovery rate

The false discovery rate (FDR) refines how multiple hypothesis tests are evaluated.

$\mathrm{FDR} = Q_e = \mathrm{E}[Q]$, where $\mathrm{E}$ denotes expectation, $Q = V/R = V/(V+S)$, $V$ is the number of null hypotheses wrongly rejected, and $R$ is the number of null hypotheses rejected. When $R = 0$, $Q$ is taken to be $0$. In a single expression, $\mathrm{FDR} = \mathrm{E}[V/R \mid R > 0] \cdot \mathrm{P}(R > 0)$.


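Compared with FWER correction (for example the Bonferroni correction, which aims to allow not even a single false positive), FDR correction applies a looser criterion. As a result, FDR correction gains statistical power at the cost of more Type I errors (rejecting a null hypothesis that should have been retained).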

Benjamini–Hochberg procedure

The Benjamini–Hochberg procedure, abbreviated BH, starts by sorting all the $p$-values in ascending order, denoted $p_{(1)}, p_{(2)}, \ldots, p_{(m)}$, with corresponding null hypotheses $H_{(1)}, H_{(2)}, \ldots, H_{(m)}$.
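In its commonly stated form, the procedure then finds the largest rank $k$ with $p_{(k)} \leq \frac{k}{m}\alpha$ and rejects $H_{(1)}, \ldots, H_{(k)}$. The sketch below implements that step-up rule; the $p$-values are made up and `benjamini_hochberg_reject` is a hypothetical helper name.

```python
def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Return a reject/retain flag per hypothesis, in the original input order."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Hypothetical p-values, deliberately unsorted.
p_values = [0.2, 0.012, 0.001, 0.041, 0.008]
decisions = benjamini_hochberg_reject(p_values)
print(decisions)  # [False, True, True, False, True]
```

With these five values the per-rank thresholds are 0.01, 0.02, 0.03, 0.04, and 0.05, so the three smallest $p$-values are rejected. A Bonferroni threshold of 0.01 would reject only two of them.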








References

How to understand Family-wise error rate (FWER) and False discovery rate (FDR) in plain terms




Family-wise error rate

False discovery rate

Beginner Statistics for Psychology

