sklearn实战-乳腺癌细胞数据挖掘(博主亲自录视频)

https://study.163.com/course/introduction.htm?courseId=1005269003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

医药统计项目QQ:231469242

fisher's exact test算法来自超几何分布

python代码

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.fisher_exact.html

在线工具

graphpad公司已经把桌面软件做成网页版本。fisher's exact test和卡方类似,但准确性更高,因为卡方分析的p是估计值,不是准确值。除非某人指定你用卡方分析,否则用fisher's exact test。推荐用双边检查。

http://graphpad.com/quickcalcs/contingency1/

计算结果

维基百科解释

https://en.wikipedia.org/wiki/Fisher's_exact_test

Fisher's exact test is a statistical significance test used in the analysis of contingency tables.[1][2][3] Although in practice it is employed when sample sizes are small, it is valid for all sample sizes. It is named after its inventor, Ronald Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis (e.g., P-value) can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests.

Fisher is said to have devised the test following a comment from Dr. Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup. He tested her claim in the "lady tasting tea" experiment.[4]

Purpose and scope

A teapot, a creamer and teacup full of tea with milk—can a taster tell if the milk went in first?

The test is useful for categorical data that result from classifying objects in two different ways; it is used to examine the significance of the association (contingency) between the two kinds of classification. So in Fisher's original example, one criterion of classification could be whether milk or tea was put in the cup first; the other could be whether Dr. Bristol thinks that the milk or tea was put in first. We want to know whether these two classifications are associated—that is, whether Dr. Bristol really can tell whether milk or tea was poured in first. Most uses of the Fisher test involve, like this example, a 2 × 2 contingency table. The p-value from the test is computed as if the margins of the table are fixed, i.e. as if, in the tea-tasting example, Dr. Bristol knows the number of cups with each treatment (milk or tea first) and will therefore provide guesses with the correct number in each category. As pointed out by Fisher, this leads under a null hypothesis of independence to a hypergeometric distribution of the numbers in the cells of the table.

With large samples, a chi-squared test can be used in this situation. However, the significance value it provides is only an approximation, because the sampling distribution of the test statistic that is calculated is only approximately equal to the theoretical chi-squared distribution. The approximation is inadequate when sample sizes are small, or the data are very unequally distributed among the cells of the table, resulting in the cell counts predicted on the null hypothesis (the “expected values”) being low. The usual rule of thumb for deciding whether the chi-squared approximation is good enough is that the chi-squared test is not suitable when the expected values in any of the cells of a contingency table are below 5, or below 10 when there is only one degree of freedom (this rule is now known to be overly conservative[5]). In fact, for small, sparse, or unbalanced data, the exact and asymptotic p-values can be quite different and may lead to opposite conclusions concerning the hypothesis of interest.[6][7] In contrast the Fisher exact test is, as its name states, exact as long as the experimental procedure keeps the row and column totals fixed, and it can therefore be used regardless of the sample characteristics. It becomes difficult to calculate with large samples or well-balanced tables, but fortunately these are exactly the conditions where the chi-squared test is appropriate.

For hand calculations, the test is only feasible in the case of a 2 × 2 contingency table. However the principle of the test can be extended to the general case of an m × n table,[8][9] and some statistical packages provide a calculation (sometimes using a Monte Carlo method to obtain an approximation) for the more general case.[10]

Example

For example, a sample of teenagers might be divided into male and female on the one hand, and those that are and are not currently studying for a statistics exam on the other. We hypothesize, for example, that the proportion of studying individuals is higher among the women than among the men, and we want to test whether any difference of proportions that we observe is significant. The data might look like this:

     Men      Women   Row total
Studying 1 9 10
  Not-studying   11 3 14
Column total 12 12 24

The question we ask about these data is: knowing that 10 of these 24 teenagers are studiers, and that 12 of the 24 are female, and assuming the null hypothesis that men and women are equally likely to study, what is the probability that these 10 studiers would be so unevenly distributed between the women and the men? If we were to choose 10 of the teenagers at random, what is the probability that 9 or more of them would be among the 12 women, and only 1 or fewer from among the 12 men?

Before we proceed with the Fisher test, we first introduce some notations. We represent the cells by the letters a, b, c and d, call the totals across rows and columns marginal totals, and represent the grand total by n. So the table now looks like this:

     Men      Women   Row Total
Studying a b a + b
  Non-studying   c d c + d
Column Total a + c b + d a + b + c + d (=n)

Fisher showed that the probability of obtaining any such set of values was given by the hypergeometric distribution:

p = ( a + b a ) ( c + d c ) ( n a + c ) = ( a + b ) !   ( c + d ) !   ( a + c ) !   ( b + d ) ! a !     b !     c !     d !     n ! {\displaystyle p={\frac {\displaystyle {{a+b} \choose {a}}\displaystyle {{c+d} \choose {c}}}{\displaystyle {{n} \choose {a+c}}}}={\frac {(a+b)!~(c+d)!~(a+c)!~(b+d)!}{a!~~b!~~c!~~d!~~n!}}}

where ( n k ) {\displaystyle {\tbinom {n}{k}}} is the binomial coefficient and the symbol ! indicates the factorial operator. With the data above, this gives:

p = ( 10 1 ) ( 14 11 ) / ( 24 12 ) = 10 !   14 !   12 !   12 ! 1 !   9 !   11 !   3 !   24 ! ≈ 0.001346076 {\displaystyle p={{\tbinom {10}{1}}{\tbinom {14}{11}}}/{\tbinom {24}{12}}={\tfrac {10!~14!~12!~12!}{1!~9!~11!~3!~24!}}\approx 0.001346076}

The formula above gives the exact hypergeometric probability of observing this particular arrangement of the data, assuming the given marginal totals, on the null hypothesis that men and women are equally likely to be studiers. To put it another way, if we assume that the probability that a man is a studier is P, the probability that a woman is a studier is p, and we assume that both men and women enter our sample independently of whether or not they are studiers, then this hypergeometric formula gives the conditional probability of observing the values a, b, c, d in the four cells, conditionally on the observed marginals (i.e., assuming the row and column totals shown in the margins of the table are given). This remains true even if men enter our sample with different probabilities than women. The requirement is merely that the two classification characteristics—gender, and studier (or not)—are not associated.

For example, suppose we knew probabilities P , Q , p , q {\displaystyle P,Q,p,q} with P + Q = p + q = 1 {\displaystyle P+Q=p+q=1} such that (male studier, male non-studier, female studier, female non-studier) had respective probabilities ( P p , P q , Q p , Q q ) {\displaystyle (Pp,Pq,Qp,Qq)} for each individual encountered under our sampling procedure. Then still, were we to calculate the distribution of cell entries conditional given marginals, we would obtain the above formula in which neither p nor P occurs. Thus, we can calculate the exact probability of any arrangement of the 24 teenagers into the four cells of the table, but Fisher showed that to generate a significance level, we need consider only the cases where the marginal totals are the same as in the observed table, and among those, only the cases where the arrangement is as extreme as the observed arrangement, or more so. (Barnard's test relaxes this constraint on one set of the marginal totals.) In the example, there are 11 such cases. Of these only one is more extreme in the same direction as our data; it looks like this:

     Men      Women   Row Total
Studying 0 10 10
  Non-studying   12 2 14
Column Total 12 12 24

For this table (with extremely unequal studying proportions) the probability is p = ( 10 0 ) ( 14 12 ) / ( 24 12 ) ≈ 0.000033652 {\displaystyle {p={\tbinom {10}{0}}{\tbinom {14}{12}}}/{\tbinom {24}{12}}\approx 0.000033652} .

In order to calculate the significance of the observed data, i.e. the total probability of observing data as extreme or more extreme if the null hypothesis is true, we have to calculate the values of p for both these tables, and add them together. This gives a one-tailed test, with p approximately 0.001346076 + 0.000033652 = 0.001379728. For example, in the R statistical computing environment, this value can be obtained as fisher.test(rbind(c(1,9),c(11,3)), alternative="less")$p.value. This value can be interpreted as the sum of evidence provided by the observed data—or any more extreme table—for the null hypothesis (that there is no difference in the proportions of studiers between men and women). The smaller the value of p, the greater the evidence for rejecting the null hypothesis; so here the evidence is strong that men and women are not equally likely to be studiers.

For a two-tailed test we must also consider tables that are equally extreme, but in the opposite direction. Unfortunately, classification of the tables according to whether or not they are 'as extreme' is problematic. An approach used by the fisher.test function in R is to compute the p-value by summing the probabilities for all tables with probabilities less than or equal to that of the observed table. In the example here, the 2-sided p-value is twice the 1-sided value—but in general these can differ substantially for tables with small counts, unlike the case with test statistics that have a symmetric sampling distribution.

As noted above, most modern statistical packages will calculate the significance of Fisher tests, in some cases even where the chi-squared approximation would also be acceptable. The actual computations as performed by statistical software packages will as a rule differ from those described above, because numerical difficulties may result from the large values taken by the factorials. A simple, somewhat better computational approach relies on a gamma function or log-gamma function, but methods for accurate computation of hypergeometric and binomial probabilities remains an active research area.

Controversies

Despite the fact that Fisher's test gives exact p-values, some authors have argued that it is conservative, i.e. that its actual rejection rate is below the nominal significance level.[11][12][13] The apparent contradiction stems from the combination of a discrete statistic with fixed significance levels.[14][15] To be more precise, consider the following proposal for a significance test at the 5%-level: reject the null hypothesis for each table to which Fisher's test assigns a p-value equal to or smaller than 5%. Because the set of all tables is discrete, there may not be a table for which equality is achieved. If α e {\displaystyle \alpha _{e}} is the largest p-value smaller than 5% which can actually occur for some table, then the proposed test effectively tests at the α e {\displaystyle \alpha _{e}} -level. For small sample sizes, α e {\displaystyle \alpha _{e}} might be significantly lower than 5%.[11][12][13] While this effect occurs for any discrete statistic (not just in contingency tables, or for Fisher's test), it has been argued that the problem is compounded by the fact that Fisher's test conditions on the marginals.[16] To avoid the problem, many authors discourage the use of fixed significance levels when dealing with discrete problems.[14][15]

The decision to condition on the margins of the table is also controversial.[17][18] The p-values derived from Fisher's test come from the distribution that conditions on the margin totals. In this sense, the test is exact only for the conditional distribution and not the original table where the margin totals may change from experiment to experiment. It is possible to obtain an exact p-value for the 2x2 table when the margins are not held fixed. Barnard's test, for example, allows for random margins. However, some authors[14][15][18] (including, later, Barnard himself)[14] have criticized Barnard's test based on this property. They argue that the marginal success total is an (almost[15]) ancillary statistic, containing (almost) no information about the tested property.

The act of conditioning on the marginal success rate from a 2x2 table can be shown to ignore some information in the data about the unknown odds ratio.[19] The argument that the marginal totals are (almost) ancillary implies that the appropriate likelihood function for making inferences about this odds ratio should be conditioned on the marginal success rate.[19] Whether this lost information is important for inferential purposes is the essence of the controversy.[19]

Alternatives

An alternative exact test, Barnard's exact test, has been developed and proponents of it suggest that this method is more powerful, particularly in 2 × 2 tables. Another alternative is to use maximum likelihood estimates to calculate a p-value from the exact binomial or multinomial distributions and reject or fail to reject based on the p-value.[citation needed]

For stratified categorical data the Cochran–Mantel–Haenszel test must be used instead of Fisher's test.

Choi et al.[19] propose a p-value derived from the likelihood ratio test based on the conditional distribution of the odds ratio given the marginal success rate. This p-value is inferentially consistent with classical tests of normally distributed data as well as with likelihood ratios and support intervals based on this conditional likelihood function. It is also readily computable.[20]

See also

  • Barnard's exact test
  • Chi squared test

http://blog.sina.com.cn/s/blog_5ecfd9d90100oa4c.html

并没有自己写,看了一篇小短文,觉得不错,直接拿来了。
引自----應數博黃士峰
費雪(R. A. Fisher, 1890-1962, 英國統計學家)在他1935 年發表的一篇文章
中提到一個有趣的實驗。有一天喝下午茶時,費雪的一位女同事說到,下午茶的
調製順序對其風味有很大的影響,把茶加進牛奶裡或者是把牛奶倒入茶中,兩者
喝起來口感完全不同,她可以完全地分辨出來,而她個人的偏好是先放牛奶再加
入茶。為了證實這位女同事的說法,費雪做了ㄧ個小實驗。首先他調製了8 杯下
午茶,並且告訴女同事其中4 杯是先放牛奶再加入茶,另外4 杯則是先倒茶再加
入牛奶,接著費雪隨機的拿了一杯茶讓女同事試喝,並請她猜猜這杯茶中是先放
牛奶還是先放茶,等女同事回答後,再隨機的拿第二杯請她試喝,直到全部試喝
完畢。假設女同事的答案如下表:

猜測先放的飲料
實際先放的飲料        牛奶        茶            合計
     牛奶                   a = 3      b = 1         a+b = 4
     茶                      c = 1       d = 3         c+d = 4
    合計                   a+c = 4   b+d = 4     n = 8
由表中得知費雪的女同事猜對了6 杯,但這是否只是碰巧她當天運氣好呢?能否
以統計方法讓我們由實驗的數據判斷費雪的女同事能否真的分辨出兩種不同的
下午茶調製方法?另一個值得注意的事情是,在這個實驗中樣本數非常的小,是
否有適當的統計方法可以幫助我們做出統計推論呢?
        對於以上的問題, 費雪在1935 的文章中提出一個小樣本的檢定方法,稱為
Fisher’s Exact Test。這個方法的前提是固定邊際分布,也就是a+b 、c+d、a+c、
與b+d 的值不變,接著計算在此假設下,得到實際觀測值的機率:
p(a=3|a+b=c+d=a+c=b+d=4)=0.229

其中P(A | B) 是在B 事件發生的前提下A 事件發生的條件機率.
。這個數值的意義是如果費雪的女同事只是隨便亂猜的話,最後
會得到上述表格中結果的機率為0.229。接著我們可更進一步的算出比表格中更
極端情況( 在此指費雪的女同事猜得更準時) 的機率:
P(a = 4 | a + b = c + d = a + c = b + d = 4) = 0.014 , 因此我們可以再計算出
P(a ≥ 3 | a + b = c + d = a + c = b + d = 4) = 0.229 + 0.014 = 0.243 ,此即為統計上單
邊檢定的p 值(p-value)。如果此p 值小於型一誤的機率(the probability of type one
error,一般常設定為0.05 或0.01),則我們可以說有充分的證據支持費雪的女同
事所言不虛,反之則是證據仍不足以證實她的說法。費雪在1935 年的文章中並
沒有告訴我們當時這位女同事到底猜對了幾杯,但根據費雪女兒後來的說法是,
這位女同事當時全答對了(Agresti 2002, p.92)。
上述的小故事看起來好像只是一則有趣的實驗,對統計的實際應用有什麼影
響呢?實際上,Fisher’s Exact Test 不僅簡單實用,尤其適用於樣本數小的情況,
而且這個方法並不需要假設資料母體的分布,換句話說,Fisher’s Exact Test 屬於
一種無母數的檢定方法。以下再舉幾個實際應用的例子:
一、 現在大學生流行節食,一般大眾的猜測是:大學女生節食的比例比男生高。
因此我們設定的虛無假設為H0:大學女生與男生節食的比例相同,對立假
設為Ha:大學女生節食的比例比男生高。倘若經過調查後我們得到以下的
資料:
          男生          女生    合計
節食    a = 1        b = 9    a+b = 10
未節食  c = 11     d = 3    c+d = 14
合計     a+c = 12  b+d = 12  n = 24
則我們可根據上表中的資料算出單邊檢定的p 值
0.0014.

若設定型一誤的機率為0.01,則我們可根據此p 值推論:上表中的資料提
供了足夠的證據拒絕虛無假設,並支持對立假設,即大學女生節食的比例
比男生高。
二、 某家醫院統計了罹患某種罕見疾病時,男性與女性的死亡與存活資料如下:
男性 女性 合計
存活 a = 9 b = 4 a+b = 13
死亡 c = 1 d = 10 c+d = 11
合計 a+c = 10 b+d = 14 n = 24
根據上表,女性死亡率似乎遠比男性為高,因此設定虛無假設為H0:女性
與男性患者死亡率相同,對立假設為Ha:女性患者死亡率較男性高。則我
們可根據上表中的資料算出單邊檢定的p 值
0.0042.
若設定型一誤的機率為0.01,則我們可根據此p 值推論:女性患者死亡率
較男性高。
參考文獻
Agresti, A. (2002). Categorical Data Analysis. John Wiley, New Jersey.
Fisher, R. A. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh.
另外,wiki上解释的也很好:
Fisher's exact test[1][2][3] is a statistical significance test used in the analysis of contingency tables where sample sizes are small. It is named after its inventor, R. A. Fisher, and is one of a class of exact tests, so called because the significance of the deviation from a null hypothesis can be calculated exactly, rather than relying on an approximation that becomes exact in the limit as the sample size grows to infinity, as with many statistical tests. Fisher is said to have devised the test following a comment from Muriel Bristol, who claimed to be able to detect whether the tea or the milk was added first to her cup; see lady tasting tea.

练习
Atlantic出现8只鲸鱼和1只鲨鱼,Indian出现2只鲸鱼和5只鲨鱼。因为频率数1和2小于5,所以用Fisher exact test比卡方更准确
python代码测试,Fisher exact test比卡方更准确
import scipy.stats as stats
stats.fisher_exact(obs)
Out[3]: (20.0, 0.034965034965034919)stats.chi2_contingency(obs)
Out[4]:
(3.8095238095238093, 0.050961936967763424, 1, array([[ 5.625,  4.375],[ 3.375,  2.625]]))

虽然频率小于5,但有时卡方和fish exact test结果一致
     Men      Women   Row total
Studying 1 9 10
  Not-studying   11 3 14
Column total 12 12 24
obs = np.array([[1, 9], [11, 3]])stats.fisher_exact(obs)
Out[7]: (0.030303030303030304, 0.002759456185220084)stats.chi2_contingency(obs)
Out[8]:
(8.4000000000000004, 0.0037522101008738498, 1, array([[ 5.,  5.],[ 7.,  7.]]))

猜測先放的飲料
實際先放的飲料        牛奶        茶            合計
     牛奶                   a = 3      b = 1         a+b = 4
     茶                      c = 1       d = 3         c+d = 4
    合計                   a+c = 4   b+d = 4     n = 8

卡方和Fisher检测表示都无显著关系,历史上夫人全部猜对
obs = np.array([[3,1], [1, 3]])stats.fisher_exact(obs)
Out[14]: (9.0, 0.48571428571428543)stats.chi2_contingency(obs)
Out[15]:
(0.5, 0.47950012218695337, 1, array([[ 2.,  2.],[ 2.,  2.]]))

男生          女生    合計
節食    a = 1        b = 9    a+b = 10
未節食  c = 11     d = 3    c+d = 14
合計     a+c = 12  b+d = 12  n = 24

卡方和Fisher检测表示都有显著关系

obs = np.array([[1,9], [11, 3]])stats.fisher_exact(obs)
Out[17]: (0.030303030303030304, 0.002759456185220084)stats.chi2_contingency(obs)
Out[18]:
(8.4000000000000004, 0.0037522101008738498, 1, array([[ 5.,  5.],[ 7.,  7.]]))

python风控评分卡建模和风控常识

https://study.163.com/course/introduction.htm?courseId=1005214003&utm_campaign=commission&utm_source=cp-400000000398149&utm_medium=share

fisher's exact test相关推荐

  1. Fisher's exact test( 费希尔精确检验)

    Fisher's exact test[1][2][3] is a statistical significance test used in the analysis ofcontingency t ...

  2. 知识小结------数据分析------Fisher‘s exact test(费希尔检测)

    系列知识小结目录 Cox比例风险回归模型(proportional hazards model) Fisher's exact test费希尔检测 系列知识小结目录 前言 一.Fisher's exa ...

  3. R语言Fisher检验的workspace问题

    问题描述: stage<-matrix(c(9,14,2,0,4,1,6,12,8,5,13,2,7,15,7,4,13,4,1,18,5),3,7) fisher.test(stage) 软件 ...

  4. 抛弃P值,选择更直观的A/B测试!

    ↑↑↑关注后"星标"Datawhale 每日干货 & 每月组队学习,不错过 Datawhale干货 作者:Dr.Robert Kubler,译者:张峰 在两个选项中做出选择 ...

  5. JGG | 这么漂亮的Venn网络竟然可以一步在线绘制?

    2021年8月2日,JGG在线发表了中国中医科学院黄璐琦院士团队和中国科学院遗传与发育生物学研究所刘永鑫高级工程师合作题为"EVenn: Easy to create repeatable ...

  6. Cell子刊:人类微生物组参考基因集中的单体基因

    文章目录 人类肠道和口腔微生物组中遗传基因全貌 热心肠日报 摘要 主要结果 表1. 文章中名词定义 图1. 综合分析口腔和肠道微生物组 图2.口腔和肠道微生物组的遗传多样性 图3. 口腔和肠道微生物组 ...

  7. NBT:MaPS-seq测序方法揭示肠道微生物空间分布

    文章目录 肠道微生物生物地理学的空间宏基因组特征 热心肠日报 摘要 主要结果 图1 MaPS-seq测序和质量控制 图2:小鼠末端结肠微生物群落分布的空间结构 图3:微生物群落在小鼠肠道空间分布调查 ...

  8. R的一些统计分析包工具

    [另附]:R语言简明笔记系列 1.1. 描述性统计分析 1.1.1. 描述性统计量的计算 1.1.1.1. summary() > vars<-c("mpg",&quo ...

  9. 【ACR2015】依那西普按需维持治疗策略有效抑制RA骨破坏进展

    标签: 类风湿关节炎; 依那西普; 药物减停; 复发重治 对RA疾病复发患者, 依那西普按需治疗与持续足剂量治疗是否存在疗效差异? Inui K, et al. ACR 2015. Presentat ...

最新文章

  1. BIG T 下学期选修_python作业
  2. android6.0源码分析之AMS服务源码分析
  3. C++如何使用puff()的示例
  4. mysql中as用法
  5. orb-slam2 代码逻辑梳理
  6. java2组随机数的共通数_java随机数产生-指数分布 正态分布 等
  7. Microsoft Visio 2010 简体中文版官方版
  8. apt 安装软件出现“无法定位软件包”的问题
  9. 给设计团队管理者的6个建议
  10. DirectX12(D3D12)基础教程(十七)——让小姐姐翩翩起舞(3D骨骼动画渲染【1】)
  11. mysql库函数说明_MySQL 数据库函数库
  12. deepin20.6设置默认的root密码
  13. 开源自助建站系统源码完整源码+搭建教程 傻瓜式一键建站系统源码
  14. Oracle统计分析
  15. AutoCAD 2008运行提示正在验证许可解决办法
  16. 为特斯拉车主构思设计的一款刹车踩踏数据监测器
  17. Nginx 反向代理的知识再温习一下
  18. 博途中用的是c吗_S7-1500系列博途中使用SCL语言编程方法简介
  19. 腾讯、阿里场外“旁观”,谁将杀进千亿美元SaaS圈?
  20. 数字信号处理课程设计:语音信号采集与滤波处理系统设计与实现 (MATLAB)——(一)

热门文章

  1. 21世纪科技生态面临第三次全球标准
  2. Hinton口中破解宇宙终极秘密的GPT-3厉害在哪?这有篇涂鸦详解
  3. 追加10亿!腾讯宣布设立15亿元“战疫基金”
  4. 亚马逊、谷歌和微软寸土必争的新战场
  5. 深度丨建立合资公司,深度参与运营:详解景驰的无人驾驶生意经
  6. 内卷时代,互联网人相亲有多难?|漫画
  7. 远程办公要降薪?谷歌带头:最高下降 25%
  8. 程序员出身,身价 340 亿!没有他,可能我们刷不了 B 站
  9. Luogu P4859「已经没有什么好害怕的了」
  10. Fault,Error与Failure的联系与区别