Performing a Bayesian Analysis by Hand

Introduction

Bayesian analysis lets you extract more insight from your data than a purely frequentist approach. In this post, I will walk you through a real-life example of how a Bayesian analysis can be performed. I will demonstrate what can go wrong when an unsuitable prior is chosen, and we will see how to summarize the results. To follow this post, you should be familiar with the foundations of Bayesian statistics and with Bayes' theorem.

Scenario

As an example analysis, we will discuss a real-life problem from a physics lab. No worries, you don't need any physics knowledge for it. We want to determine the efficiency of a particle detector. A particle detector is a sensor that may produce a measurable signal when certain particles traverse it. The efficiency of the detector we want to evaluate is the probability that the detector actually registers a traversing particle. To measure this, we place the detector between two other sensors in a sandwich-like structure. If we measure a signal in both the top and bottom sensors, we know that a particle must also have traversed the detector in the middle. A picture of the experimental setup is shown below.

For the measurement, we count the number N of particles traversing the setup in a certain time (as reported by the top and bottom sensors) as well as the number r of signals measured in our detector. For this example, we assume N=100 and r=98.

Frequentist Result

In a frequentist approach, we could use our measured data to conclude that the efficiency of the detector is e = r/N = 98%. This gives us only a point estimate. If we want to answer more complicated questions, for example "What is the probability that the efficiency of the detector is above 99%?", we need a more complex analysis.

The Bayesian Analysis

The goal of the Bayesian approach is to derive the full posterior probability distribution of the detector efficiency given our data, p(e|D). To do so, we need Bayes' theorem:

$$p(e \mid D) = \frac{p(D \mid e)\, p(e)}{p(D)}$$

Bayes' theorem

We will go over the different terms in the following.

Probability Model / Likelihood: p(D|e)

As always in a Bayesian analysis, we need to select a model that describes the process we want to analyse, called the likelihood. For our problem, we can interpret the efficiency as the probability of a success, with r successes out of N trials. This class of problems, similar to determining the chance of a coin showing heads, can be modeled by the binomial distribution:

$$p(D \mid e) = \binom{N}{r}\, e^{r}\, (1 - e)^{N - r}$$

Binomial distribution
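
The binomial likelihood can be evaluated directly for the example values. The following is a minimal sketch using only the Python standard library; the function name `binomial_likelihood` is chosen here for illustration and does not come from the original notebook.

```python
from math import comb


def binomial_likelihood(e, N, r):
    """p(D|e): probability of r detected particles out of N for efficiency e."""
    return comb(N, r) * e**r * (1.0 - e) ** (N - r)


N, r = 100, 98  # example values from the text

# Evaluate the likelihood for a few candidate efficiencies.
for e in (0.90, 0.95, 0.98, 0.99):
    print(f"e = {e:.2f}  ->  p(D|e) = {binomial_likelihood(e, N, r):.4f}")
```

Among the candidate efficiencies tried here, the likelihood is largest at e = r/N = 0.98, which matches the frequentist point estimate.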

Prior: p(e)

Next, we need to define a prior. Here, we start with the most trivial choice, a flat prior. We will discuss the influence of a different prior choice later.

Marginal Likelihood: p(D)

The marginal likelihood is the denominator in Bayes' theorem. Luckily, it is just a normalization constant and does not depend on the efficiency. We can determine it numerically by finding the constant that normalizes the posterior to 1.
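
One way to carry out this normalization is to evaluate likelihood times prior on a fine grid of efficiencies and integrate numerically. The sketch below assumes a midpoint-rule integration on 10,000 grid cells; the grid size and all variable names are illustrative choices, not taken from the original notebook.

```python
from math import comb

N, r = 100, 98          # example values from the text
STEPS = 10_000          # number of grid cells (an arbitrary choice)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]  # cell midpoints in (0, 1)


def likelihood(e):
    """Binomial probability model p(D|e)."""
    return comb(N, r) * e**r * (1.0 - e) ** (N - r)


def flat_prior(e):
    """Flat prior: every efficiency between 0 and 1 is equally likely."""
    return 1.0


# Numerator of Bayes' theorem on the grid.
unnormalized = [likelihood(e) * flat_prior(e) for e in grid]

# Marginal likelihood p(D): integral of the numerator (midpoint rule).
p_D = sum(unnormalized) * de

# Normalized posterior density p(e|D).
posterior = [u / p_D for u in unnormalized]

mode = grid[max(range(STEPS), key=posterior.__getitem__)]
print(f"p(D) = {p_D:.6f}, posterior integral = {sum(posterior) * de:.6f}, mode = {mode:.4f}")
```

For a flat prior the integral is also known in closed form, p(D) = 1/(N+1), which gives a convenient check of the numerical result.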

Results

Now we can calculate the posterior following Bayes' theorem.

Posterior distribution p(e|D) for N=100, r=98 with a flat prior

You can see that the most probable value is e=98%, which is the same as the intuitive frequentist result. But we obtained much more information here, because we have the full posterior probability distribution. For example, we can see that the distribution is asymmetric: an efficiency below 97% has a higher probability than an efficiency above 99%, and we can assign exact numbers to both probabilities. How did we get this extra information? We put more information in: we assumed that the behaviour of the detector follows a binomial distribution, and we assumed a flat prior distribution.
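
The two probabilities mentioned above can be obtained by integrating the posterior over the corresponding ranges. This is a minimal sketch under the same assumptions as the text (binomial model, flat prior); the grid resolution and the helper `probability_between` are illustrative and not part of the original notebook.

```python
N, r = 100, 98          # example values from the text
STEPS = 20_000          # number of grid cells (an arbitrary choice)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]  # cell midpoints in (0, 1)

# With a flat prior the posterior is proportional to e^r (1-e)^(N-r);
# the binomial coefficient cancels in the normalization.
kernel = [e**r * (1.0 - e) ** (N - r) for e in grid]
norm = sum(kernel) * de
posterior = [k / norm for k in kernel]


def probability_between(low, high):
    """Posterior probability that the efficiency lies in [low, high)."""
    return sum(p for e, p in zip(grid, posterior) if low <= e < high) * de


p_below_97 = probability_between(0.0, 0.97)
p_above_99 = probability_between(0.99, 1.0)
print(f"P(e < 97%) = {p_below_97:.3f}")
print(f"P(e > 99%) = {p_above_99:.3f}")
```

The same helper answers the question posed in the frequentist section, namely how probable an efficiency above 99% is given the data and the stated assumptions.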

Influence of the Prior

The prior plays an important role in a Bayesian analysis. In the following, we will see what happens if we change it. Let's say we find a statement in the datasheet of the detector that the efficiency can be assumed to be Gaussian distributed around 98% with a standard deviation of s=1%. In an older version of the datasheet, however, we find that the efficiency of the detector should be Gaussian distributed around 92% with the same standard deviation of s=1%. We incorporate this information into the posterior by changing the prior accordingly. The results for both cases can be seen below.

Posterior and prior probabilities for different priors

Here, the posteriors are shown in the top panel and the corresponding priors in the panel below. The black curve shows the previous result with the flat prior. When we change the prior to a Gaussian with mean m=98% (green), the posterior again peaks at 98%, and it is narrower than with the flat prior, so our estimate is more certain. The prior supports our data. While an efficiency below 95% still had a reasonable probability in the case of the flat prior, it is nearly excluded now. Taking the prior from the old datasheet, which peaks at an efficiency of 92% (red), we can see that the posterior differs significantly from the other two. The most probable value is around 93%, completely changing our results. How can this be? The problem is that with a wrong prior, the data and the prior are not consistent with each other. This example shows that choosing a wrong prior may have catastrophic consequences. It is important to always evaluate the consistency between the prior, the probability model and the posterior.
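
The effect of the three priors on the most probable value can be reproduced with a small grid calculation. The sketch below assumes the Gaussian priors are simply restricted to efficiencies between 0 and 1 and left unnormalized, since any constant factor cancels in the posterior; the helper names are illustrative and not taken from the original notebook.

```python
from math import exp

N, r = 100, 98          # example values from the text
STEPS = 10_000          # number of grid cells (an arbitrary choice)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]  # cell midpoints in (0, 1)


def gaussian_prior(mean, sigma):
    """Unnormalized Gaussian prior on the efficiency (restricted to the grid)."""
    return lambda e: exp(-0.5 * ((e - mean) / sigma) ** 2)


def posterior_on_grid(prior):
    """Normalized posterior density for the binomial model and a given prior."""
    unnormalized = [e**r * (1.0 - e) ** (N - r) * prior(e) for e in grid]
    norm = sum(unnormalized) * de
    return [u / norm for u in unnormalized]


def mode(density):
    """Grid point with the highest density."""
    return grid[max(range(STEPS), key=density.__getitem__)]


priors = {
    "flat": lambda e: 1.0,
    "Gaussian, mean 98%": gaussian_prior(0.98, 0.01),
    "Gaussian, mean 92%": gaussian_prior(0.92, 0.01),
}

modes = {name: mode(posterior_on_grid(prior)) for name, prior in priors.items()}
for name, value in modes.items():
    print(f"{name:>20}: most probable efficiency = {value:.3f}")
```

With the prior centred at 92%, the most probable value is pulled down to roughly 93%, far from what the data alone suggest, which is the inconsistency described above.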

Incorporating Additional Measurements

Another use for a prior is to incorporate an additional measurement. Imagine your colleague measured the same detector and obtained N1=300 and r1=280. How can we correctly make use of this data? We can use it as a prior for our analysis. The results are shown below.

Using a previous measurement as a prior

You can see the posterior distributions of our measurement (black) and of the colleague's measurement (blue), both obtained with flat priors. If we use our colleague's posterior as the prior for our analysis, we arrive at the green curve. The most probable value of the green curve lies between those of the other two curves, but closer to the blue curve because our colleague's measurement contains more data. Also, the green curve is slightly narrower than the other two. Side note: as a function of e, the resulting posterior again has the binomial form e^r (1-e)^(N-r), which is a beta distribution. Moreover, we arrive at the same posterior as if we redid the analysis with a flat prior and a single measurement with N = 100 + 300 = 400 and r = 98 + 280 = 378. As you would expect, the result is also independent of the order in which the two measurements were performed. This can easily be verified analytically.
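
The equivalence between updating sequentially and pooling the data can also be checked numerically. The following sketch compares three grid posteriors: colleague first, our measurement first, and a single pooled measurement. All names and the grid size are illustrative assumptions, not taken from the original notebook.

```python
N_OURS, R_OURS = 100, 98    # our measurement (from the text)
N1, R1 = 300, 280           # the colleague's measurement (from the text)
STEPS = 10_000              # number of grid cells (an arbitrary choice)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]  # cell midpoints in (0, 1)


def kernel(N, r):
    """Binomial likelihood on the grid, without the constant coefficient."""
    return [e**r * (1.0 - e) ** (N - r) for e in grid]


def normalize(values):
    """Scale values so that they integrate to 1 over the grid."""
    norm = sum(values) * de
    return [v / norm for v in values]


def update(prior, N, r):
    """Posterior on the grid: likelihood times prior, normalized."""
    return normalize([l * p for l, p in zip(kernel(N, r), prior)])


flat = [1.0] * STEPS

# Colleague's posterior used as the prior for our measurement ...
sequential = update(update(flat, N1, R1), N_OURS, R_OURS)
# ... the same two updates in reverse order ...
reverse = update(update(flat, N_OURS, R_OURS), N1, R1)
# ... and one pooled measurement with a flat prior.
pooled = update(flat, N_OURS + N1, R_OURS + R1)

max_diff_pooled = max(abs(a - b) for a, b in zip(sequential, pooled))
max_diff_order = max(abs(a - b) for a, b in zip(sequential, reverse))
combined_mode = grid[max(range(STEPS), key=pooled.__getitem__)]

print(f"max difference sequential vs pooled: {max_diff_pooled:.2e}")
print(f"max difference between orderings:    {max_diff_order:.2e}")
print(f"most probable efficiency (combined): {combined_mode:.4f}")
```

The agreement follows from the likelihoods simply multiplying: e^98 (1-e)^2 times e^280 (1-e)^20 equals e^378 (1-e)^22, regardless of the order.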

How to Present Your Results

After calculating the posterior, we now want to present our results. Ideally, you show the full posterior distribution, as it contains the complete information. However, this is not always possible, and you may want to summarize it with a few values. Often you want to give a point estimate along with an interval that summarizes the width of the distribution. There are different ways to do this. Popular choices include:

  • Expectation value & standard deviation
  • Median & central interval
  • Mode & smallest interval

Additionally, we need to select how much probability should be included in the intervals (often used: 68% or 90%).

For a normal distribution, all three choices of point estimate and interval give identical results. For a skewed distribution like ours, however, they differ.

Different combinations of point estimates and corresponding intervals in order to summarize a posterior

You can see that all three choices lead to different results. None of them is right or wrong; it is just important to report exactly which point estimate you used and how you constructed your interval. Here we could say, for example, that the most probable value (mode) of our posterior is 0.98 with a credible interval of 0.962-0.991 (the smallest interval containing 68% of the probability).
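
The three summaries can be computed from a grid posterior as follows. This is a sketch under the assumptions used throughout (binomial model, flat prior, 68% probability content); the smallest interval is built by collecting grid cells in order of decreasing density, which works here because the posterior has a single peak. Names and grid size are illustrative, not taken from the original notebook.

```python
N, r = 100, 98          # example values from the text
MASS = 0.68             # probability content of the intervals
STEPS = 20_000          # number of grid cells (an arbitrary choice)
de = 1.0 / STEPS
grid = [(i + 0.5) * de for i in range(STEPS)]  # cell midpoints in (0, 1)

kernel = [e**r * (1.0 - e) ** (N - r) for e in grid]
norm = sum(kernel) * de
posterior = [k / norm for k in kernel]

# 1) Expectation value and standard deviation.
mean = sum(e * p for e, p in zip(grid, posterior)) * de
std = (sum((e - mean) ** 2 * p for e, p in zip(grid, posterior)) * de) ** 0.5


# 2) Median and central interval (equal probability in both tails).
def quantile(q):
    """Smallest grid point at which the cumulative probability reaches q."""
    cumulative = 0.0
    for e, p in zip(grid, posterior):
        cumulative += p * de
        if cumulative >= q:
            return e
    return grid[-1]


median = quantile(0.5)
central = (quantile((1.0 - MASS) / 2.0), quantile((1.0 + MASS) / 2.0))

# 3) Mode and smallest interval: add cells from the highest density
#    downwards until they contain the requested probability.
order = sorted(range(STEPS), key=posterior.__getitem__, reverse=True)
mode = grid[order[0]]
selected, accumulated = [], 0.0
for i in order:
    selected.append(i)
    accumulated += posterior[i] * de
    if accumulated >= MASS:
        break
smallest = (grid[min(selected)], grid[max(selected)])

print(f"mean   = {mean:.3f} +/- {std:.3f}")
print(f"median = {median:.3f}, central interval  = {central[0]:.3f}-{central[1]:.3f}")
print(f"mode   = {mode:.3f}, smallest interval = {smallest[0]:.3f}-{smallest[1]:.3f}")
```

Because the posterior is skewed towards lower efficiencies, the mean and median fall below the mode, and the central interval extends further down than the smallest interval.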

Conclusions

We performed a full Bayesian analysis, from setting up a probability model and choosing appropriate priors all the way to summarizing the posterior with a point estimate and a corresponding interval. The advantage of the Bayesian approach is that we gain access to the full posterior probability distribution. It also let us incorporate prior knowledge, such as the manufacturer's information or a previous measurement, in an elegant way. Furthermore, we saw that a wrong prior may have a significant influence on our results, which highlights that a careful choice of the prior and an evaluation of its consistency with the probability model and the posterior are of high importance in any Bayesian analysis.

A Python notebook producing the numbers and figures can be found here.

Translated from: https://towardsdatascience.com/performing-a-bayesian-analysis-by-hand-c589ab992916
